23 results for High-dimensional Data
in Consorci de Serveis Universitaris de Catalunya (CSUC), Spain
Abstract:
The analysis of the effect of genes and environmental factors on the development of complex diseases is a major statistical and computational challenge. Among the various data-mining methodologies proposed for interaction analysis, one of the most popular is the Multifactor Dimensionality Reduction method, MDR (Ritchie et al. 2001). The strategy of this method is to reduce the multifactor dimension to one by pooling the different genotypes into two risk groups: high and low. Despite its demonstrated usefulness, the MDR method has some drawbacks: the excessive pooling of genotypes may leave some important interactions undetected, and it does not allow adjustment for main effects or for confounding variables. In this article we illustrate the limitations of the MDR strategy and of other non-parametric approaches, and we show the advantages of using parametric methodologies to analyse interactions in case-control studies where adjustment for confounding variables and main effects is required. We propose a new methodology, a parametric version of the MDR method, which we call Model-Based Multifactor Dimensionality Reduction (MB-MDR). The proposed methodology aims to identify specific genotypes that are associated with the disease and allows adjustment for marginal effects and confounding variables. The new methodology is illustrated with data from the Spanish Bladder Cancer Study.
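As an illustration of the dimensionality-reduction step described above (a sketch, not the authors' MB-MDR implementation), the following Python snippet pools multilocus genotype cells into high- and low-risk groups by comparing each cell's case/control ratio with the overall ratio; the function name and toy data are hypothetical.

```python
# Minimal sketch of the MDR-style risk pooling step described in the abstract.
from collections import Counter

def mdr_risk_labels(genotypes, status):
    """genotypes: one multilocus genotype (tuple) per subject;
    status: list of 0/1 (control/case). Returns dict cell -> 'high'/'low'."""
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    # Overall case:control ratio used as the pooling threshold
    threshold = sum(status) / max(len(status) - sum(status), 1)
    labels = {}
    for cell in set(cases) | set(controls):
        ratio = cases[cell] / max(controls[cell], 1)
        labels[cell] = "high" if ratio > threshold else "low"
    return labels

# Toy example: two SNPs coded 0/1/2
genos = [(0, 1), (0, 1), (2, 2), (2, 2), (1, 0)]
stat = [1, 1, 0, 1, 0]
print(mdr_risk_labels(genos, stat))
```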
Abstract:
In this paper we introduce a highly efficient reversible data hiding system. It is based on dividing the image into tiles and shifting the histogram of each image tile between its minimum and maximum frequency. Data are then inserted at the pixel level with the largest frequency to maximize the data hiding capacity. The scheme exploits the special properties of medical images, where the histograms of their non-overlapping image tiles mostly peak around a few grey values while the rest of the spectrum is mainly empty. The zeros (or minima) and peaks (maxima) of the histograms of the image tiles are then relocated to embed the data, so the grey values of some pixels are modified. High capacity, high fidelity, reversibility and multiple data insertions are the key requirements of data hiding in medical images, and we show how the histograms of image tiles of medical images can be exploited to achieve these requirements. Compared with data hiding methods applied to the whole image, our scheme can yield a 30%-200% capacity improvement with better image quality, depending on the medical image content. Additional advantages of the proposed method include hiding data in the regions of non-interest and better exploitation of spatial masking.
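The abstract describes a histogram-shifting embedding applied per image tile. The sketch below is a minimal, hypothetical rendering of that idea for a single tile, assuming the empty (or minimum) bin lies above the peak bin and omitting the bookkeeping needed when the minimum bin is not truly empty; it is not the authors' code.

```python
import numpy as np

def embed_tile(tile, bits):
    """tile: 2-D uint8 array; bits: iterable of 0/1.
    Returns the marked tile and the (peak, zero) pair needed for extraction."""
    hist = np.bincount(tile.ravel(), minlength=256)
    peak = int(hist.argmax())                        # gray value with largest frequency
    zero = int(hist[peak + 1:].argmin()) + peak + 1  # empty (or minimum) bin above the peak
    out = tile.astype(np.int32)
    out[(out > peak) & (out < zero)] += 1            # shift to open an empty bin at peak + 1
    bit_iter = iter(bits)
    for r, c in np.argwhere(tile == peak):           # one bit per peak-valued pixel
        try:
            out[r, c] = peak + next(bit_iter)        # bit 0 -> peak, bit 1 -> peak + 1
        except StopIteration:
            break
    return out.astype(np.uint8), (peak, zero)

# Toy usage: a 4x4 tile and a short payload
tile = np.array([[100, 100, 101, 105],
                 [100, 100, 102, 110],
                 [100, 103, 100, 100],
                 [100, 100, 104, 100]], dtype=np.uint8)
marked, key = embed_tile(tile, [1, 0, 1, 1])
```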
Abstract:
Background Computerised databases of primary care clinical records are widely used for epidemiological research. In Catalonia, the Information System for the Development of Research in Primary Care (SIDIAP) aims to promote the development of research based on high-quality validated data from primary care electronic medical records. Objective The purpose of this study is to create and validate a scoring system (Registry Quality Score, RQS) that will enable all primary care practices (PCPs) to be selected as providers of research-usable data based on the completeness of their registers. Methods Diseases that were likely to be representative of common diagnoses seen in primary care were selected for RQS calculations. The observed/expected cases ratio was calculated for each disease. Once we had obtained an estimated value for this ratio for each of the selected conditions, we added up the ratios calculated for each condition to obtain a final RQS. Rate comparisons between observed and published prevalences of diseases not included in the RQS calculations (atrial fibrillation, diabetes, obesity, schizophrenia, stroke, urinary incontinence and Crohn's disease) were used to set the RQS cut-off which will enable researchers to select PCPs with research-usable data. Results Apart from Crohn's disease, all prevalences were the same as those published from the RQS fourth quintile (60th percentile) onwards. This RQS cut-off provided a total population of 1 936 443 (39.6% of the total SIDIAP population). Conclusions SIDIAP is highly representative of the population of Catalonia in terms of geographical, age and sex distributions. We report the usefulness of rate comparison as a valid method to establish research-usable data within primary care electronic medical records.
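A minimal sketch of the Registry Quality Score computation as described above (observed/expected ratios summed over the selected conditions); the function name and toy figures are hypothetical, not taken from the study.

```python
def registry_quality_score(observed, expected):
    """observed / expected: dicts mapping condition -> case counts for one primary
    care practice (expected = practice population x published prevalence)."""
    return sum(observed[c] / expected[c] for c in expected if expected[c] > 0)

# Toy example with two made-up conditions
obs = {"hypertension": 180, "asthma": 40}
exp = {"hypertension": 200, "asthma": 50}
print(round(registry_quality_score(obs, exp), 2))  # 0.9 + 0.8 = 1.7
```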
Abstract:
Vagueness and high-dimensional data are usual features of current data sets. The paper presents an approach to identify conceptual structures in fuzzy three-dimensional data sets in order to obtain a conceptual hierarchy. We propose a fuzzy extension of Galois connections that allows us to prove an isomorphism theorem between fuzzy set closures, which is the basis for generating lattices (ordered sets).
Abstract:
Self-organizing maps (Kohonen 1997) are a type of artificial neural network developed to explore patterns in high-dimensional multivariate data. The conventional version of the algorithm uses the Euclidean metric in the adaptation of the model vectors, which in theory renders the whole methodology incompatible with non-Euclidean geometries. In this contribution we explore the two main aspects of the problem:
1. Whether the conventional approach using the Euclidean metric can yield valid results with compositional data.
2. Whether a modification of the conventional approach, replacing vector addition and scalar multiplication by the canonical operators in the simplex (i.e. perturbation and powering), can converge to an adequate solution.
Preliminary tests showed that both methodologies can be used on compositional data. However, the modified version of the algorithm performs worse than the conventional version, in particular when the data are pathological. Moreover, the conventional approach converges faster to a solution when the data are "well-behaved".
Key words: Self Organizing Map; Artificial Neural Networks; Compositional data
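As a concrete illustration of the modified adaptation step mentioned in point 2 (a sketch, not the authors' implementation), the snippet below defines perturbation and powering on the simplex and uses them in a SOM-style update of a model vector; the function names and learning-rate value are assumptions.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so its parts sum to 1 (closure onto the simplex)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturbation(x, y):
    """Simplex analogue of vector addition: component-wise product, re-closed."""
    return closure(np.asarray(x) * np.asarray(y))

def powering(alpha, x):
    """Simplex analogue of scalar multiplication: component-wise power, re-closed."""
    return closure(np.asarray(x, dtype=float) ** alpha)

def som_update_simplex(m, x, eta):
    """One SOM-style update of model vector m towards data point x, written with
    the simplex operations instead of m + eta * (x - m)."""
    diff = perturbation(x, 1.0 / np.asarray(m))   # x "minus" m in the simplex
    return perturbation(m, powering(eta, diff))

m = closure([0.2, 0.3, 0.5])
x = closure([0.4, 0.4, 0.2])
print(som_update_simplex(m, x, 0.1))
```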
Abstract:
Graphical displays which show inter-sample distances are important for the interpretation and presentation of multivariate data. Except when the displays are two-dimensional, however, they are often difficult to visualize as a whole. A device, based on multidimensional unfolding, is described for presenting some intrinsically high-dimensional displays in fewer, usually two, dimensions. This goal is achieved by representing each sample by a pair of points, say $R_i$ and $r_i$, so that a theoretical distance between the $i$-th and $j$-th samples is represented twice, once by the distance between $R_i$ and $r_j$ and once by the distance between $R_j$ and $r_i$. Self-distances between $R_i$ and $r_i$ need not be zero. The mathematical conditions for unfolding to exhibit symmetry are established. Algorithms for finding approximate fits, not constrained to be symmetric, are discussed and some examples are given.
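The following Python sketch is a generic least-squares illustration of the representation described above (each sample mapped to a pair of points $R_i$, $r_i$, with the theoretical distance $D_{ij}$ fitted by the distance between $R_i$ and $r_j$); it is not one of the algorithms discussed in the paper, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def unfold(D, dim=2, seed=0):
    """Find two point sets R and r in `dim` dimensions so that ||R_i - r_j||
    approximates D[i, j], by minimising a simple squared-error stress."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=2 * n * dim)

    def stress(v):
        R = v[: n * dim].reshape(n, dim)
        r = v[n * dim:].reshape(n, dim)
        fit = np.linalg.norm(R[:, None, :] - r[None, :, :], axis=-1)
        return ((fit - D) ** 2).sum()

    res = minimize(stress, x0, method="L-BFGS-B")
    v = res.x
    return v[: n * dim].reshape(n, dim), v[n * dim:].reshape(n, dim)

# Tiny example with a symmetric 3x3 distance matrix
D = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.5], [2.0, 1.5, 0.0]])
R, r = unfold(D)
```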
Abstract:
Within the last few years a new type of instrument, the Terrestrial Laser Scanner (TLS), has entered the commercial market. These devices make it possible to obtain a completely new type of spatial, three-dimensional data describing the object of interest. TLS instruments generate a type of data that requires special treatment. The appearance of this technique has made it possible to monitor the deformation of very large objects, such as the landslides investigated here, at a new level of quality. This change is especially visible in relation to the size and number of details that can be observed with the new method. In this context, the work presented here is oriented towards the recognition and characterization of the raw data received from TLS instruments, as well as the processing phases, tools and techniques needed to handle them. The main objectives are the definition and recognition of the problems related to the use of TLS data, the characterization of the quality of a single point generated by TLS, the description and investigation of a TLS processing approach for landslide deformation measurements that yields a 3D deformation characterization and, finally, the validation of the obtained results. These objectives are pursued through bibliographic study and research work, followed by several experiments that support the conclusions.
Abstract:
The paper proposes a numerical solution method for general equilibrium models with a continuum of heterogeneous agents, which combines elements of projection and of perturbation methods. The basic idea is to solve first for the stationary solution of the model, without aggregate shocks but with fully specified idiosyncratic shocks. Afterwards one computes a first-order perturbation of the solution in the aggregate shocks. This approach makes it possible to include a high-dimensional representation of the cross-sectional distribution in the state vector. The method is applied to a model of household saving with uninsurable income risk and liquidity constraints. The model includes not only productivity shocks, but also shocks to redistributive taxation, which cause substantial short-run variation in the cross-sectional distribution of wealth. If those shocks are operative, it is shown that a solution method based on very few statistics of the distribution is not suitable, while the proposed method can solve the model with high accuracy, at least for the case of small aggregate shocks. Techniques are discussed to reduce the dimension of the state space such that higher-order perturbations are feasible. Matlab programs to solve the model can be downloaded.
Abstract:
This work proposes novel network analysis techniques for multivariate time series. We define the network of a multivariate time series as a graph where vertices denote the components of the process and edges denote non-zero long-run partial correlations. We then introduce a two-step LASSO procedure, called NETS, to estimate high-dimensional sparse long-run partial correlation networks. This approach is based on a VAR approximation of the process and allows the long-run linkages to be decomposed into the contributions of the dynamic and contemporaneous dependence relations of the system. The large-sample properties of the estimator are analysed and we establish conditions for consistent selection and estimation of the non-zero long-run partial correlations. The methodology is illustrated with an application to a panel of U.S. blue chips.
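As an illustration of the VAR-based first step of a procedure like the one described (not the NETS estimator itself), the sketch below fits a sparse VAR equation by equation with an L1 penalty; function names and the penalty value are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_var_lasso(Y, p=1, alpha=0.1):
    """Regress each series on p lags of all series with an L1 penalty,
    equation by equation. Y is a (T, N) array; returns an (N, N*p) coefficient matrix."""
    T, N = Y.shape
    X = np.hstack([Y[p - k - 1: T - k - 1] for k in range(p)])  # lagged regressors
    target = Y[p:]
    coefs = np.zeros((N, N * p))
    for i in range(N):
        model = Lasso(alpha=alpha, fit_intercept=True)
        model.fit(X, target[:, i])
        coefs[i] = model.coef_
    return coefs

# Toy usage on random data
Y = np.random.default_rng(0).normal(size=(200, 5))
B = sparse_var_lasso(Y, p=1, alpha=0.05)
print((np.abs(B) > 1e-8).sum(), "non-zero VAR coefficients")
```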
Abstract:
We apply the theory of continuous time random walks (CTRWs) to study some aspects involving extreme events in financial time series. We focus our attention on the mean exit time (MET). We derive a general equation for this average and compare it with empirical results coming from high-frequency data of the U.S. dollar and Deutsche mark futures market. The empirical MET follows a quadratic law in the length of the return interval, which is consistent with the CTRW formalism.
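A minimal sketch of how an empirical mean exit time can be measured from a return series, assuming a symmetric interval of half-width L and restarting after each exit; with i.i.d. zero-mean steps the MET grows roughly quadratically in L, the behaviour reported above. The names and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def mean_exit_time(returns, L):
    """Average number of steps the cumulative return takes to leave [-L, L],
    restarting the walk after each exit."""
    times, t, x = [], 0, 0.0
    for r in returns:
        x += r
        t += 1
        if abs(x) >= L:
            times.append(t)
            t, x = 0, 0.0
    return np.mean(times) if times else np.nan

rng = np.random.default_rng(1)
steps = rng.normal(scale=0.01, size=200_000)   # synthetic zero-mean returns
for L in (0.01, 0.02, 0.04):
    print(L, mean_exit_time(steps, L))
```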
Abstract:
Lipoxygenases are non-heme iron enzymes essential in eukaryotes, where they catalyze the formation of the fatty acid hydroperoxides that are required by a large diversity of biological and pathological processes. In prokaryotes, most of which totally lack polyunsaturated fatty acids, the possible biological roles of lipoxygenases have remained obscure. In this study we report the crystallization of a lipoxygenase of Pseudomonas aeruginosa (Pa_LOX), the first from a prokaryote. High-resolution data have been acquired and are expected to yield structural clues to the questions addressed. In addition, a preliminary phylogenetic analysis using 14 sequences has confirmed the existence of this subfamily of bacterial lipoxygenases, on the one hand, and a greater diversity than in the corresponding eukaryotic ones, on the other. Finally, an evolutionary study of bacterial lipoxygenases on the same set of sequences shows a selection pressure of a basically purifying or neutral character, except for a single amino acid, which would have been selected after a positive selection event.
Abstract:
Information about the genomic coordinates and the sequence of experimentally identified transcription factor binding sites is found scattered under a variety of diverse formats. The availability of standard collections of such high-quality data is important to design, evaluate and improve novel computational approaches to identify binding motifs on promoter sequences from related genes. ABS (http://genome.imim.es/datasets/abs2005/index.html) is a public database of known binding sites identified in promoters of orthologous vertebrate genes that have been manually curated from the literature. We have annotated 650 experimental binding sites from 68 transcription factors and 100 orthologous target genes in human, mouse, rat or chicken genome sequences. Computational predictions and promoter alignment information are also provided for each entry. A simple and easy-to-use web interface facilitates data retrieval, allowing different views of the information. In addition, release 1.0 of ABS includes a customizable generator of artificial datasets based on the known sites contained in the collection and an evaluation tool to aid during the training and the assessment of motif-finding programs.
Abstract:
In this work, a new one-class classification ensemble strategy called the approximate polytope ensemble is presented. The main contribution of the paper is threefold. First, the geometrical concept of the convex hull is used to define the boundary of the target class defining the problem. Expansions and contractions of this geometrical structure are introduced in order to avoid over-fitting. Second, the decision whether a point belongs to the convex hull model in high-dimensional spaces is approximated by means of random projections and an ensemble decision process. Finally, a tiling strategy is proposed in order to model non-convex structures. Experimental results show that the proposed strategy is significantly better than state-of-the-art one-class classification methods on over 200 datasets.
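The sketch below illustrates the random-projection approximation of convex hull membership described above, in a simplified form (no expansion or contraction of the hull and a plain majority vote); it is not the authors' implementation and all names are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay

def fit_projected_hulls(X_target, n_proj=25, dim=2, seed=0):
    """Project the target class onto random low-dimensional subspaces and keep
    the convex hull (via a Delaunay triangulation) of each projection."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_proj):
        P = rng.normal(size=(X_target.shape[1], dim))   # random projection matrix
        models.append((P, Delaunay(X_target @ P)))
    return models

def predict_inlier(models, x, vote=0.5):
    """Accept x if it falls inside the projected hull in at least a fraction
    `vote` of the projections."""
    hits = sum(int(hull.find_simplex((x @ P).reshape(1, -1))[0] >= 0)
               for P, hull in models)
    return hits / len(models) >= vote

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))                    # toy 20-dimensional target class
models = fit_projected_hulls(X)
print(predict_inlier(models, X[0]))               # an in-sample point
print(predict_inlier(models, np.full(20, 10.0)))  # a far-away outlier
```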
Abstract:
Contamination of weather radar echoes by anomalous propagation (anaprop) mechanisms remains a serious issue in the quality control of radar precipitation estimates. Although significant progress has been made in identifying clutter due to anaprop, there is no unique method that solves the question of data reliability without removing genuine data. The work described here relates to the development of a software application that uses a numerical weather prediction (NWP) model to obtain the temperature, humidity and pressure fields and calculate the three-dimensional structure of the atmospheric refractive index, from which a physically based prediction of the incidence of clutter can be made. This technique can be used in conjunction with existing methods for clutter removal by modifying the parameters of detectors or filters according to the physical evidence for anomalous propagation conditions. The parabolic equation method (PEM) is a well established technique for solving the equations for beam propagation in a non-uniformly stratified atmosphere, but, although intrinsically very efficient, it is not sufficiently fast to be practicable for near real-time modelling of clutter over the entire area observed by a typical weather radar. We demonstrate a fast hybrid PEM technique that is capable of providing acceptable results in conjunction with a high-resolution terrain elevation model, using a standard desktop personal computer. We discuss the performance of the method and approaches for the improvement of the model profiles in the lowest levels of the troposphere.
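As a hedged illustration of the first processing step mentioned above, converting NWP pressure, temperature and humidity fields into refractivity, the snippet below uses the standard (Bean and Dutton) refractivity formula and the usual modified-refractivity correction for Earth curvature; it is a sketch, not the application's code, and the function names are assumptions.

```python
def refractivity(P_hPa, T_K, e_hPa):
    """Radio refractivity N (N-units) from total pressure P [hPa], temperature
    T [K] and water-vapour partial pressure e [hPa]."""
    return 77.6 * P_hPa / T_K + 3.73e5 * e_hPa / T_K ** 2

def modified_refractivity(N, height_m):
    """Modified refractivity M = N + 0.157 * h, with h the height in metres."""
    return N + 0.157 * height_m

print(refractivity(1013.0, 288.0, 10.0))    # roughly 318 N-units for a typical surface profile
print(modified_refractivity(318.0, 100.0))  # roughly 334 M-units at 100 m
```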