947 results for High-dimensional models
Abstract:
Let P be a probability distribution on q-dimensional space. The so-called Diaconis-Freedman effect means that for a fixed dimension d<
Abstract:
We focus on mixtures of factor analyzers from the perspective of a method for model-based density estimation from high-dimensional data, and hence for the clustering of such data. This approach enables a normal mixture model to be fitted to a sample of n data points of dimension p, where p is large relative to n. The number of free parameters is controlled through the dimension of the latent factor space. By working in this reduced space, it allows a model for each component-covariance matrix with complexity lying between that of the isotropic and full covariance structure models. We shall illustrate the use of mixtures of factor analyzers in a practical example that considers the clustering of cell lines on the basis of gene expressions from microarray experiments. (C) 2002 Elsevier Science B.V. All rights reserved.
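To make the parameter-count argument concrete, the sketch below compares the number of free parameters in one component-covariance matrix under a full structure and under a factor-analytic structure of the form BB^T + D, using the standard count pq + p - q(q-1)/2. The function names and the values of p and q are illustrative assumptions, not taken from the paper.

```python
def full_cov_params(p):
    """Free parameters in an unrestricted symmetric p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_cov_params(p, q):
    """Free parameters in a factor-analytic covariance B B^T + D.

    B is a p x q loading matrix, D is a p x p diagonal matrix, and
    q(q-1)/2 parameters are removed because B is only identified up
    to an orthogonal rotation.
    """
    return p * q + p - q * (q - 1) // 2


# Hypothetical sizes chosen only for illustration.
p, q = 1000, 4
n_full = full_cov_params(p)
n_factor = factor_cov_params(p, q)
print(f"full: {n_full}, factor-analytic (q={q}): {n_factor}")
```

With these illustrative sizes the factor-analytic structure needs roughly one percent of the parameters of the full structure, which is why the latent dimension q acts as the complexity control.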
Abstract:
Aspergillus lentulus, an Aspergillus fumigatus sibling species, is increasingly reported in corticosteroid-treated patients. Its clinical significance is unknown, but the fact that A. lentulus shows reduced antifungal susceptibility, mainly to voriconazole, is of serious concern. Heterologous expression of cyp51A from A. fumigatus and A. lentulus was performed in Saccharomyces cerevisiae to assess differences in the interaction of Cyp51A with the azole drugs. The absence of endogenous ERG11 was efficiently complemented in S. cerevisiae by the expression of either Aspergillus cyp51A allele. There was a marked difference between azole minimum inhibitory concentration (MIC) values of the clones expressing each Aspergillus spp. cyp51A. Saccharomyces cerevisiae clones expressing A. lentulus alleles showed higher MICs to all of the azoles tested, supporting the hypothesis that the intrinsic azole resistance of A. lentulus could be associated with Cyp51A. Homology models of A. fumigatus and A. lentulus Cyp51A protein based on the crystal structure of Cyp51p from Mycobacterium tuberculosis in complex with fluconazole were almost identical owing to their mutual high sequence identity. Molecular dynamics (MD) was applied to both three-dimensional protein models to refine the homology modelling and to explore possible differences in the Cyp51A-voriconazole interaction. After 20ns of MD modelling, some critical differences were observed in the putative closed form adopted by the protein upon voriconazole binding. A closer study of the A. fumigatus and A. lentulus voriconazole putative binding site in Cyp51A suggested that some major differences in the protein's BC loop could differentially affect the lock-up of voriconazole, which in turn could correlate with their different azole susceptibility profiles.
Abstract:
This thesis studies high-dimensional sequence models based on recurrent neural networks (RNNs) and their application to music and speech. Although in principle RNNs can represent the long-term dependencies and complex temporal dynamics of sequences of interest such as video, audio and natural language, they have not been used to their full potential since their introduction by Rumelhart et al. (1986a) because of the difficulty of training them efficiently by gradient descent. Recently, the successful application of Hessian-free optimization and other advanced training techniques has led to a resurgence of their use in several state-of-the-art systems. The work of this thesis is part of this development. The central idea is to exploit the flexibility of RNNs to learn a probabilistic description of sequences of symbols, that is, high-level information associated with the observed signals, which in turn can serve as a prior to improve the accuracy of information retrieval. For example, by modeling the evolution of groups of notes in polyphonic music, of chords in a harmonic progression, of phonemes in a spoken utterance or of individual sources in an audio mixture, we can significantly improve methods for polyphonic transcription, chord recognition, speech recognition and audio source separation, respectively. The practical application of our models to these tasks is detailed in the last four articles presented in this thesis. In the first article, we replace the output layer of an RNN with conditional restricted Boltzmann machines to describe much richer multimodal output distributions. In the second article, we evaluate and propose advanced methods for training RNNs.
In the last four articles, we examine different ways of combining our symbolic models with deep networks and non-negative matrix factorization, notably through products of experts, input/output architectures and generative frameworks that generalize hidden Markov models. We also propose and analyze efficient inference methods for these models, such as chronological greedy search, high-dimensional beam search, pruned beam search and gradient descent. Finally, we address the issues of label bias, teacher forcing, temporal smoothing, regularization and pre-training.
Abstract:
Simulations of ozone loss rates using a three-dimensional chemical transport model and a box model during recent Antarctic and Arctic winters are compared with experimental loss rates. The study focused on the Antarctic winter 2003, during which the first Antarctic Match campaign was organized, and on the Arctic winters 1999/2000 and 2002/2003. The maximum ozone loss rates retrieved by the Match technique for the winters and levels studied reached 6 ppbv/sunlit hour, and both types of simulations could generally reproduce the observations at the 2-sigma error bar level. In some cases, for example for the Arctic winter 2002/2003 at the 475 K level, excellent agreement within the 1-sigma standard deviation level was obtained. An overestimation was also found with the box model simulation at some isentropic levels for the Antarctic winter and the Arctic winter 1999/2000, indicating an overestimation of chlorine activation in the model. Loss rates in the Antarctic show signs of saturation in September, which have to be considered in the comparison. Sensitivity tests were performed with the box model in order to assess the impact of kinetic parameters of the ClO-Cl2O2 catalytic cycle and total bromine content on the ozone loss rate. These tests resulted in a maximum change in ozone loss rates of 1.2 ppbv/sunlit hour, generally in high solar zenith angle conditions. In some cases, better agreement was achieved with the fastest photolysis of Cl2O2 and an additional source of total inorganic bromine, but at the expense of overestimating the smaller ozone loss rates derived later in the winter.
Abstract:
We establish a fundamental equivalence between singular value decomposition (SVD) and functional principal components analysis (FPCA) models. The constructive relationship makes it possible to deploy the numerical efficiency of SVD to fully estimate the components of FPCA, even for extremely high-dimensional functional objects, such as brain images. As an example, a functional mixed effect model is fitted to high-resolution morphometric (RAVENS) images. The main directions of morphometric variation in brain volumes are identified and discussed.
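As a rough illustration of why SVD is attractive when the objects are very high-dimensional, the sketch below uses the textbook identity between ordinary PCA and the SVD of the centered data matrix. It is not the paper's FPCA construction: the function name `pca_via_svd`, the simulated data, and the sizes n and p are assumptions made for the example.

```python
import numpy as np


def pca_via_svd(X):
    """Return covariance eigenvalues and principal directions of X.

    X has one observation per row. The thin SVD of the centered
    matrix avoids ever forming the p x p covariance matrix.
    """
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / (X.shape[0] - 1)  # eigenvalues of the sample covariance
    return eigvals, Vt                 # rows of Vt are principal directions


# Simulated stand-in for discretized functional data: n objects, p grid points.
rng = np.random.default_rng(0)
n, p = 20, 500
X = rng.normal(size=(n, p))
eigvals, components = pca_via_svd(X)
```

The thin SVD here works on an n x p matrix and returns at most n components, so its cost is driven by the number of objects rather than by the number of grid points.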
Abstract:
Normal mixture models are often used to cluster continuous data. However, conventional approaches for fitting these models will have problems in producing nonsingular estimates of the component-covariance matrices when the dimension of the observations is large relative to the number of observations. In this case, methods such as principal components analysis (PCA) and the mixture of factor analyzers model can be adopted to avoid these estimation problems. We examine these approaches applied to the Cabernet wine data set of Ashenfelter (1999), considering the clustering of both the wines and the judges, and comparing our results with another analysis. The mixture of factor analyzers model proves particularly effective in clustering the wines, accurately classifying many of the wines by location.
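The singularity problem described above can be seen numerically: with n observations in p > n dimensions, the sample covariance matrix has rank at most n - 1 and therefore cannot be inverted. The following is a minimal sketch on simulated Gaussian data; the sizes n and p are arbitrary choices for the example and have nothing to do with the wine data set.

```python
import numpy as np

# Simulated data with fewer observations than dimensions (illustrative sizes).
rng = np.random.default_rng(1)
n, p = 10, 50
X = rng.normal(size=(n, p))

# p x p sample covariance; centering removes one degree of freedom,
# so its rank cannot exceed n - 1.
S = np.cov(X, rowvar=False)
rank = np.linalg.matrix_rank(S)
print(f"covariance is {p} x {p} but has rank {rank}")
```

Restricting the covariance structure, for instance through PCA or a factor-analytic model, is one way to obtain a nonsingular estimate in this regime.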
Abstract:
Hierarchical visualization systems are desirable because a single two-dimensional visualization plot may not be sufficient to capture all of the interesting aspects of complex high-dimensional data sets. We extend an existing locally linear hierarchical visualization system PhiVis [1] in several directions: (1) we allow for non-linear projection manifolds (the basic building block is the Generative Topographic Mapping, GTM), (2) we introduce a general formulation of hierarchical probabilistic models consisting of local probabilistic models organized in a hierarchical tree, (3) we describe folding patterns of the low-dimensional projection manifold in high-dimensional data space by computing and visualizing the manifold's local directional curvatures. Quantities such as magnification factors [3] and directional curvatures are helpful for understanding the layout of the nonlinear projection manifold in the data space and for further refinement of the hierarchical visualization plot. Like PhiVis, our system is statistically principled and is built interactively in a top-down fashion using the EM algorithm. We demonstrate the principle of the approach on a complex 12-dimensional data set and mention possible applications in the pharmaceutical industry.
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-08
Abstract:
The Persian Gulf region is of great global importance for economic and political reasons, chiefly its oil resources and maritime exports. The dominant geophysical phenomenon affecting water circulation in this region is the monsoon, which stretches from the African coast to halfway along the Red Sea, affects all coasts of the Persian Gulf, and extends eastward to the Indian Ocean. Other essential factors in the water circulation of the region are net evaporation (several meters per year), high density, and high salinity. This article considers the effects of wind stress and evaporation on water circulation in the region and discusses model equations for wind forcing, density, pressure gradient, and bottom friction for the Persian Gulf.
Abstract:
The aim of the present study was to propose and evaluate the use of factor analysis (FA) in obtaining latent variables (factors) that represent a set of pig traits simultaneously, for use in genome-wide selection (GWS) studies. We used crosses between outbred F2 populations of Brazilian Piau X commercial pigs. Data were obtained on 345 F2 pigs, genotyped for 237 SNPs, with 41 traits. FA allowed us to obtain four biologically interpretable factors: "weight", "fat", "loin", and "performance". These factors were used as dependent variables in multiple regression models of genomic selection (Bayes A, Bayes B, RR-BLUP, and Bayesian LASSO). The use of FA is presented as an interesting alternative to select individuals for multiple variables simultaneously in GWS studies; accuracy measurements of the factors were similar to those obtained when the original traits were considered individually. The similarities between the top 10% of individuals selected by the factor, and those selected by the individual traits, were also satisfactory. Moreover, the estimated marker effects for the traits were similar to those found for the relevant factor.
Abstract:
The main purpose of this thesis is to go beyond two usual assumptions that accompany theoretical analysis in spin-glasses and inference: the i.i.d. (independently and identically distributed) hypothesis on the noise elements and the finite rank regime. The first one has been present since the earliest days of spin-glasses. The second one instead concerns the inference viewpoint. Disordered systems and Bayesian inference have a well-established relation, evidenced by their continuous cross-fertilization. The thesis makes use of techniques coming both from the rigorous mathematical machinery of spin-glasses, such as the interpolation scheme, and from Statistical Physics, such as the replica method. The first chapter contains an introduction to the Sherrington-Kirkpatrick and spiked Wigner models. The first is a mean field spin-glass where the couplings are i.i.d. Gaussian random variables. The second instead amounts to establishing the information theoretical limits in the reconstruction of a fixed low rank matrix, the “spike”, blurred by additive Gaussian noise. In chapters 2 and 3 the i.i.d. hypothesis on the noise is broken by assuming a noise with inhomogeneous variance profile. In spin-glasses this leads to multi-species models. The inferential counterpart is called spatial coupling. All the previous models are usually studied in the Bayes-optimal setting, where everything is known about the generating process of the data. In chapter 4 instead we study the spiked Wigner model where the prior on the signal to reconstruct is ignored. In chapter 5 we analyze the statistical limits of a spiked Wigner model where the noise is no longer Gaussian, but drawn from a random matrix ensemble, which makes its elements dependent. The thesis ends with chapter 6, where the challenging problem of high-rank probabilistic matrix factorization is tackled. Here we introduce a new procedure called "decimation" and we show that it is theoretically possible to perform matrix factorization through it.
Abstract:
Feature selection is a central problem in machine learning and pattern recognition. On large datasets (in terms of dimension and/or number of instances), using search-based or wrapper techniques can be computationally prohibitive. Moreover, many filter methods based on relevance/redundancy assessment also take a prohibitively long time on high-dimensional datasets. In this paper, we propose efficient unsupervised and supervised feature selection/ranking filters for high-dimensional datasets. These methods use low-complexity relevance and redundancy criteria, applicable to supervised, semi-supervised, and unsupervised learning, being able to act as pre-processors for computationally intensive methods to focus their attention on smaller subsets of promising features. The experimental results, with up to 10^5 features, show the time efficiency of our methods, with lower generalization error than state-of-the-art techniques, while being dramatically simpler and faster.
Abstract:
...In this thesis I investigate the "curse of dimensionality" through the concept of distance concentration. I show that this effect can be described in the data model by the pairwise covariance coefficients of the marginal distributions. In addition, I compare 10 prototype-based clustering algorithms using 800,000 clustering results from artificially generated data sets. I explore how and why clustering algorithms are influenced by the number of features. Using the clustering results, I also examine how well 5 of the most popular cluster quality measures estimate the actual cluster quality.
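Distance concentration can be illustrated with a small simulation: as the number of features grows, the nearest and farthest neighbours of a query point become almost equally distant. The sketch below assumes independent uniform features and Euclidean distance; the function name `relative_contrast`, the sample size, and the dimensions are choices made for the example and are not the data model analysed in the thesis.

```python
import numpy as np


def relative_contrast(n_points, dim, rng):
    """(d_max - d_min) / d_min for one random query point.

    Points and query are drawn uniformly from the unit hypercube.
    Values near zero mean that all distances look alike.
    """
    X = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    d = np.linalg.norm(X - query, axis=1)
    return (d.max() - d.min()) / d.min()


rng = np.random.default_rng(0)
contrasts = {dim: relative_contrast(500, dim, rng) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={c:.3f}")
```

In this independent-feature setting the contrast shrinks markedly between 2 and 1000 dimensions; correlated features, as studied through the covariance coefficients mentioned above, can behave differently.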