13 results for Latent variables
at Duke University
Abstract:
Gaussian factor models have proven widely useful for parsimoniously characterizing dependence in multivariate data. There is a rich literature on their extension to mixed categorical and continuous variables, using latent Gaussian variables or through generalized latent trait models accommodating measurements in the exponential family. However, when generalizing to non-Gaussian measured variables, the latent variables typically influence both the dependence structure and the form of the marginal distributions, complicating interpretation and introducing artifacts. To address this problem, we propose a novel class of Bayesian Gaussian copula factor models which decouple the latent factors from the marginal distributions. A semiparametric specification for the marginals based on the extended rank likelihood yields straightforward implementation and substantial computational gains. We provide new theoretical and empirical justifications for using this likelihood in Bayesian inference. We propose new default priors for the factor loadings and develop efficient parameter-expanded Gibbs sampling for posterior computation. The methods are evaluated through simulations and applied to a dataset in political science. The models in this paper are implemented in the R package bfa.
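The decoupling described in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's inference method and not the API of the bfa package: it assumes a single latent factor, and the function name `sample_copula_factor`, the loadings, and the two example marginals (an exponential and a three-level ordinal variable) are all hypothetical. The latent factor drives only the dependence between the Gaussian scores; each marginal is imposed separately through its inverse CDF.

```python
import math
import random
from statistics import NormalDist


def sample_copula_factor(loadings, inverse_cdfs, rng):
    """Draw one observation from a one-factor Gaussian copula model (illustrative)."""
    std_normal = NormalDist()
    eta = rng.gauss(0.0, 1.0)  # single latent factor shared by all variables
    draws = []
    for lam, inv_cdf in zip(loadings, inverse_cdfs):
        z = lam * eta + rng.gauss(0.0, 1.0)  # latent score with variance lam^2 + 1
        # Standardize so the score is N(0, 1), then map to a uniform value.
        u = std_normal.cdf(z / math.sqrt(lam * lam + 1.0))
        u = min(u, 1.0 - 1e-12)  # keep u strictly below 1 for the inverse CDFs
        draws.append(inv_cdf(u))  # the marginal is chosen independently of the factor
    return draws


# Hypothetical marginals: a unit-rate exponential and a 3-level ordinal variable.
def exponential_inverse_cdf(u):
    return -math.log(1.0 - u)


def ordinal_inverse_cdf(u):
    if u < 0.3:
        return 0
    return 1 if u < 0.8 else 2


rng = random.Random(1)
draws = sample_copula_factor(
    [0.9, 0.9], [exponential_inverse_cdf, ordinal_inverse_cdf], rng
)
```

Changing either inverse CDF changes that variable's marginal distribution but leaves the latent Gaussian dependence untouched, which is the property the abstract emphasizes.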
Abstract:
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true when analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data has become unprecedentedly rich. Efficient statistical approaches are therefore crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent Gaussian-distributed multivariate vector. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to these previous statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data and estimating population structure from genotype data.
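The low-rank covariance mentioned above can be made concrete with a short sketch. It assumes the standard factor analysis form, in which the implied covariance is the loadings matrix times its transpose plus a diagonal noise term; the function name `factor_covariance` and the numbers are hypothetical and are not taken from the dissertation.

```python
def factor_covariance(loadings, noise_var):
    """Return Lambda * Lambda^T + diag(noise_var) as a list of lists."""
    p = len(loadings)     # number of observed variables
    k = len(loadings[0])  # number of latent factors
    cov = [
        [sum(loadings[i][h] * loadings[j][h] for h in range(k)) for j in range(p)]
        for i in range(p)
    ]
    for i in range(p):
        cov[i][i] += noise_var[i]  # idiosyncratic variance only on the diagonal
    return cov


# Hypothetical example: 4 observed variables explained by 1 latent factor.
loadings = [[1.0], [0.5], [-0.5], [2.0]]
noise_var = [0.1, 0.2, 0.3, 0.4]
sigma = factor_covariance(loadings, noise_var)
```

With p observed variables and k factors, the off-diagonal structure is determined by p times k loadings rather than by p(p - 1)/2 free covariances, which is where the parsimony comes from.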
Abstract:
Bayesian methods offer a flexible and convenient probabilistic learning framework to extract interpretable knowledge from complex and structured data. Such methods can characterize dependencies among multiple levels of hidden variables and share statistical strength across heterogeneous sources. In the first part of this dissertation, we develop two dependent variational inference methods for full posterior approximation in non-conjugate Bayesian models through hierarchical mixture- and copula-based variational proposals, respectively. The proposed methods move beyond the widely used factorized approximation to the posterior and provide generic applicability to a broad class of probabilistic models with minimal model-specific derivations. In the second part of this dissertation, we design probabilistic graphical models to accommodate multimodal data, describe dynamical behaviors and account for task heterogeneity. In particular, the sparse latent factor model is able to reveal common low-dimensional structures from high-dimensional data. We demonstrate the effectiveness of the proposed statistical learning methods on both synthetic and real-world data.
Abstract:
This article examines the behavior of equity trading volume and volatility for the individual firms composing the Standard & Poor's 100 composite index. Using multivariate spectral methods, we find that fractionally integrated processes best describe the long-run temporal dependencies in both series. Consistent with a stylized mixture-of-distributions hypothesis model in which the aggregate "news"-arrival process possesses long-memory characteristics, the long-run hyperbolic decay rates appear to be common across each volume-volatility pair.
Abstract:
We develop a model for stochastic processes with random marginal distributions. Our model relies on a stick-breaking construction for the marginal distribution of the process, and introduces dependence across locations by using a latent Gaussian copula model as the mechanism for selecting the atoms. The resulting latent stick-breaking process (LaSBP) induces a random partition of the index space, with points closer in space having a higher probability of being in the same cluster. We develop an efficient and straightforward Markov chain Monte Carlo (MCMC) algorithm for computation and discuss applications in financial econometrics and ecology. This article has supplementary material online.
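The two ingredients named in the abstract above, stick-breaking weights and atom selection through a latent Gaussian variable, can be sketched as follows. This is an illustration under assumptions rather than the authors' algorithm: it assumes Beta(1, alpha) stick-breaking proportions, a finite truncation, and selection of the atom whose cumulative-weight interval contains the normal CDF of the latent value. The names `stick_breaking_weights` and `select_atom` are hypothetical.

```python
import random
from statistics import NormalDist


def stick_breaking_weights(alpha, num_atoms, rng):
    """Truncated stick-breaking weights with Beta(1, alpha) proportions (assumed)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(num_atoms - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)  # last atom takes the leftover so weights sum to 1
    return weights


def select_atom(z, weights):
    """Map a latent standard normal value z to an atom index via its CDF."""
    u = NormalDist().cdf(z)
    cumulative = 0.0
    for index, w in enumerate(weights):
        cumulative += w
        if u <= cumulative:
            return index
    return len(weights) - 1  # guard against rounding in the cumulative sum


rng = random.Random(0)
weights = stick_breaking_weights(alpha=2.0, num_atoms=10, rng=rng)
atom = select_atom(0.3, weights)
```

If the latent values at nearby locations are strongly correlated, their CDF values are close and they tend to fall in the same interval, which matches the abstract's statement that nearby points are more likely to share a cluster.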
Abstract:
We describe a strategy for Markov chain Monte Carlo analysis of non-linear, non-Gaussian state-space models involving batch analysis for inference on dynamic, latent state variables and fixed model parameters. The key innovation is a Metropolis-Hastings method for the time series of state variables based on sequential approximation of filtering and smoothing densities using normal mixtures. These mixtures are propagated through the non-linearities using an accurate, local mixture approximation method, and we use a regenerating procedure to deal with potential degeneracy of mixture components. This provides accurate, direct approximations to sequential filtering and retrospective smoothing distributions, and hence a useful construction of global Metropolis proposal distributions for simulation of posteriors for the set of states. This analysis is embedded within a Gibbs sampler to include uncertain fixed parameters. We give an example motivated by an application in systems biology. Supplemental materials provide an example based on a stochastic volatility model as well as MATLAB code.
Abstract:
Tumor microenvironmental stresses, such as hypoxia and lactic acidosis, play important roles in tumor progression. Although gene signatures reflecting the influence of these stresses are powerful approaches to link expression with phenotypes, they do not fully reflect the complexity of human cancers. Here, we describe the use of latent factor models to further dissect the stress gene signatures in a breast cancer expression dataset. The genes in these latent factors are coordinately expressed in tumors and depict distinct, interacting components of the biological processes. The genes in several latent factors are highly enriched in chromosomal locations. When these factors are analyzed in independent datasets with gene expression and array CGH data, the expression values of these factors are highly correlated with copy number alterations (CNAs) of the corresponding BAC clones in both the cell lines and tumors. Therefore, variation in the expression of these pathway-associated factors is at least partially caused by variation in gene dosage and CNAs among breast cancers. We have also found the expression of two latent factors without any chromosomal enrichment is highly associated with 12q CNA, likely an instance of "trans"-variations in which CNA leads to the variations in gene expression outside of the CNA region. In addition, we have found that factor 26 (1q CNA) is negatively correlated with HIF-1alpha protein and hypoxia pathways in breast tumors and cell lines. This agrees with, and for the first time links, known good prognosis associated with both a low hypoxia signature and the presence of CNA in this region. Taken together, these results suggest the possibility that tumor segmental aneuploidy makes significant contributions to variation in the lactic acidosis/hypoxia gene signatures in human cancers and demonstrate that latent factor analysis is a powerful means to uncover such a linkage.
Abstract:
OBJECTIVE: To examine the associations between attention-deficit/hyperactivity disorder (ADHD) symptoms, obesity and hypertension in young adults in a large population-based cohort. DESIGN, SETTING AND PARTICIPANTS: The study population consisted of 15,197 respondents from the National Longitudinal Study of Adolescent Health, a nationally representative sample of adolescents followed from 1995 to 2009 in the United States. Multinomial logistic and logistic models examined the odds of overweight, obesity and hypertension in adulthood in relation to retrospectively reported ADHD symptoms. Latent curve modeling was used to assess the association between symptoms and naturally occurring changes in body mass index (BMI) from adolescence to adulthood. RESULTS: A linear association was identified between the number of inattentive (IN) and hyperactive/impulsive (HI) symptoms and waist circumference, BMI, diastolic blood pressure and systolic blood pressure (all P-values for trend <0.05). Controlling for demographic variables, physical activity, alcohol use, smoking and depressive symptoms, those with three or more HI or IN symptoms had the highest odds of obesity (HI 3+, odds ratio (OR) = 1.50, 95% confidence interval (CI) = 1.22-2.83; IN 3+, OR = 1.21, 95% CI = 1.02-1.44) compared with those with no HI or IN symptoms. HI symptoms at the 3+ level were significantly associated with a higher OR of hypertension (HI 3+, OR = 1.24, 95% CI = 1.01-1.51; HI continuous, OR = 1.04, 95% CI = 1.00-1.09), but associations were nonsignificant when models were adjusted for BMI. Latent growth modeling results indicated that, compared with those reporting no HI or IN symptoms, those reporting three or more symptoms had higher initial levels of BMI during adolescence. Only HI symptoms were associated with change in BMI.
CONCLUSION: Self-reported ADHD symptoms were associated with adult BMI and change in BMI from adolescence to adulthood, providing further evidence of a link between ADHD symptoms and obesity.
Abstract:
We discuss a general approach to dynamic sparsity modeling in multivariate time series analysis. Time-varying parameters are linked to latent processes that are thresholded to induce zero values adaptively, providing natural mechanisms for dynamic variable inclusion/selection. We discuss Bayesian model specification, analysis and prediction in dynamic regressions, time-varying vector autoregressions, and multivariate volatility models using latent thresholding. Application to a topical macroeconomic time series problem illustrates some of the benefits of the approach in terms of statistical and economic interpretations as well as improved predictions. Supplementary materials for this article are available online. © 2013 Copyright Taylor and Francis Group, LLC.
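The thresholding mechanism described in the abstract above can be sketched with a single time-varying coefficient. The sketch assumes the latent process is a stationary AR(1) and that the effective coefficient is set to zero whenever the latent value falls below a fixed threshold in absolute value; the AR(1) choice, the function name `simulate_latent_threshold`, and all parameter values are assumptions for illustration, not details taken from the article.

```python
import random


def simulate_latent_threshold(num_steps, mean, persistence, innovation_sd, threshold, rng):
    """Simulate a latent AR(1) path and its thresholded (sparse) counterpart."""
    latent, effective = [], []
    beta = mean  # start the latent process at its mean
    for _ in range(num_steps):
        beta = mean + persistence * (beta - mean) + rng.gauss(0.0, innovation_sd)
        latent.append(beta)
        # The coefficient enters the model only when it clears the threshold.
        effective.append(beta if abs(beta) >= threshold else 0.0)
    return latent, effective


rng = random.Random(42)
threshold = 0.4
latent, effective = simulate_latent_threshold(
    num_steps=200, mean=0.3, persistence=0.95, innovation_sd=0.1,
    threshold=threshold, rng=rng,
)
num_excluded = sum(1 for value in effective if value == 0.0)
```

Periods in which `effective` is zero correspond to the variable being dropped from the model at those times, so inclusion and exclusion vary over time without a separate selection step.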
Abstract:
Learning multiple tasks across heterogeneous domains is a challenging problem since the feature space may not be the same for different tasks. We assume the data in multiple tasks are generated from a latent common domain via sparse domain transforms and propose a latent probit model (LPM) to jointly learn the domain transforms and the shared probit classifier in the common domain. To learn meaningful task relatedness and avoid over-fitting in classification, we introduce sparsity in the domain transform matrices, as well as in the common classifier. We derive theoretical bounds for the estimation error of the classifier in terms of the sparsity of the domain transforms. An expectation-maximization algorithm is derived for learning the LPM. The effectiveness of the approach is demonstrated on several real datasets.
Abstract:
Imagery and concreteness norms and percentage noun usage were obtained on the 1,080 verbal items from the Toronto Word Pool. Imagery was defined as the rated ease with which a word aroused a mental image, and concreteness was defined in relation to level of abstraction. The degree to which a word was functionally a noun was estimated in a sentence generation task. The mean and standard deviation of the imagery and concreteness ratings for each item are reported together with letter and printed frequency counts for the words and indications of sex differences in the ratings. Additional data in the norms include a grammatical function code derived from dictionary definitions, a percent noun judgment, indexes of statistical approximation to English, and an orthographic neighbor ratio. Validity estimates for the imagery and concreteness ratings are derived from comparisons with scale values drawn from the Paivio, Yuille, and Madigan (1968) noun pool and the Toglia and Battig (1978) norms. © 1982 Psychonomic Society, Inc.
Abstract:
OBJECTIVES: Two factors have been considered important contributors to tooth wear: dietary abrasives in plant foods themselves and mineral particles adhering to ingested food. Each factor limits the functional life of teeth. Cross-population studies of wear rates in a single species living in different habitats may point to the relative contributions of each factor. MATERIALS AND METHODS: We examine macroscopic dental wear in populations of Alouatta palliata (Gray, 1849) from Costa Rica (115 specimens), Panama (19), and Nicaragua (56). The sites differ in mean annual precipitation, with the Panamanian sites receiving more than twice the precipitation of those in Costa Rica or Nicaragua (∼3,500 mm vs. ∼1,500 mm). Additionally, many of the Nicaraguan specimens were collected downwind of active plinian volcanoes. Molar wear is expressed as the ratio of exposed dentin area to tooth area; premolar wear was scored using a ranking system. RESULTS: Despite substantial variation in environmental variables and the added presence of ash in some environments, molar wear rates do not differ significantly among the populations. Premolar wear, however, is greater in individuals collected downwind from active volcanoes compared with those living in environments that did not experience ash-fall. DISCUSSION: Volcanic ash seems to be an important contributor to anterior tooth wear but less so in molar wear. That wear is not found uniformly across the tooth row may be related to malformation in the premolars due to fluorosis. A surge of fluoride accompanying the volcanic ash may differentially affect the premolars as the molars fully mineralize early in the life of Alouatta.