946 results for Bayesian Latent Class
Abstract:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Abstract:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size even though n has increased hugely in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is the design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
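For orientation, the two notions of dimensionality reduction contrasted above can be written out as follows. This is a sketch in commonly used notation; the symbols (k latent classes, weights \nu_h, class-specific category probabilities \lambda, interaction terms \theta_E) are assumptions for illustration and are not taken from the thesis itself.

```latex
% Latent class (PARAFAC) form: the probability tensor is a sum of k rank-one terms.
\Pr(y_1 = c_1, \dots, y_p = c_p)
  = \sum_{h=1}^{k} \nu_h \prod_{j=1}^{p} \lambda^{(j)}_{h c_j},
\qquad \nu_h \ge 0, \quad \sum_{h=1}^{k} \nu_h = 1, \quad
\sum_{c=1}^{d_j} \lambda^{(j)}_{h c} = 1 .

% Log-linear form: sparsity means most interaction terms \theta_E are exactly zero.
\log \Pr(y_1 = c_1, \dots, y_p = c_p)
  = \sum_{E \subseteq \{1, \dots, p\}} \theta_E(c_E) .
```

In the first display a small k gives a low-rank tensor; in the second, a small support (few nonzero \theta_E) gives a sparse model.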
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
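The slow mixing described above can be illustrated with a toy version of the truncated Normal data augmentation sampler for probit regression. The following is a minimal sketch, not the thesis's experiment: it assumes an intercept-only model, a N(0, 100) prior on the intercept, and hypothetical sample sizes and function names (`probit_da_chain`, `lag1_autocorr`) chosen only for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal from the standard library


def sample_truncated_normal(mean, positive, rng):
    """Draw z ~ N(mean, 1) restricted to z > 0 (positive) or z <= 0."""
    p_nonpos = STD_NORMAL.cdf(-mean)  # P(z <= 0)
    lo, hi = (p_nonpos, 1.0) if positive else (0.0, p_nonpos)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # crude guard so inv_cdf stays in (0, 1)
    return mean + STD_NORMAL.inv_cdf(u)


def probit_da_chain(n, successes, iters, seed, prior_var=100.0):
    """Gibbs sampler for an intercept-only probit model, N(0, prior_var) prior."""
    rng = random.Random(seed)
    beta = 0.0
    post_var = 1.0 / (n + 1.0 / prior_var)
    chain = []
    for _ in range(iters):
        # Step 1: impute latent utilities given beta and the observed outcomes.
        z_sum = sum(
            sample_truncated_normal(beta, i < successes, rng) for i in range(n)
        )
        # Step 2: draw beta from its Gaussian full conditional given the latents.
        beta = rng.gauss(post_var * z_sum, post_var ** 0.5)
        chain.append(beta)
    return chain


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a chain."""
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den


rho_balanced = lag1_autocorr(probit_da_chain(n=300, successes=150, iters=300, seed=1))
rho_rare = lag1_autocorr(probit_da_chain(n=300, successes=1, iters=300, seed=1))
print(f"lag-1 autocorrelation, balanced data: {rho_balanced:.2f}")
print(f"lag-1 autocorrelation, rare events:   {rho_rare:.2f}")
```

With the same sample size, the chain for the rare-event data (1 success in 300) should show a markedly higher lag-1 autocorrelation than the chain for balanced data, which is the qualitative behaviour the abstract reports.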
Abstract:
Non-market effects of agriculture are often estimated using discrete choice models from stated preference surveys. In this context we propose two ways of modelling attribute non-attendance. The first involves constraining coefficients to zero in a latent class framework, whereas the second is based on stochastic attribute selection and grounded in Bayesian estimation. Their implications are explored in the context of a stated preference survey designed to value landscapes in Ireland. Taking account of attribute non-attendance with these data improves fit and tends to involve two attributes, one of which is likely to be cost, thereby leading to substantive changes in derived welfare estimates.
Abstract:
Migraine is a common neurological disorder with a strong genetic basis. However, the complex nature of the disorder has meant that few genes or susceptibility loci have been identified and replicated consistently to confirm their involvement in migraine. Approaches to genetic studies of the disorder have included analysis of the rare migraine subtype familial hemiplegic migraine, with several causal genes identified for this severe subtype. However, the exact genetic contributors to the more common migraine subtypes are still to be deciphered. Genome-wide studies such as genome-wide association studies and linkage analysis, as well as candidate gene studies, have been employed to investigate genes involved in common migraine. Neurological, hormonal and vascular genes are all considered key factors in the pathophysiology of migraine and are a focus of many of these studies. It is clear that the influence of individual genes on the expression of this disorder will vary. Furthermore, the disorder may be dependent on gene–gene and gene–environment interactions that have not yet been considered. In addition, identifying susceptibility genes may require phenotyping methods outside of the International Classification of Headache Disorders II criteria, such as trait component analysis and latent class analysis, to better define the ambit of migraine expression.
Abstract:
Background: Population-based surveys demonstrate cannabis users are more likely to use both illicit and licit substances, compared with non-cannabis users. Few studies have examined the substance use profiles of cannabis users referred for treatment. Co-existing mental health symptoms and underlying cannabis-related beliefs associated with these profiles remain unexplored. Methods: Comprehensive drug use and dependence severity (Severity of Dependence Scale-Cannabis) data were collected on a sample of 826 cannabis users referred for treatment. Patients completed the General Health Questionnaire, Cannabis Expectancy Questionnaire, Cannabis Refusal Self-Efficacy Questionnaire, and Positive Symptoms and Manic-Excitement subscales of the Brief Psychiatric Rating Scale. Latent class analysis was performed on last month use of drugs to identify patterns of multiple drug use. Mental health comorbidity and cannabis beliefs were examined by identified drug use pattern. Results: A three-class solution provided the best fit to the data: (1) cannabis and tobacco users (n = 176), (2) cannabis, tobacco, and alcohol users (n = 498), and (3) wide-ranging substance users (n = 132). Wide-ranging substance users (3) reported higher levels of cannabis dependence severity and negative cannabis expectancies, lower opportunistic and emotional relief self-efficacy, higher levels of depression and anxiety, and higher manic-excitement and positive psychotic symptoms. Conclusion: In a sample of cannabis users referred for treatment, wide-ranging substance use was associated with elevated risk on measures of cannabis dependence, co-morbid psychopathology, and dysfunctional cannabis cognitions. These findings have implications for cognitive-behavioral assessment and treatment.
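Latent class analysis of binary use indicators, as in the Methods above, is commonly fitted by expectation-maximization. The sketch below is an illustration under stated assumptions, not the study's analysis: the data are simulated, the five items and two classes are hypothetical, and `fit_lca` is a name introduced here rather than a function from any package.

```python
import math
import random


def fit_lca(data, n_classes, iters=100, seed=0):
    """EM for a latent class model with binary items.

    Returns class weights, per-class item probabilities, and the
    log-likelihood evaluated at the start of each iteration.
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    weights = [1.0 / n_classes] * n_classes
    # probs[k][j] = P(item j = 1 | class k); random start breaks symmetry.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    loglik_trace = []
    for _ in range(iters):
        # E-step: posterior class membership probabilities for each respondent.
        resp = []
        loglik = 0.0
        for row in data:
            joint = [
                weights[k]
                * math.prod(p if x else 1.0 - p for x, p in zip(row, probs[k]))
                for k in range(n_classes)
            ]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([value / total for value in joint])
        loglik_trace.append(loglik)
        # M-step: update class sizes and item probabilities from the memberships.
        for k in range(n_classes):
            mass = sum(r[k] for r in resp)
            weights[k] = mass / len(data)
            probs[k] = [
                # Clamp away from 0 and 1 to avoid log(0) on later iterations.
                min(max(sum(r[k] * row[j] for r, row in zip(resp, data)) / mass,
                        1e-6), 1.0 - 1e-6)
                for j in range(n_items)
            ]
    return weights, probs, loglik_trace


# Simulated example: 30% "wide-ranging" users (each item used with prob. 0.85)
# and 70% "narrow" users (each item used with prob. 0.15).
sim = random.Random(1)
data = []
for _ in range(400):
    p_use = 0.85 if sim.random() < 0.3 else 0.15
    data.append([1 if sim.random() < p_use else 0 for _ in range(5)])

weights, probs, loglik_trace = fit_lca(data, n_classes=2)
print("estimated class sizes:", [round(w, 2) for w in weights])
```

In practice the number of classes is chosen by comparing fit statistics across models with different `n_classes`; that comparison is omitted here to keep the sketch short.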
Abstract:
As a sequel to a paper that dealt with the analysis of two-way quantitative data in large germplasm collections, this paper presents analytical methods appropriate for two-way data matrices consisting of mixed data types, namely, ordered multicategory and quantitative data types. While various pattern analysis techniques have been identified as suitable for analysis of the mixed data types which occur in germplasm collections, the clustering and ordination methods used often cannot deal explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions) with incomplete information. However, it is shown that the ordination technique of principal component analysis and the mixture maximum likelihood method of clustering can be employed to achieve such analyses. Germplasm evaluation data for 11436 accessions of groundnut (Arachis hypogaea L.) from the International Research Institute of the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the post-rainy season and five ordered multicategory descriptors were used. Pattern analysis results generally indicated that the accessions could be distinguished into four regions along the continuum of growth habit (or plant erectness). Interpretation of accession membership in these regions was found to be consistent with taxonomic information, such as subspecies. Each growth habit region contained accessions from three of the most common groundnut botanical varieties. This implies that within each of the habit types there is the full range of expression for the other descriptors used in the analysis. Using these types of insights, the patterns of variability in germplasm collections can provide scientists with valuable information for their plant improvement programs.
Abstract:
This paper evaluates the operational activities of Chinese hydroelectric power companies over the period 2000-2010 using a finite mixture model that controls for unobserved heterogeneity. In so doing, a stochastic frontier latent class model, which allows for the existence of different technologies, is adopted to estimate cost frontiers. This procedure not only enables us to identify different groups among the hydro-power companies analysed, but also permits the analysis of their cost efficiency. The main result is that three groups are identified in the sample, each equipped with different technologies, suggesting that distinct business strategies need to be adapted to the characteristics of China's hydro-power companies. Some managerial implications are developed. © 2012 Elsevier B.V.
Abstract:
This paper evaluates the production activities of Japanese airports by using a finite mixture model that allows controlling for unobserved heterogeneity. In doing so, a stochastic frontier latent class model, which allows the existence of different technologies, is adopted to estimate production frontiers. This procedure not only enables the identification of different groups of Japanese airports but also permits the analysis of their production efficiency. The main result is that there are two groups of Japanese airports, both following completely different "technologies" to obtain passengers and cargo, suggesting that business strategies need to be adapted to the characteristics of the airports. Some managerial implications are developed.
Abstract:
Introduction and aims: Despite evidence that many Australian adolescents have considerable experience with various drug types, little is known about the extent to which adolescents use multiple substances. The aim of this study was to examine the degree of clustering of drug types within individuals, and the extent to which demographic and psychosocial predictors are related to cluster membership. Design and method: A sample of 1402 adolescents aged 12-17 years was extracted from the Australian 2007 National Drug Strategy Household Survey. Extracted data included lifetime use of 10 substances, gender, psychological distress, physical health, perceived peer substance use, socioeconomic disadvantage, and regionality. Latent class analysis was used to determine clusters, and multinomial logistic regression was employed to examine predictors of cluster membership. Result: There were 3 latent classes. The great majority (79.6%) of adolescents used alcohol only, 18.3% were limited range multidrug users (encompassing alcohol, tobacco, and marijuana), and 2% were extended range multidrug users. Perceived peer drug use and psychological distress predicted limited and extended multiple drug use. Psychological distress was a more significant predictor of extended multidrug use compared to limited multidrug use. Discussion and conclusion: In the Australian school-based prevention setting, a very strong focus on alcohol use and the linkages between alcohol, tobacco and marijuana is warranted. Psychological distress may be an important target for screening and early intervention for adolescents who use multiple drugs.
Abstract:
PURPOSE/OBJECTIVES: To identify latent classes of individuals with distinct quality-of-life (QOL) trajectories, to evaluate for differences in demographic characteristics between the latent classes, and to evaluate for variations in pro- and anti-inflammatory cytokine genes between the latent classes. DESIGN: Descriptive, longitudinal study. SETTING: Two radiation therapy departments located in a comprehensive cancer center and a community-based oncology program in northern California. SAMPLE: 168 outpatients with prostate, breast, brain, or lung cancer and 85 of their family caregivers (FCs). METHODS: Growth mixture modeling (GMM) was employed to identify latent classes of individuals based on QOL scores measured prior to, during, and for four months following completion of radiation therapy. Single nucleotide polymorphisms (SNPs) and haplotypes in 16 candidate cytokine genes were tested between the latent classes. Logistic regression was used to evaluate the relationships among genotypic and phenotypic characteristics and QOL GMM group membership. MAIN RESEARCH VARIABLES: QOL latent class membership and variations in cytokine genes. FINDINGS: Two latent QOL classes were found: higher and lower. Patients and FCs who were younger, identified with an ethnic minority group, had poorer functional status, or had children living at home were more likely to belong to the lower QOL class. After controlling for significant covariates, between-group differences were found in SNPs in interleukin 1 receptor 2 (IL1R2) and nuclear factor kappa beta 2 (NFKB2). For IL1R2, carrying one or two doses of the rare C allele was associated with decreased odds of belonging to the lower QOL class. For NFKB2, carriers with two doses of the rare G allele were more likely to belong to the lower QOL class. CONCLUSIONS: Unique genetic markers in cytokine genes may partially explain interindividual variability in QOL. 
IMPLICATIONS FOR NURSING: Determination of high-risk characteristics and unique genetic markers would allow for earlier identification of patients with cancer and FCs at higher risk for poorer QOL. Knowledge of these risk factors could assist in the development of more targeted clinical or supportive care interventions for those identified.
Abstract:
The scale of environmental problems in China is clearly evident. This paper analyses foreign direct investment (FDI) in China with a finite mixture model, also known as a latent class model, to understand the relationship between FDI and several pollutants. The model regresses FDI on a set of covariates that includes pollutants. The results reveal that FDI is affected by pollutants. In some cases, reducing pollution deters foreign investment in China.
Abstract:
This is a methodological paper describing when and how manifest items dropped from a latent construct measurement model (e.g., factor analysis) can be retained for additional analysis. Presented are protocols for assessment for retention in the measurement model, evaluation of dropped items as potential items separate from the latent construct, and post hoc analyses that can be conducted using all retained (manifest or latent) variables. The protocols are then applied to data relating to the impact of the NAPLAN test. The variables examined are teachers’ achievement goal orientations and teachers’ perceptions of the impact of the test on curriculum and pedagogy. It is suggested that five attributes be considered before retaining dropped manifest items for additional analyses. (1) Items can be retained when employed in service of an established or hypothesized theoretical model. (2) Items should only be retained if sufficient variance is present in the data set. (3) Items can be retained when they provide a rational segregation of the data set into subsamples (e.g., a consensus measure). (4) The value of retaining items can be assessed using latent class analysis or latent mean analysis. (5) Items should be retained only when post hoc analyses with these items produce significant and substantive results. These suggested exploratory strategies are presented so that other researchers using survey instruments might explore their data in similar and more innovative ways. Finally, suggestions for future use are provided.
Abstract:
This paper addresses the gap in economic theory underlying the multidimensional concept of food security and observed data by deriving a composite food security index using the latent class model. The link between poverty and food security is then examined using the new food security index and the robustness of the link is compared with two unidimensional measures often used in the literature. Using Vietnam as a case study, it was found that a weak link exists for the rural but not for the urban composite food security index. The unidimensional measures on the other hand show a strong link in both the rural and urban regions. The results on the link are also different and mixed when two poverty types given by persistent and transient poverty are considered. These findings have important policy implications for a targeted approach to addressing food security.
Abstract:
Prior genome-wide association studies (GWAS) of major depressive disorder (MDD) have met with limited success. We sought to increase statistical power to detect disease loci by conducting a GWAS mega-analysis for MDD. In the MDD discovery phase, we analyzed more than 1.2 million autosomal and X chromosome single-nucleotide polymorphisms (SNPs) in 18 759 independent and unrelated subjects of recent European ancestry (9240 MDD cases and 9519 controls). In the MDD replication phase, we evaluated 554 SNPs in independent samples (6783 MDD cases and 50 695 controls). We also conducted a cross-disorder meta-analysis using 819 autosomal SNPs with P<0.0001 for either MDD or the Psychiatric GWAS Consortium bipolar disorder (BIP) mega-analysis (9238 MDD cases/8039 controls and 6998 BIP cases/7775 controls). No SNPs achieved genome-wide significance in the MDD discovery phase, the MDD replication phase or in pre-planned secondary analyses (by sex, recurrent MDD, recurrent early-onset MDD, age of onset, pre-pubertal onset MDD or typical-like MDD from a latent class analysis of the MDD criteria). In the MDD-bipolar cross-disorder analysis, 15 SNPs exceeded genome-wide significance (P<5 × 10^-8), and all were in a 248 kb interval of high LD on 3p21.1 (chr3:52 425 083-53 822 102, minimum P=5.9 × 10^-9 at rs2535629). Although this is the largest genome-wide analysis of MDD yet conducted, the high prevalence of MDD means that the sample is still underpowered to detect genetic effects typical for complex traits. Therefore, we were unable to identify robust and replicable findings. We discuss what this means for genetic research for MDD. The 3p21.1 MDD-BIP finding should be interpreted with caution as the most significant SNP did not replicate in MDD samples, and genotyping in independent samples will be needed to resolve its status.
Abstract:
The aims of this dissertation were 1) to investigate associations of weight status of adolescents with leisure activities, and computer and cell phone use, and 2) to investigate environmental and genetic influences on body mass index (BMI) during adolescence. Finnish twins born in 1983–1987 responded to postal questionnaires at the ages of 11-12 (5184 participants), 14 (4643 participants), and 17 years (4168 participants). Information was obtained on weight and height, leisure activities including television viewing, video viewing, computer games, listening to music, board games, musical instrument playing, reading, arts, crafts, socializing, clubs, sports, and outdoor activities, as well as computer and cell phone use. Activity patterns were studied using latent class analysis. The relationship between leisure activities and weight status was investigated using logistic and linear regression. Genetic and environmental effects on BMI were studied using twin modeling. Of individual leisure activities, sports were associated with decreased overweight risk among boys in both cross-sectional and longitudinal analyses, but among girls only cross-sectionally. Many sedentary leisure activities, such as video viewing (boys/girls), arts (boys), listening to music (boys), crafts (girls), and board games (girls), had positive associations with being overweight. Computer use was associated with a higher prevalence of overweight in cross-sectional analyses. However, musical instrument playing, commonly considered as a sedentary activity, was associated with a decreased overweight risk among boys. Four patterns of leisure activities were found: ‘Active and sociable’, ‘Active but less sociable’, ‘Passive but sociable’, and ‘Passive and solitary’. The prevalence of overweight was generally highest among the ‘Passive and solitary’ adolescents. Overall, leisure activity patterns did not predict overweight risk later in adolescence. 
An exception was 14-year-old ‘Passive and solitary’ girls, who had the greatest risk of becoming overweight by 17 years of age. Heritability of BMI was high (0.58-0.83). Common environmental factors shared by family members affected the BMI at 11-12 and 14 years but their effect had disappeared by 17 years of age. Additive genetic factors explained 90-96% of the BMI stability across adolescence. Genetic correlations across adolescence were high, which suggests similar genetic effects on BMI throughout adolescence, while unique environmental effects on BMI appeared to vary. These findings suggest that family-based interventions hold promise for obesity prevention into early and middle adolescence, but that later in adolescence obesity prevention should focus on individuals. A useful target could be adolescents' leisure time, and our findings highlight the importance of versatility in leisure activities.