3 resultados para Semi-supervised learning methods

em Duke University


Relevância:

100.00% 100.00%

Publicador:

Resumo:

OBJECTIVE: To demonstrate the application of causal inference methods to observational data in the obstetrics and gynecology field, particularly causal modeling and semi-parametric estimation. BACKGROUND: Human immunodeficiency virus (HIV)-positive women are at increased risk for cervical cancer and its treatable precursors. Determining whether potential risk factors such as hormonal contraception are true causes is critical for informing public health strategies as longevity increases among HIV-positive women in developing countries. METHODS: We developed a causal model of the factors related to combined oral contraceptive (COC) use and cervical intraepithelial neoplasia 2 or greater (CIN2+) and modified the model to fit the observed data, drawn from women in a cervical cancer screening program at HIV clinics in Kenya. Assumptions required for substantiation of a causal relationship were assessed. We estimated the population-level association using semi-parametric methods: g-computation, inverse probability of treatment weighting, and targeted maximum likelihood estimation. RESULTS: We identified 2 plausible causal paths from COC use to CIN2+: via HPV infection and via increased disease progression. Study data enabled estimation of the latter only with strong assumptions of no unmeasured confounding. Of 2,519 women under 50 screened per protocol, 219 (8.7%) were diagnosed with CIN2+. Marginal modeling suggested a 2.9% (95% confidence interval 0.1%, 6.9%) increase in prevalence of CIN2+ if all women under 50 were exposed to COC; the significance of this association was sensitive to method of estimation and exposure misclassification. CONCLUSION: Use of causal modeling enabled clear representation of the causal relationship of interest and the assumptions required to estimate that relationship from the observed data. Semi-parametric estimation methods provided flexibility and reduced reliance on correct model form. Although selected results suggest an increased prevalence of CIN2+ associated with COC, evidence is insufficient to conclude causality. Priority areas for future studies to better satisfy causal criteria are identified.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Constant technology advances have caused data explosion in recent years. Accord- ingly modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This phenomenon is particularly true for an- alyzing biological data. For example DNA sequence data can be viewed as categorical variables with each nucleotide taking four different categories. The gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data becomes unprecedentedly rich. Therefore efficient statistical approaches are crucial in this big data era.

Previous statistical methods for big data often aim to find low dimensional struc- tures in the observed data. For example in a factor analysis model a latent Gaussian distributed multivariate vector is assumed. With this assumption a factor model produces a low rank estimation of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents. The mixture pro- portions of topics, represented by a Dirichlet distributed variable, is assumed. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. Those novel methods are applied in multiple real world applications including construction of condition specific gene co-expression networks, estimating shared topics among newsgroups, analysis of pro- moter sequences, analysis of political-economics risk data and estimating population structure from genotype data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Bayesian methods offer a flexible and convenient probabilistic learning framework to extract interpretable knowledge from complex and structured data. Such methods can characterize dependencies among multiple levels of hidden variables and share statistical strength across heterogeneous sources. In the first part of this dissertation, we develop two dependent variational inference methods for full posterior approximation in non-conjugate Bayesian models through hierarchical mixture- and copula-based variational proposals, respectively. The proposed methods move beyond the widely used factorized approximation to the posterior and provide generic applicability to a broad class of probabilistic models with minimal model-specific derivations. In the second part of this dissertation, we design probabilistic graphical models to accommodate multimodal data, describe dynamical behaviors and account for task heterogeneity. In particular, the sparse latent factor model is able to reveal common low-dimensional structures from high-dimensional data. We demonstrate the effectiveness of the proposed statistical learning methods on both synthetic and real-world data.