4 results for model selection in binary regression
at Duke University
Abstract:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge lies in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
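The four steps of message described above (select in parallel, take the median inclusion, refit in parallel, average) can be sketched as follows. This is only an illustrative sketch, not the thesis implementation: the function names `lasso_cd` and `message_fit`, the penalty `lam`, the use of a plain coordinate-descent lasso as the selection method, and the rule that a median of at least 0.5 counts as inclusion are all assumptions, and the loop over subsets stands in for work that would run on separate machines.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1 (hypothetical helper)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            # correlation of feature j with the partial residual
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def message_fit(X, y, m=5, lam=0.1, seed=0):
    """Sketch of message: returns (selected feature mask, averaged coefficients)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(n), m)  # partition in the sample space
    # Step 1: feature selection on each subset (assumed here: lasso)
    inclusion = np.array([lasso_cd(X[idx], y[idx], lam) != 0 for idx in subsets])
    # Step 2: 'median' feature inclusion index across subsets (ties count as included: an assumption)
    selected = np.median(inclusion.astype(float), axis=0) >= 0.5
    # Step 3: least-squares refit on the selected features within each subset
    coefs = np.zeros((m, p))
    if selected.any():
        for k, idx in enumerate(subsets):
            sol, *_ = np.linalg.lstsq(X[idx][:, selected], y[idx], rcond=None)
            coefs[k, selected] = sol
    # Step 4: average the subset estimates
    return selected, coefs.mean(axis=0)
```

Only the inclusion vectors and the refitted coefficients need to leave each subset, which is one way to read the "minimal communication" property claimed above.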
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
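The abstract does not spell out the decorrelation step, so the following is a hedged sketch of one way it could look when there are more features than samples: premultiply both the design matrix and the response by sqrt(p) (X X^T + r I)^(-1/2), then split the columns among m workers. The function name `deco_decorrelate`, the ridge parameter `r`, and the choice of this particular whitening matrix are assumptions made for illustration.

```python
import numpy as np

def deco_decorrelate(X, y, m, r=0.0):
    """Whiten the rows of X, then partition the columns among m workers (illustrative sketch)."""
    n, p = X.shape
    # Symmetric inverse square root of X X^T + r I via an eigendecomposition
    w, V = np.linalg.eigh(X @ X.T + r * np.eye(n))
    F = np.sqrt(p) * (V / np.sqrt(w)) @ V.T
    Xt, yt = F @ X, F @ y  # decorrelated design and response
    col_groups = np.array_split(np.arange(p), m)  # partition in the feature space
    # Each worker would receive yt together with its own column block of Xt
    return Xt, yt, [(cols, Xt[:, cols]) for cols in col_groups]
```

With r = 0 and X of full row rank, the transformed design satisfies Xt Xt^T = p I, which is the sense in which this sketch "decorrelates" the data. Each worker could then run any high-dimensional fitting method on its block.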
For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework DEME (DECO-message) by leveraging both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted on a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable.
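The block structure described above can be illustrated with a short sketch that covers only the partitioning: rows are split first, then columns within each row group. It does not include the decorrelation, fitting, or aggregation steps, and the function name `partition_blocks` and the even splitting rule are assumptions.

```python
import numpy as np

def partition_blocks(X, n_row_parts, n_col_parts):
    """Split X into an n_row_parts x n_col_parts grid of blocks (illustrative sketch)."""
    row_groups = np.array_split(np.arange(X.shape[0]), n_row_parts)  # sample-space split
    col_groups = np.array_split(np.arange(X.shape[1]), n_col_parts)  # feature-space split
    return [[X[np.ix_(r, c)] for c in col_groups] for r in row_groups]

# Toy 6 x 8 matrix split into 3 row groups and 2 column groups
X = np.arange(6 * 8, dtype=float).reshape(6, 8)
blocks = partition_blocks(X, n_row_parts=3, n_col_parts=2)
```

Reassembling the grid with `np.block(blocks)` recovers the original matrix, which confirms that the two-stage split is just a block partition of the data.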
Abstract:
Background: Although many studies have investigated sexual communication between parents and children in Kenya, none have focused solely on grandparent and grandchild communication when grandparents are primary caregivers. Further, few studies have asked about specific topics related to sex, instead asking generally about “sex-related topics” or focusing on HIV/AIDS. This research aims to investigate communication on ten specific sex-related topics between grandparents who are primary caregivers and their grandchildren. The primary research aim was to identify facilitators and barriers to grandparent-grandchild communication associated with frequency of communication. A secondary exploratory question was whether frequency of communication and youth satisfaction with communication were associated with youth’s desire for more communication in the future. Methods: The study was conducted in urban and peri-urban central Kenya. A convenience sample of 193 grandparents and 166 grandchildren aged twelve to fifteen was identified by community health workers. A cross-sectional survey assessed nine potential barriers or facilitators to communication (e.g., frequency of communication, perceived grandparent knowledge, grandparent sense of responsibility to communicate on a given topic) on ten specified sex-related topics (e.g., peer pressure on sex topics, romantic relationships, condoms). Bivariate and multivariable analyses identified significant associations between communication variables and the outcomes of interest. Results: Bivariate regression showed that higher grandchild age, grandchild gender, higher perceived grandparent knowledge, higher perceived grandparent comfort, higher grandparent-reported sense of responsibility, higher grandparent-reported belief that the child should be aware of a given topic before initiating sex, and higher youth’s own comfort during communication were significantly associated with higher levels of communication frequency.
In the multivariable model, higher grandchild age, gender, higher comfort during communication, and higher perceived grandparent knowledge remained significantly associated with higher levels of communication frequency. For the secondary research question, higher communication frequency and higher levels of youth satisfaction were both significantly associated with higher levels of youth desire for more communication in bivariate regression, and higher levels of youth’s satisfaction with communication remained significantly associated with higher levels of youth’s desire for more communication in the adjusted analysis. Conclusions: This study found that several potential barriers and facilitators of communication are associated with both frequency of and youth’s desire for more communication. The association of grandchild age, gender and perceived grandparent knowledge with frequency of communication is similar to findings from other studies that have examined sex-related communication between parent primary caregivers and children. This finding has important implications for understanding grandparent and grandchild communication, and communication on specific topics in a population from Kenya. The positive association between youth satisfaction with and desire for more communication has important education policy and intervention implications, suggesting that if youth are satisfied with the communication with their caregivers, they may want to learn more.
Abstract:
Mixtures of Zellner's g-priors have been studied extensively in linear models and have been shown to have numerous desirable properties for Bayesian variable selection and model averaging. Several extensions of g-priors to Generalized Linear Models (GLMs) have been proposed in the literature; however, the choice of prior distribution of g and the resulting properties for inference have received considerably less attention. In this paper, we extend mixtures of g-priors to GLMs by assigning the truncated Compound Confluent Hypergeometric (tCCH) distribution to 1/(1+g) and illustrate how this prior distribution encompasses several special cases of mixtures of g-priors in the literature, such as the Hyper-g, truncated Gamma, Beta-prime, and the Robust prior. Under an integrated Laplace approximation to the likelihood, the posterior distribution of 1/(1+g) is in turn a tCCH distribution, and approximate marginal likelihoods are thus available analytically. We discuss the local geometric properties of the g-prior in GLMs and show that specific choices of the hyper-parameters satisfy the various desiderata for model selection proposed by Bayarri et al., such as asymptotic model selection consistency, information consistency, intrinsic consistency, and measurement invariance. We also illustrate inference using these priors and contrast them to others in the literature via simulation and real examples.
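As background for why the prior is placed on 1/(1+g), it may help to recall the standard linear-model case, which is not the GLM construction of the paper itself. The notation below (u for 1/(1+g), a for the Hyper-g hyper-parameter, a centered design X, a zero prior mean) is assumed for illustration. Under Zellner's g-prior the posterior mean shrinks the least-squares estimate by the factor g/(1+g) = 1 - u, and a change of variables shows that the Hyper-g prior on g is a Beta prior on u:

```latex
\[
\beta \mid g, \sigma^2 \sim \mathcal{N}\!\left(0,\; g\,\sigma^2 (X^\top X)^{-1}\right),
\qquad
\mathbb{E}[\beta \mid y, g] = \frac{g}{1+g}\,\hat\beta_{\mathrm{OLS}} = (1-u)\,\hat\beta_{\mathrm{OLS}},
\qquad u = \frac{1}{1+g}.
\]
\[
p(g) = \frac{a-2}{2}\,(1+g)^{-a/2},\quad g > 0,\ a > 2
\;\;\Longrightarrow\;\;
p(u) = p\bigl(g(u)\bigr)\left|\frac{dg}{du}\right|
     = \frac{a-2}{2}\,u^{a/2}\,u^{-2}
     = \frac{a-2}{2}\,u^{a/2-2},\quad 0 < u < 1,
\]
\[
\text{so that } u \sim \mathrm{Beta}\!\left(\tfrac{a}{2}-1,\; 1\right).
\]
```

This is the sense in which the Hyper-g prior can appear as a special case of a family of distributions on 1/(1+g) supported on the unit interval.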