4 resultados para size-selection
em DigitalCommons@The Texas Medical Center
Resumo:
With hundreds of single nucleotide polymorphisms (SNPs) in a candidate gene and millions of SNPs across the genome, selecting an informative subset of SNPs to maximize the ability to detect genotype-phenotype association is of great interest and importance. In addition, with a large number of SNPs, analytic methods are needed that allow investigators to control the false positive rate resulting from large numbers of SNP genotype-phenotype analyses. This dissertation uses simulated data to explore methods for selecting SNPs for genotype-phenotype association studies. I examined the pattern of linkage disequilibrium (LD) across a candidate gene region and used this pattern to aid in localizing a disease-influencing mutation. The results indicate that the r2 measure of linkage disequilibrium is preferred over the common D′ measure for use in genotype-phenotype association studies. Using step-wise linear regression, the best predictor of the quantitative trait was not usually the single functional mutation. Rather it was a SNP that was in high linkage disequilibrium with the functional mutation. Next, I compared three strategies for selecting SNPs for application to phenotype association studies: based on measures of linkage disequilibrium, based on a measure of haplotype diversity, and random selection. The results demonstrate that SNPs selected based on maximum haplotype diversity are more informative and yield higher power than randomly selected SNPs or SNPs selected based on low pair-wise LD. The data also indicate that for genes with small contribution to the phenotype, it is more prudent for investigators to increase their sample size than to continuously increase the number of SNPs in order to improve statistical power. When typing large numbers of SNPs, researchers are faced with the challenge of utilizing an appropriate statistical method that controls the type I error rate while maintaining adequate power. We show that an empirical genotype based multi-locus global test that uses permutation testing to investigate the null distribution of the maximum test statistic maintains a desired overall type I error rate while not overly sacrificing statistical power. The results also show that when the penetrance model is simple the multi-locus global test does as well or better than the haplotype analysis. However, for more complex models, haplotype analyses offer advantages. The results of this dissertation will be of utility to human geneticists designing large-scale multi-locus genotype-phenotype association studies. ^
Resumo:
Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed. Also, synthetic positive control was used to validate Random Forests method. Throughout this study we showed that Random Forests can deal with large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. ^ Random Forests can deal with the large-scale data sets without rigorous data preprocessing. It has robust variable importance ranking measure. Proposed is a novel variable selection method in context of Random Forests that uses the data noise level as the cut-off value to determine the subset of the important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors for complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. ^ When the data set had high variables to observations ratio, Random Forests complemented the established logistic regression. This study suggested that Random Forests is recommended for such high dimensionality data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations. ^ We also found that the mean decrease of accuracy is a more reliable variable ranking measurement than mean decrease of Gini. ^
Resumo:
Sizes and power of selected two-sample tests of the equality of survival distributions are compared by simulation for small samples from unequally, randomly-censored exponential distributions. The tests investigated include parametric tests (F, Score, Likelihood, Asymptotic), logrank tests (Mantel, Peto-Peto), and Wilcoxon-Type tests (Gehan, Prentice). Equal sized samples, n = 18, 16, 32 with 1000 (size) and 500 (power) simulation trials, are compared for 16 combinations of the censoring proportions 0%, 20%, 40%, and 60%. For n = 8 and 16, the Asymptotic, Peto-Peto, and Wilcoxon tests perform at nominal 5% size expectations, but the F, Score and Mantel tests exceeded 5% size confidence limits for 1/3 of the censoring combinations. For n = 32, all tests showed proper size, with the Peto-Peto test most conservative in the presence of unequal censoring. Powers of all tests are compared for exponential hazard ratios of 1.4 and 2.0. There is little difference in power characteristics of the tests within the classes of tests considered. The Mantel test showed 90% to 95% power efficiency relative to parametric tests. Wilcoxon-type tests have the lowest relative power but are robust to differential censoring patterns. A modified Peto-Peto test shows power comparable to the Mantel test. For n = 32, a specific Weibull-exponential comparison of crossing survival curves suggests that the relative powers of logrank and Wilcoxon-type tests are dependent on the scale parameter of the Weibull distribution. Wilcoxon-type tests appear more powerful than logrank tests in the case of late-crossing and less powerful for early-crossing survival curves. Guidelines for the appropriate selection of two-sample tests are given. ^
Resumo:
When choosing among models to describe categorical data, the necessity to consider interactions makes selection more difficult. With just four variables, considering all interactions, there are 166 different hierarchical models and many more non-hierarchical models. Two procedures have been developed for categorical data which will produce the "best" subset or subsets of each model size where size refers to the number of effects in the model. Both procedures are patterned after the Leaps and Bounds approach used by Furnival and Wilson for continuous data and do not generally require fitting all models. For hierarchical models, likelihood ratio statistics (G('2)) are computed using iterative proportional fitting and "best" is determined by comparing, among models with the same number of effects, the Pr((chi)(,k)('2) (GREATERTHEQ) G(,ij)('2)) where k is the degrees of freedom for ith model of size j. To fit non-hierarchical as well as hierarchical models, a weighted least squares procedure has been developed.^ The procedures are applied to published occupational data relating to the occurrence of byssinosis. These results are compared to previously published analyses of the same data. Also, the procedures are applied to published data on symptoms in psychiatric patients and again compared to previously published analyses.^ These procedures will make categorical data analysis more accessible to researchers who are not statisticians. The procedures should also encourage more complex exploratory analyses of epidemiologic data and contribute to the development of new hypotheses for study. ^