3 results for Classifier Generalization Ability
in DigitalCommons@The Texas Medical Center
Abstract:
Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can deal with a large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities.
Random Forests can deal with large-scale data sets without rigorous data preprocessing, and it has a robust variable importance ranking measure. We propose a novel variable selection method in the context of Random Forests that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments.
When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. This study suggests using Random Forests for such high-dimensional data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations.
We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
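The noise-level cut-off described in this abstract can be illustrated with a minimal sketch. It assumes the noise level is estimated from the importance scores of variables known to carry no signal (for example, permuted copies of real predictors). The abstract does not say how the noise level was measured, so this rule, the function name `select_above_noise`, and all scores below are hypothetical.

```python
def select_above_noise(importances, noise_importances):
    """Return (cutoff, selected), keeping predictors that score above the noise level.

    Assumption: the noise level is the largest importance observed among
    variables known to be pure noise. This is an illustrative rule, not
    necessarily the one used in the thesis.
    """
    cutoff = max(noise_importances.values())
    selected = [name for name, score in importances.items() if score > cutoff]
    # Rank the retained predictors from most to least important.
    selected.sort(key=lambda name: importances[name], reverse=True)
    return cutoff, selected


# Hypothetical mean-decrease-in-accuracy scores, for illustration only.
importances = {"x1": 0.120, "x2": 0.050, "x3": 0.004, "x4": -0.001}
noise_importances = {"noise_1": 0.006, "noise_2": -0.003}

cutoff, selected = select_above_noise(importances, noise_importances)
print(cutoff, selected)
```

Here `x3` and `x4` are dropped because they score no higher than the noisiest known-noise variable. Raising or lowering the cut-off, as the abstract suggests doing after positive control experiments, would amount to replacing `max(...)` with a different threshold.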
Abstract:
In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences in genotype frequencies between disease and healthy groups or among different population groups. However, testing a great number of SNPs simultaneously raises the problem of multiple testing and produces false-positive results. Although this problem can be dealt with effectively through approaches such as Bonferroni correction, permutation testing, and false discovery rates, the patterns of joint effects of several genes, each with a weak effect, may remain undetected. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. An exhaustive search of all SNP subsets is computationally infeasible for the millions of SNPs in a genome-wide study. Several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset in big data sets where the number of feature SNPs far exceeds the number of observations.
In this study, we take two steps to achieve the goal. First, we selected 1000 SNPs through an effective filter method, and then we performed feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. We also developed a novel classification method, the sequential information bottleneck method, wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with classical linear discriminant analysis in terms of classification performance. Finally, we performed a chi-square test to look at the relationship between each SNP and disease from another point of view.
In general, our results show that filtering features using the harmonic mean of sensitivity and specificity (HMSS) through linear discriminant analysis (LDA) is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that an exhaustive search over small subsets of one, two, or three SNPs, based on the best 100 composite 2-SNPs, can find an optimal subset, and that including further SNPs through a heuristic algorithm does not always improve the performance of SNP subsets. Although sequential forward floating selection can be applied to avoid the nesting effect of forward selection, it does not always outperform the latter, because observing more complex subset states leads to overfitting.
Our results also indicate that HMSS, as a criterion for evaluating the classification ability of a function, can be used on imbalanced data without modifying the original dataset, unlike classification accuracy. Our four studies suggest that Sequential Information Bottleneck (sIB), a new unsupervised technique, can be adopted to predict the outcome, and that its ability to detect the target status is superior to that of traditional LDA in the study.
From our results we can see that the best test probability-HMSS for predicting CVD, stroke, CAD, and psoriasis through sIB is 0.59406, 0.641815, 0.645315, and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls can reach 0.708999, 0.863216, 0.639918, and 0.850275, respectively, in the four studies if the test accuracy among cases is required to be not less than 0.4. On the other hand, the highest test accuracy of sIB for diagnosing a disease among cases can reach 0.748644, 0.789916, 0.705701, and 0.749436, respectively, in the four studies if the test accuracy among controls is required to be at least 0.4.
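To make the HMSS criterion concrete, here is a minimal sketch that computes it from confusion-matrix counts and contrasts it with plain accuracy on an imbalanced test set. The function names and the counts are hypothetical and are not taken from the four studies.

```python
def hmss(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


def accuracy(tp, fn, tn, fp):
    """Fraction of all observations classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical imbalanced test set: 10 cases and 90 controls.
# A classifier that labels every observation as a control:
always_control = dict(tp=0, fn=10, tn=90, fp=0)
# A classifier that is correct for 80% of cases and 80% of controls:
balanced = dict(tp=8, fn=2, tn=72, fp=18)

print(accuracy(**always_control), hmss(**always_control))
print(accuracy(**balanced), hmss(**balanced))
```

The trivial classifier reaches an accuracy of 0.9 yet an HMSS of 0, because it never detects a case, while the second classifier scores about 0.8 on both measures. This is the sense in which HMSS can be applied to imbalanced data without resampling the dataset.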
A further genome-wide association study using the chi-square test shows that no significant SNPs are detected at the cut-off level 9.09451E-08 in the Framingham Heart Study of CVD. In the WTCCC study, only two significant SNPs associated with CAD are detected. In the genome-wide study of psoriasis, most of the top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease through the chi-square test at the cut-off value 1.11E-07.
Although our classification methods can achieve high accuracy in the study, complete descriptions of those classification results (95% confidence intervals or statistical tests of differences) require more cost-effective methods or a more efficient computing system, neither of which is currently available for our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high prediction ability, and that SNPs with good discriminant power are not necessarily causal markers for the disease.
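The abstract does not state how its significance cut-offs were derived. As an illustration only, the sketch below shows how a Bonferroni correction, one of the approaches named earlier in the abstract, yields a per-test cut-off of this order of magnitude. The SNP count and p-values are hypothetical.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cut-off that controls the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_indices(p_values, alpha=0.05):
    """Indices of the tests whose p-value falls below the Bonferroni cut-off."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Hypothetical genome-wide scan with 500,000 SNPs (not a figure from the study).
genome_wide_cutoff = bonferroni_cutoff(0.05, 500_000)
print(genome_wide_cutoff)

# Small hypothetical example: four tests give a cut-off of 0.05 / 4 = 0.0125.
hits = significant_indices([0.001, 0.02, 0.04, 0.3])
print(hits)
```

With half a million tests the cut-off is about 1E-07, which shows why a SNP that discriminates well in a classifier can still fail a genome-wide significance test.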
Abstract:
We designed and synthesized a novel daunorubicin (DNR) analogue that effectively circumvents P-glycoprotein (P-gp)-mediated drug resistance. The fully protected carbohydrate intermediate 1,2-dibromoacosamine was prepared from acosamine and effectively coupled to daunomycinone in high yield. Deprotection under alkaline conditions yielded 2′-bromo-4′-epidaunorubicin (WP401). The in vitro cytotoxicity and cellular and molecular pharmacology of WP401 were compared with those of DNR in a panel of wild-type cell lines (KB-3-1, P388S, and HL60S) and their multidrug-resistant (MDR) counterparts (KB-V1, P388/DOX, and HL60/DOX). Fluorescence spectrophotometry, flow cytometry, and confocal laser scanning microscopy were used to measure intracellular accumulation, retention, and subcellular distribution of these agents. All MDR cell lines exhibited reduced DNR uptake that was restored, upon incubation with either verapamil (VER) or cyclosporin A (CSA), to the level found in sensitive cell lines. In contrast, the uptake of WP401 was essentially the same in the absence or presence of VER or CSA in all tested cell lines. The in vitro cytotoxicity of WP401 was similar to that of DNR in the sensitive cell lines but significantly higher in resistant cell lines (resistance index (RI) of 2-6 for WP401 vs 75-85 for DNR). To ascertain whether drug-mediated cytotoxicity and retention were accompanied by DNA strand breaks, DNA single- and double-strand breaks were assessed by alkaline elution. High levels of such breaks were obtained using 0.1-2 µg/mL of WP401 in both sensitive and resistant cells. In contrast, DNR caused strand breaks in sensitive cells but few in resistant cells. We also compared drug-induced DNA fragmentation similar to that induced by DNR. However, in P-gp-positive cells, WP401 induced 2- to 5-fold more DNA fragmentation than DNR.
This increased DNA strand breakage by WP401 was correlated with its increased uptake and cytotoxicity in these cell lines. Overall, these results indicate that WP401 is more cytotoxic than DNR in MDR cells and that this phenomenon might be related to the reduced basicity of the amino group and increased lipophilicity of WP401.
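The resistance index quoted above can be read with a short sketch. It assumes the usual definition, the ratio of the drug concentration that inhibits 50% of cell growth (IC50) in the resistant line to that in its sensitive parent line; the abstract does not spell the definition out. The IC50 values are hypothetical, chosen only so that the ratios fall inside the ranges quoted in the abstract, and are not measured data.

```python
def resistance_index(ic50_resistant, ic50_sensitive):
    """Ratio of IC50 in the resistant line to IC50 in the sensitive parent line."""
    if ic50_sensitive <= 0:
        raise ValueError("ic50_sensitive must be positive")
    return ic50_resistant / ic50_sensitive


# Hypothetical IC50 values (same units for each pair), for illustration only.
ri_dnr = resistance_index(ic50_resistant=40.0, ic50_sensitive=0.5)
ri_wp401 = resistance_index(ic50_resistant=2.0, ic50_sensitive=0.5)
print(ri_dnr, ri_wp401)
```

An index near 1 means the resistant line is about as sensitive as its parent, so a drug with a low index, as reported for WP401, retains most of its potency in MDR cells.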