3 results for Classification models
in DigitalCommons@The Texas Medical Center
Abstract:
Anticancer drugs typically are administered in the clinic in the form of mixtures, sometimes called combinations. Only in rare cases, however, are mixtures approved as drugs. Rather, research on mixtures tends to occur after single drugs have been approved. The goal of this research project was to develop modeling approaches that would encourage rational preclinical mixture design. To this end, a series of models was developed. First, several QSAR classification models were constructed to predict the cytotoxicity, oral clearance, and acute systemic toxicity of drugs. The QSAR models were applied to a set of over 115,000 natural compounds in order to identify promising ones for testing in mixtures. Second, an improved method was developed to assess synergistic, antagonistic, and additive effects between drugs in a mixture. This method, dubbed the MixLow method, is similar to the Median-Effect method, the de facto standard for assessing drug interactions. The primary difference between the two is that the MixLow method uses a nonlinear mixed-effects model to estimate parameters of concentration-effect curves, rather than an ordinary least squares procedure. Parameter estimators produced by the MixLow method were more precise than those produced by the Median-Effect method, and coverage of Loewe index confidence intervals was superior. Third, a model was developed to predict drug interactions based on scores obtained from virtual docking experiments. This represents a novel approach for modeling drug mixtures and was more useful for the data modeled here than competing approaches. The model was applied to cytotoxicity data for 45 mixtures, each composed of up to 10 selected drugs. One drug, doxorubicin, was a standard chemotherapy agent and the others were well-known natural compounds including curcumin, EGCG, quercetin, and rhein.
Predictions of synergism/antagonism were made for all possible fixed-ratio mixtures, cytotoxicities of the 10 best-scoring mixtures were tested, and drug interactions were assessed. Predicted and observed responses were highly correlated (r² = 0.83). Results suggested that some mixtures allowed up to an 11-fold reduction of doxorubicin concentrations without sacrificing efficacy. Taken together, the models developed in this project present a general approach to rational design of mixtures during preclinical drug development.
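The ordinary least squares step attributed above to the Median-Effect method, and the Loewe index used to label a mixture as synergistic, additive, or antagonistic, can be illustrated with a minimal sketch. It assumes the commonly cited median-effect form fa / (1 - fa) = (D / Dm)^m, where fa is the fraction of cells affected, D the dose, Dm the median-effect dose, and m the slope. The function names are hypothetical, and this is not the MixLow method, which the abstract says replaces the least squares fit with a nonlinear mixed-effects model.

```python
import math


def fit_median_effect(doses, fractions_affected):
    """Fit log(fa / (1 - fa)) = m * log(D) - m * log(Dm) by ordinary least squares.

    Returns (Dm, m). Assumes every dose is > 0 and every fraction is in (0, 1).
    """
    xs = [math.log(d) for d in doses]
    ys = [math.log(fa / (1.0 - fa)) for fa in fractions_affected]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx                      # slope of the linearized curve
    intercept = y_mean - m * x_mean    # equals -m * log(Dm)
    dm = math.exp(-intercept / m)
    return dm, m


def dose_for_effect(fa, dm, m):
    """Dose of a single drug needed to affect a fraction fa of cells."""
    return dm * (fa / (1.0 - fa)) ** (1.0 / m)


def loewe_index(fa, mixture_doses, single_drug_params):
    """Loewe index at effect level fa.

    mixture_doses: dose of each drug present in the mixture that produces fa.
    single_drug_params: one (Dm, m) pair per drug, fitted on single-drug data.
    A value below 1 suggests synergism, near 1 additivity, above 1 antagonism.
    """
    return sum(
        d / dose_for_effect(fa, dm, m)
        for d, (dm, m) in zip(mixture_doses, single_drug_params)
    )
```

In this sketch each drug's curve is fitted separately and the index is a point estimate only. The confidence-interval coverage discussed in the abstract would need an interval procedure that is not shown here.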
Abstract:
Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed. Also, a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can deal with a large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities.
Random Forests can deal with large-scale data sets without rigorous data preprocessing. It has a robust variable importance ranking measure. We propose a novel variable selection method in the context of Random Forests that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors for complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments.
When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. The results suggest that Random Forests can be recommended for such high-dimensional data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations.
We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
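The noise-level cut-off described above can be sketched as follows. The abstract does not say how the noise level is estimated, so this sketch assumes that some predictors are known to be pure noise (for example, randomly permuted copies of real variables) and uses their largest importance score as the cut-off. The function name and the input format are hypothetical. The importance scores would come from a fitted forest, for instance its mean decrease in accuracy.

```python
def select_by_noise_cutoff(importances, noise_predictors):
    """Keep predictors whose importance exceeds that of every noise predictor.

    importances: dict mapping predictor name -> importance score.
    noise_predictors: names of predictors assumed to carry no signal
        (an assumption of this sketch, not a detail from the abstract).
    Returns (selected names sorted by decreasing importance, cut-off value).
    """
    noise = set(noise_predictors)
    # The noise level is taken to be the largest score a pure-noise variable reached.
    cutoff = max(importances[name] for name in noise)
    selected = [
        name
        for name, score in importances.items()
        if name not in noise and score > cutoff
    ]
    selected.sort(key=lambda name: importances[name], reverse=True)
    return selected, cutoff
```

The returned cut-off could then be raised or lowered by hand, in the spirit of the adjustment based on the synthetic positive control experiments that the abstract mentions.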
Abstract:
It is well accepted that tumorigenesis is a multi-step process involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays as well as next generation sequencing assays interrogating somatic mutation, insertion, deletion, translocation and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach to identify biomarkers predictive of therapy response or survival by integrating multiple assays including gene expression, methylation and copy number data through penalized regression. To identify key cancer-specific genes accounting for multiple mechanisms of regulation, we have developed the integIRTy software that provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory.
To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next generation RNA-seq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with dichotomized gene expression measurements based on their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the way for more accurate and interpretable cancer classification by integrating signals from multiple sources.
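Dichotomizing a bimodally expressed gene, as mentioned above, can be illustrated with a simple stand-in. The sketch below is not the SIBER algorithm, whose details the abstract does not give. It assumes a one-dimensional two-means split: the threshold is placed midway between the means of the low and high expression groups, and each sample is coded 0 or 1. The function names are hypothetical.

```python
def two_means_threshold(values, max_iterations=100):
    """Find a cut point between two expression modes with 1-D two-means clustering."""
    low_center, high_center = min(values), max(values)
    for _ in range(max_iterations):
        cut = (low_center + high_center) / 2.0
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        if not low or not high:
            break  # all samples fell on one side; keep the current centers
        new_low = sum(low) / len(low)
        new_high = sum(high) / len(high)
        if new_low == low_center and new_high == high_center:
            break  # centers stopped moving
        low_center, high_center = new_low, new_high
    return (low_center + high_center) / 2.0


def dichotomize(values, threshold):
    """Code each expression value as 1 (high mode) or 0 (low mode)."""
    return [1 if v > threshold else 0 for v in values]
```

The resulting 0/1 vectors could serve as inputs to a classifier in place of the continuous measurements, which is the kind of discretized signal the abstract refers to.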