8 results for Time-varying variable selection
in DigitalCommons@The Texas Medical Center
Abstract:
Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can deal with a large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. Random Forests can deal with large-scale data sets without rigorous data preprocessing, and it has a robust variable importance ranking measure. We propose a novel variable selection method in the context of Random Forests that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors for complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. This study suggests that Random Forests is well suited to such high-dimensional data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations. We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
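The noise-level cut-off described in this abstract can be illustrated with a minimal sketch. It assumes, as one common convention and not necessarily the rule used in the thesis, that the noise level is estimated as the magnitude of the most negative importance score, since an importance below zero can only arise by chance. The function name and the importance values are hypothetical.

```python
def select_above_noise(importances):
    """Return (noise_level, selected) for a dict of variable -> importance.

    Assumption: the noise level is the magnitude of the most negative
    importance score; the thesis may define its cut-off differently.
    """
    negatives = [score for score in importances.values() if score < 0]
    noise_level = abs(min(negatives)) if negatives else 0.0
    # Keep variables strictly above the noise level, most important first.
    selected = sorted(
        (name for name, score in importances.items() if score > noise_level),
        key=lambda name: importances[name],
        reverse=True,
    )
    return noise_level, selected


# Hypothetical mean-decrease-in-accuracy scores, for illustration only.
scores = {"smoking": 0.042, "age": 0.015, "snp_17": 0.003,
          "snp_42": -0.004, "snp_99": -0.001}
noise_level, selected = select_above_noise(scores)
print(noise_level, selected)
```

In this toy input, `snp_17` has a positive score but falls below the estimated noise level, so it is not selected. Raising or lowering the cut-off after a positive control experiment would amount to replacing `noise_level` with an adjusted value.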
Abstract:
Health-related quality of life (HRQOL) is an important measure of the effects of chronic liver disease in affected patients that helps guide interventions to improve well-being. However, the relationship between HRQOL and survival in liver transplant candidates remains unclear. We examined whether the Physical Component Summary (PCS) and Mental Component Summary (MCS) scores from the Short Form 36 (SF-36) Health Survey were associated with survival in liver transplant candidates. We administered the SF-36 questionnaire (version 2.0) to patients in the Pulmonary Vascular Complications of Liver Disease study, a multicenter prospective cohort of patients evaluated for liver transplantation at 7 academic centers in the United States between 2003 and 2006. Cox proportional hazards models were used with death as the primary outcome and adjustment for liver transplantation as a time-varying covariate. The mean age of the 252 participants was 54 ± 10 years, 64% were male, and 94% were white. During the 422 person-years of follow-up, 147 patients (58%) were listed, 75 patients (30%) underwent transplantation, 49 patients (19%) died, and 3 patients were lost to follow-up. Lower baseline PCS scores were associated with an increased mortality rate despite adjustments for age, gender, Model for End-Stage Liver Disease score, and liver transplantation (P for the trend = 0.0001). The MCS score was not associated with mortality (P for the trend = 0.53). In conclusion, PCS significantly predicts survival in liver transplant candidates, and interventions directed toward improving physical status may be helpful in improving outcomes in liver transplant candidates.
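Treating transplantation as a time-varying covariate usually means splitting each patient's follow-up into intervals over which the covariate is constant. The sketch below shows that counting-process layout under stated assumptions: times are measured in years from enrollment, transplantation happens at most once, and the function name and row format are hypothetical rather than taken from the study.

```python
def split_at_transplant(follow_up, died, transplant_time=None):
    """Return (start, stop, transplanted, event) rows for one patient.

    Assumes 0 < transplant_time when a transplant occurred. The death
    indicator is attached only to the final interval.
    """
    if transplant_time is None or transplant_time >= follow_up:
        # Never transplanted during follow-up: one untransplanted interval.
        return [(0.0, follow_up, 0, int(died))]
    return [
        (0.0, transplant_time, 0, 0),                # waiting, not yet transplanted
        (transplant_time, follow_up, 1, int(died)),  # after transplantation
    ]


# Hypothetical patient: transplanted at 2 years, died at 5 years.
print(split_at_transplant(5.0, True, 2.0))
```

Rows in this form can be passed to any Cox implementation that accepts (start, stop] intervals, so that time before transplantation is not misattributed to the transplanted state.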
Abstract:
Recent treatment planning studies have demonstrated the use of physiologic images in radiation therapy treatment planning to identify regions for functional avoidance. This image-guided radiotherapy (IGRT) strategy may reduce the injury and/or functional loss following thoracic radiotherapy. 4D computed tomography (CT), developed for radiotherapy treatment planning, is a relatively new imaging technique that allows the acquisition of a time-varying sequence of 3D CT images of the patient's lungs through the respiratory cycle. Guerrero et al. developed a method to calculate ventilation imaging from 4D CT, which is potentially better suited and more broadly available for IGRT than the current standard imaging methods. The key to extracting functional information from 4D CT is the construction of a volumetric deformation field that accurately tracks the motion of the patient's lungs during the respiratory cycle. The spatial accuracy of the displacement field directly impacts the ventilation images; higher spatial registration accuracy will result in fewer ventilation image artifacts and physiologic inaccuracies. Presently, a consistent methodology for evaluating the spatial accuracy of the deformable image registration (DIR) transformation is lacking. Evaluation of the 4D CT-derived ventilation images will be performed to assess correlation with global measurements of lung ventilation, as well as regional correlation of the distribution of ventilation with the current clinical standard, SPECT. This requires a novel framework for both the detailed assessment of an image registration algorithm's performance characteristics and quality assurance for spatial accuracy assessment in routine application. Finally, we hypothesize that hypo-ventilated regions, identified on 4D CT ventilation images, will correlate with hypo-perfused regions in lung cancer patients who have obstructive lesions.
A prospective imaging trial of patients with locally advanced non-small-cell lung cancer will allow this hypothesis to be tested. These advances are intended to contribute to the validation and clinical implementation of CT-based ventilation imaging in prospective clinical trials, in which the impact of this imaging method on patient outcomes may be tested.
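One widely used way to quantify the spatial accuracy of a registration is the mean distance between landmarks mapped by the deformation field and their manually identified positions in the target image. The sketch below illustrates that idea only; the abstract does not say which metric its framework uses, and the function name and coordinates (assumed to be in millimetres) are hypothetical.

```python
import math


def mean_registration_error(mapped_points, reference_points):
    """Mean Euclidean distance between mapped and reference landmarks."""
    if not mapped_points or len(mapped_points) != len(reference_points):
        raise ValueError("need two non-empty landmark lists of equal length")
    errors = [math.dist(p, q) for p, q in zip(mapped_points, reference_points)]
    return sum(errors) / len(errors)


# Two hypothetical landmarks: one mapped exactly, one off by 1 mm.
mapped = [(0.0, 0.0, 0.0), (1.0, 1.0, 2.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
print(mean_registration_error(mapped, reference))
```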
Abstract:
The purpose of this research and development project was to develop a method, a design, and a prototype for gathering, managing, and presenting data about occupational injuries. State-of-the-art systems analysis and design methodologies were applied to a long-standing problem in the field of occupational safety and health: processing workplace injury data into information for safety and health program management as well as preliminary research about accident etiologies. A top-down planning and bottom-up implementation approach was used to design an occupational injury management information system. A description of a managerial control system and a comprehensive system to integrate safety and health program management was provided. The project showed that current management information systems (MIS) theory and methods could be applied successfully to the problems of employee injury surveillance and control program performance evaluation. The model developed in the first section was applied at The University of Texas Health Science Center at Houston (UTHSCH). The system in current use at the UTHSCH was described and evaluated, and a prototype was developed for the UTHSCH. The prototype incorporated procedures for collecting, storing, and retrieving records of injuries and the procedures necessary to prepare reports, analyses, and graphics for management in the Health Science Center. Examples of reports, analyses, and graphics presenting UTHSCH and computer-generated data were included. It was concluded that a pilot test of this MIS should be implemented and evaluated at the UTHSCH and other settings. Further research and development efforts for the total safety and health management information systems, control systems, component systems, and variable selection should be pursued. Finally, integration of the safety and health program MIS into the comprehensive or executive MIS was recommended.
Abstract:
Trastuzumab is a humanized monoclonal antibody developed specifically for breast cancer patients who over-express HER2-neu. Although highly effective and well tolerated, it has been reported to be associated with Congestive Heart Failure (CHF) in clinical trial settings (up to 27%). This leaves a gap: the Trastuzumab-related CHF rate in the general population, especially among older breast cancer patients receiving long-term Trastuzumab treatment, remains unknown. This thesis examined the rates and risk factors associated with Trastuzumab-related CHF in a large population of older breast cancer patients. A retrospective cohort study using the existing Surveillance, Epidemiology and End Results (SEER) and Medicare linked de-identified database was performed. Breast cancer patients ≥ 66 years old, stage I-IV, diagnosed in 1998-2007, with full Medicare coverage and no HMO enrollment within 1 year before and after the month of first diagnosis, who received their first chemotherapy no earlier than 30 days prior to diagnosis, were selected as the study cohort. The primary outcome of this study was a diagnosis of CHF after starting chemotherapy, with no CHF claims on or before the cancer diagnosis date. ICD-9 and HCPCS codes were used to pool the claims for Trastuzumab use, chemotherapy, comorbidities, and CHF. Statistical analyses, including comparison of characteristics, Kaplan-Meier survival estimates of CHF rates for long-term follow-up, and a multivariable Cox regression model using Trastuzumab as a time-dependent variable, were performed. Of the 17,684 patients in the selected cohort, 2,037 (12%) received Trastuzumab. Among them, 35% (714 of 2,037) were diagnosed with CHF, compared with 31% (4,784 of 15,647) of other chemotherapy recipients (p<.0001). After 10 years of follow-up, 65% of Trastuzumab users had developed CHF, compared with 47% of their counterparts.
After adjusting for patient demographic, tumor, and clinical characteristics, older breast cancer patients who used Trastuzumab showed a significantly higher risk of developing CHF than other chemotherapy recipients (HR 1.69, 95% CI 1.54 - 1.85). This risk increased with age (p-value < .0001). Among Trastuzumab users, these covariates also significantly increased the risk of CHF: older age, stage IV, Non-Hispanic black race, unmarried status, comorbidities, Anthracycline use, Taxane use, and lower educational level. It is concluded that Trastuzumab users among older breast cancer patients had a 69% higher risk of developing CHF than non-Trastuzumab users, much higher than the 27% increase reported in younger clinical trial patients. Older age, Non-Hispanic black race, unmarried status, comorbidity, and combined use with Anthracycline or Taxane also significantly increased the risk of CHF development in older patients treated with Trastuzumab.
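The percentages quoted in this abstract can be checked directly from the reported counts. The short calculation below uses only numbers stated above; the variable names are illustrative. It also shows how the hazard ratio of 1.69 translates into the "69% higher risk" wording.

```python
# Counts reported in the abstract.
cohort = 17684
trastuzumab_users = 2037
other_recipients = 15647
chf_trastuzumab = 714
chf_other = 4784

# The two treatment groups should add up to the whole cohort.
assert trastuzumab_users + other_recipients == cohort


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


share_treated = percent(trastuzumab_users, cohort)           # share receiving Trastuzumab
chf_rate_treated = percent(chf_trastuzumab, trastuzumab_users)
chf_rate_other = percent(chf_other, other_recipients)

# Crude (unadjusted) ratio of the two CHF proportions.
crude_ratio = (chf_trastuzumab / trastuzumab_users) / (chf_other / other_recipients)

# A hazard ratio of 1.69 corresponds to a 69% higher hazard.
excess_hazard_percent = round((1.69 - 1) * 100)

print(share_treated, chf_rate_treated, chf_rate_other,
      round(crude_ratio, 2), excess_hazard_percent)
```

The crude ratio of proportions is much smaller than the adjusted hazard ratio. The two are not directly comparable, because the Cox model accounts for follow-up time, the timing of Trastuzumab exposure, and the adjustment covariates.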
Abstract:
Complex diseases, such as cancer, are caused by various genetic and environmental factors and their interactions. Joint analysis of these factors and their interactions would increase the power to detect risk factors but is statistically challenging. Bayesian generalized linear models, which use Student-t prior distributions on coefficients, are a novel method for simultaneously analyzing genetic factors, environmental factors, and interactions. I performed simulation studies using three different disease models and demonstrated that the variable selection performance of Bayesian generalized linear models is comparable to that of Bayesian stochastic search variable selection, an improved method for variable selection when compared to standard methods. I further evaluated the variable selection performance of Bayesian generalized linear models using different numbers of candidate covariates and different sample sizes, and provided a guideline for the sample size required to achieve high power of variable selection using Bayesian generalized linear models, considering different scales of the number of candidate covariates. Polymorphisms in folate metabolism genes and nutritional factors have been previously associated with lung cancer risk. In this study, I simultaneously analyzed 115 tag SNPs in folate metabolism genes, 14 nutritional factors, and all possible genetic-nutritional interactions from 1239 lung cancer cases and 1692 controls using Bayesian generalized linear models stratified by never, former, and current smoking status. SNPs in MTRR were significantly associated with lung cancer risk across never, former, and current smokers. In never smokers, three SNPs in TYMS and three gene-nutrient interactions, including an interaction between SHMT1 and vitamin B12, an interaction between MTRR and total fat intake, and an interaction between MTR and alcohol use, were also identified as associated with lung cancer risk. These lung cancer risk factors are worthy of further investigation.
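To give a sense of the scale of the joint analysis, the number of candidate terms can be counted from the figures in this abstract. The sketch assumes that "all possible genetic-nutritional interactions" means one term per SNP-nutrient pair; the variable names are illustrative.

```python
n_snps = 115        # tag SNPs in folate metabolism genes
n_nutrients = 14    # nutritional factors
n_cases = 1239
n_controls = 1692

# Assumption: one interaction term for every SNP-nutrient pair.
n_interactions = n_snps * n_nutrients
n_terms = n_snps + n_nutrients + n_interactions
n_subjects = n_cases + n_controls

print(n_interactions, n_terms, n_subjects)
```

Under that assumption the model considers well over a thousand candidate terms for fewer than three thousand subjects, before stratifying by smoking status, which is the setting in which shrinkage priors on the coefficients are most useful.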
Abstract:
The development of targeted therapy involves many challenges. Our study addresses some of the key issues involved in biomarker identification and clinical trial design. We propose two biomarker selection methods and then apply them in two different clinical trial designs for targeted therapy development. In particular, in Chapter 2 we propose a Bayesian two-step lasso procedure for biomarker selection in the proportional hazards model. In the first step of this strategy, we use the Bayesian group lasso to identify the important marker groups, wherein each group contains the main effect of a single marker and its interactions with treatments. In the second step, we zoom in to select each individual marker and the interactions between markers and treatments in order to identify prognostic or predictive markers using the Bayesian adaptive lasso. In Chapter 3, we propose a Bayesian two-stage adaptive design for targeted therapy development while implementing the variable selection method given in Chapter 2. In Chapter 4, we propose an alternative frequentist adaptive randomization strategy for situations where a large number of biomarkers need to be incorporated in the study design. We also propose a new adaptive randomization rule, which takes into account the variation associated with the point estimates of survival times. In all of our designs, we seek to identify the key markers that are either prognostic or predictive with respect to treatment. We will use extensive simulations to evaluate the operating characteristics of our methods.
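The grouping used in the first step of the two-step procedure can be made concrete with a small sketch. It only builds the groups, each holding a marker's main effect and its interactions with every treatment; it does not fit the Bayesian group lasso. The function name, the `marker:treatment` naming convention, and the example labels are hypothetical.

```python
def build_marker_groups(markers, treatments):
    """Map each marker to its group: main effect plus treatment interactions."""
    return {
        marker: [marker] + [f"{marker}:{treatment}" for treatment in treatments]
        for marker in markers
    }


# Hypothetical markers and treatment arms.
groups = build_marker_groups(["m1", "m2"], ["trtA", "trtB"])
print(groups)
```

The first step would keep or drop whole groups. The second step would then select individual terms within the retained groups, separating prognostic markers (main effects) from predictive ones (marker-by-treatment interactions).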
Abstract:
Complex diseases such as cancer result from multiple genetic changes and environmental exposures. Due to the rapid development of genotyping and sequencing technologies, we are now able to more accurately assess causal effects of many genetic and environmental factors. Genome-wide association studies have been able to localize many causal genetic variants predisposing to certain diseases. However, these studies explain only a small portion of the heritability of diseases. More advanced statistical models are urgently needed to identify and characterize additional genetic and environmental factors and their interactions, which will enable us to better understand the causes of complex diseases. In the past decade, thanks to increasing computational capabilities and novel statistical developments, Bayesian methods have been widely applied in genetics and genomics research and have demonstrated superiority over some conventional approaches in certain research areas. Gene-environment and gene-gene interaction studies are among the areas where Bayesian methods may fully exert their advantages. This dissertation focuses on developing new Bayesian statistical methods for data analysis with complex gene-environment and gene-gene interactions, as well as extending some existing methods for gene-environment interactions to other related areas. It includes three sections: (1) deriving the Bayesian variable selection framework for hierarchical gene-environment and gene-gene interactions; (2) developing the Bayesian Natural and Orthogonal Interaction (NOIA) models for gene-environment interactions; and (3) extending the applications of two Bayesian statistical methods, which were developed for gene-environment interaction studies, to other related types of studies such as adaptive borrowing of historical data.
We propose a Bayesian hierarchical mixture model framework that allows us to investigate genetic and environmental effects, gene-by-gene interactions (epistasis), and gene-by-environment interactions in the same model. It is well known that, in many practical situations, there exists a natural hierarchical structure between the main effects and interactions in the linear model. Here we propose a model that incorporates this hierarchical structure into the Bayesian mixture model, such that irrelevant interaction effects can be removed more efficiently, resulting in more robust, parsimonious, and powerful models. We evaluate both the 'strong hierarchical' and 'weak hierarchical' models, which specify, respectively, that both or at least one of the main effects of the interacting factors must be present for the interaction to be included in the model. The extensive simulation results show that the proposed strong and weak hierarchical mixture models control the proportion of false positive discoveries and yield a powerful approach to identifying the predisposing main effects and interactions in studies with complex gene-environment and gene-gene interactions. We also compare these two models with the 'independent' model, which does not impose this hierarchical constraint, and observe their superior performance in most of the considered situations. The proposed models are applied to real data on gene and environment interactions from lung cancer and cutaneous melanoma case-control studies. The Bayesian statistical models have the advantage of allowing useful prior information to be incorporated in the modeling process. Moreover, the Bayesian mixture model outperforms the multivariate logistic model in parameter estimation and variable selection in most cases.
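The difference between the strong, weak, and independent models comes down to a rule for when an interaction term may enter the model. The sketch below states that rule as a plain function. It illustrates the constraint only, not the Bayesian mixture model itself, and the function name and rule labels are hypothetical.

```python
def interaction_allowed(main_a_included, main_b_included, rule):
    """Decide whether the A-by-B interaction may enter the model."""
    if rule == "strong":
        # Both main effects must be present.
        return main_a_included and main_b_included
    if rule == "weak":
        # At least one main effect must be present.
        return main_a_included or main_b_included
    if rule == "independent":
        # No hierarchical constraint.
        return True
    raise ValueError(f"unknown rule: {rule}")


# Only one of the two main effects is in the model.
for rule in ("strong", "weak", "independent"):
    print(rule, interaction_allowed(True, False, rule))
```

When only one main effect is present, the strong rule excludes the interaction while the weak and independent rules admit it. The stricter rules shrink the space of candidate interactions that the model has to consider.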
Our proposed models impose hierarchical constraints that further improve the Bayesian mixture model by reducing the proportion of false positive findings among the identified interactions while successfully identifying the reported associations. This is practically appealing for studies that investigate causal factors among a moderate number of candidate genetic and environmental factors along with a relatively large number of interactions. The natural and orthogonal interaction (NOIA) models of genetic effects were previously developed to provide an analysis framework in which the estimates of effects for a quantitative trait are statistically orthogonal regardless of whether Hardy-Weinberg Equilibrium (HWE) holds within loci. Ma et al. (2012) recently developed a NOIA model for gene-environment interaction studies and showed the advantages of using the model for detecting the true main effects and interactions, compared with the usual functional model. In this project, we propose a novel Bayesian statistical model that combines the Bayesian hierarchical mixture model with the NOIA statistical model and the usual functional model. The proposed Bayesian NOIA model demonstrates more power at detecting the non-null effects, with higher marginal posterior probabilities. We also review two Bayesian statistical models (the Bayesian empirical shrinkage-type estimator and Bayesian model averaging) that were developed for gene-environment interaction studies. Inspired by these Bayesian models, we develop two novel statistical methods that can handle related problems such as borrowing data from historical studies. The proposed methods are analogous to the methods for gene-environment interactions in that they balance statistical efficiency and bias in a unified model.
Through extensive simulation studies, we compare the operating characteristics of the proposed models with those of existing models, including the hierarchical meta-analysis model. The results show that the proposed approaches adaptively borrow the historical data in a data-driven way. These novel models may have a broad range of statistical applications in both genetic/genomic and clinical studies.