This study analyzed the relationship between fasting blood glucose (FBG) and 8-year mortality in the Hypertension Detection Follow-up Program (HDFP) population. FBG was examined both as a continuous variable and by specified FBG strata: Normal (FBG 60–100 mg/dL), Impaired (FBG ≥100 and ≤125 mg/dL), and Diabetic (FBG >125 mg/dL or pre-existing diabetes) subgroups. The relationship between type 2 diabetes and all-cause mortality was also examined. This thesis described and compared the characteristics of the FBG strata defined by recognized glucose cut-points, described the mortality rates in the strata using Kaplan-Meier mortality curves, and compared the mortality risk across strata using Cox regression analysis. Overall, mortality was significantly greater among Referred Care (RC) participants than among Stepped Care (SC) participants (HR = 1.17; 95% CI, 1.052–1.309; p = 0.004), as reported by the HDFP investigators in 1979. Compared with SC participants, the RC mortality rate was significantly higher in the Normal FBG group (HR = 1.18; 95% CI, 1.029–1.363; p = 0.019) and in the Impaired FBG group (HR = 1.34; 95% CI, 1.036–1.734; p = 0.026). However, in the Diabetic group, 8-year mortality did not differ significantly between the RC and SC groups after adjusting for race, gender, age, and smoking status (HR = 1.03; 95% CI, 0.816–1.303; p = 0.798). This latter finding is possibly due to the lack of a difference in hypertension treatment between Diabetic participants in the RC and SC groups. The largest difference in mortality between RC and SC was in the Impaired subgroup, suggesting that hypertensive patients with FBG between 100 and 125 mg/dL would benefit from aggressive antihypertensive therapy.
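The Kaplan-Meier curves mentioned in the abstract above rest on a product-limit calculation that can be sketched in a few lines. The following is a minimal illustration only, not the thesis's analysis code; the function name `kaplan_meier` and the toy follow-up data are assumptions and do not come from the HDFP.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up time for each subject
    events : 1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # deaths and censorings both leave the risk set
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the conditional survival at t.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data (years of follow-up), not HDFP data.
times = [1, 2, 2, 3, 5, 8, 8, 8]
events = [1, 1, 0, 1, 0, 0, 0, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Computing one such curve per FBG stratum and treatment group would give the kind of side-by-side comparison the abstract describes; the Cox regression step is a separate model and is not sketched here.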


In biomedical studies, data are typically collected under matched (paired) or unmatched designs. Recently, many researchers have turned to meta-analysis to gain a better understanding of a medical treatment from several clinical data sets. A hybrid design, which combines the two data structures, raises fundamental questions about statistical methods and poses challenges for statistical inference. The appropriate methods depend on the underlying distribution. If the outcomes are normally distributed, the classic paired and two-independent-sample t-tests can be used for the matched and unmatched cases, respectively. If not, the Wilcoxon signed-rank and rank-sum tests can be applied to each case. To assess an overall treatment effect in a hybrid design, we can apply the inverse-variance weighting method used in meta-analysis. In the nonparametric case, one could use a test statistic that combines the two Wilcoxon test statistics; however, these two statistics are not on the same scale. We propose the Hybrid Test Statistic, based on the Hodges-Lehmann estimates of the treatment effects, which are medians on the same scale. For comparison, we use the classic meta-analysis t-test statistic, which combines the treatment-effect estimates from the two t-test statistics. Theoretically, the relative efficiency of two unbiased estimators of a parameter is the ratio of their variances. Using the concept of asymptotic relative efficiency (ARE) developed by Pitman, we derive the ARE of the hybrid test statistic relative to the classic meta-analysis t-test statistic using the Hodges-Lehmann estimators associated with the two test statistics. In several simulation studies, we calculate the empirical type I error rate and power of the test statistics. The proposed statistic would provide an effective tool for evaluating and understanding treatment effects in public health studies as well as clinical trials.
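The two building blocks named in the abstract above, Hodges-Lehmann estimates and inverse-variance weighting, can be illustrated with a short sketch. This is not the dissertation's Hybrid Test Statistic, whose exact form is not given here; the function names, the toy data, and the placeholder variances are all assumptions made for illustration.

```python
from statistics import median


def hl_paired(diffs):
    """One-sample Hodges-Lehmann estimate: median of Walsh averages (i <= j)."""
    n = len(diffs)
    walsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


def hl_two_sample(treated, control):
    """Two-sample Hodges-Lehmann estimate: median of all pairwise differences."""
    return median([t - c for t in treated for c in control])


def inverse_variance_pool(estimates, variances):
    """Pool estimates with weights 1/variance; return (pooled estimate, its variance)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, 1.0 / sum(weights)


# Hypothetical data: within-pair differences (matched part) and two groups (unmatched part).
paired_est = hl_paired([1.0, 2.0, 4.0])
unpaired_est = hl_two_sample([5.0, 7.0, 9.0], [3.0, 4.0])

# Placeholder variances; in practice these would be estimated from the data.
pooled, pooled_var = inverse_variance_pool([paired_est, unpaired_est], [0.5, 1.0])
print(paired_est, unpaired_est, pooled, pooled_var)
```

Because both Hodges-Lehmann estimates are location shifts measured in the outcome's own units, they can be averaged directly, which is the scale argument the abstract makes against combining the raw Wilcoxon statistics.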


Scholars have found that socioeconomic status is one of the key factors influencing early-stage lung cancer incidence rates in a variety of regions. This thesis examined the association between median household income and lung cancer incidence rates in Texas counties. Data for 254 Texas counties, comprising lung cancer incidence rates from 2004 to 2008 and median household incomes in 2006, were collected from the National Cancer Institute Surveillance System. A simple linear model and spatial linear models with two structures, the simultaneous autoregressive (SAR) structure and the conditional autoregressive (CAR) structure, were used to link median household income and lung cancer incidence rates in Texas. The residuals of the spatial linear models were analyzed with Moran's I and Geary's C statistics, and the results were used to detect clusters of similar lung cancer incidence rates and disease patterns in Texas.
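Moran's I, one of the two residual diagnostics named in the abstract above, measures how similar neighboring values are. The sketch below computes the global statistic for a small hypothetical example; the function name `morans_i`, the four-region chain, and the values are assumptions, not Texas county data.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij.

    I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2,
    where S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


# Hypothetical example: four regions in a row, each adjacent to its neighbors.
residuals = [1.0, 2.0, 3.0, 4.0]
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_stat = morans_i(residuals, adjacency)
print(round(i_stat, 4))
```

A value well above the no-autocorrelation expectation of -1/(n-1) indicates that neighboring regions carry similar residuals, which is the situation the SAR and CAR structures are meant to absorb.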


This study proposed a novel statistical method that jointly models multiple outcomes and the missing-data process using item response theory. The method follows the "intent-to-treat" principle in clinical trials and accounts for the correlation between the outcomes and the missing-data process. It may provide a good solution for studies of chronic mental disorders. The simulation study demonstrated that if the true model is the proposed model with moderate or strong correlation, ignoring the within correlation may lead to overestimation of the treatment effect and to a type I error rate above the specified level. Even if the within correlation is small, the proposed model performs as well as the naïve response model. Thus, the proposed model is robust across correlation settings when the data are generated by the proposed model.


Background. End-stage liver disease (ESLD) is an irreversible condition that leads to the imminent complete failure of the liver. Orthotopic liver transplantation (OLT) is well accepted as the best curative option for patients with ESLD. Despite progress in liver transplantation, the major limitation today is the discrepancy between donor supply and organ demand. In an effort to alleviate this situation, livers from donors whose gender or race differs from that of the recipient are being used. However, the simultaneous impact of donor-recipient gender and race mismatching on patient survival after OLT remains unclear and relatively challenging to surgeons. Objective. To examine the impact of donor and recipient gender and race mismatching on patient survival after OLT using the United Network for Organ Sharing (UNOS) database. Methods. A total of 40,644 recipients who underwent OLT between 2002 and 2011 were included. Kaplan-Meier survival curves and log-rank tests were used to compare survival rates among different donor-recipient gender and race combinations. Univariate Cox regression analysis was used to assess the association of donor-recipient gender and race mismatching with patient survival after OLT. Multivariable Cox regression analysis was used to model the simultaneous impact of donor-recipient gender and race mismatching on patient survival after OLT, adjusting for a list of other risk factors. Multivariable Cox regression analysis stratified by recipient hepatitis C virus (HCV) status was also conducted to identify the variables that were differentially associated with patient survival in HCV+ and HCV− recipients. Results. In the univariate analysis, compared with male donors to male recipients, female donors to male recipients had a higher risk of patient mortality (HR, 1.122; 95% CI, 1.065–1.183), while in the multivariable analysis, male donors to female recipients had an increased mortality rate (adjusted HR, 1.114; 95% CI, 1.048–1.184). Compared with white donors to white recipients, Hispanic donors to black recipients had a higher risk of patient mortality (HR, 1.527; 95% CI, 1.293–1.804) in the univariate analysis, and a similar result (adjusted HR, 1.553; 95% CI, 1.314–1.836) was noted in the multivariable analysis. After stratification by recipient HCV status in the multivariable analysis, HCV+ mismatched recipients appeared to be at greater risk of mortality than HCV− mismatched recipients. Female donors to female HCV− recipients (adjusted HR, 0.843; 95% CI, 0.769–0.923) and Hispanic HCV+ recipients receiving livers from black donors (adjusted HR, 0.758; 95% CI, 0.598–0.960) showed a protective effect on patient survival after OLT. Conclusion. Donor-recipient gender and race mismatching adversely affect patient survival after OLT, both independently and after adjustment for other risk factors. Female recipient HCV status is an important effect modifier in the association between donor-recipient gender combination and patient survival.


The infant mortality rate (IMR) is considered one of the most important indices of a country's well-being. Countries around the world and health organizations such as the World Health Organization are dedicating resources, knowledge, and energy to reducing infant mortality rates. The well-known Millennium Development Goal 4 (MDG 4), which aims to achieve a two-thirds reduction in the under-five mortality rate between 1990 and 2015, is an example of this commitment. In this study, our goal is to model the trends in IMR from the 1950s to the 2010s for selected countries. We would like to know how the IMR changes over time and how it differs across countries. IMR data collected over time form a time series, and the repeated observations in an IMR time series are not statistically independent. In modeling the trend of IMR, it is therefore necessary to account for these correlations. We proposed using the generalized least squares method in a general linear model setting to handle the variance-covariance structure in our model. To estimate the variance-covariance matrix, we drew on time-series models, especially autoregressive and moving average models. Furthermore, we compared the results from the general linear model with a correlation structure to those from the ordinary least squares method, which ignores the correlation structure, to check how much the estimates change.
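One concrete instance of the generalized least squares idea in the abstract above is a linear trend with AR(1) errors, which can be fitted by ordinary least squares after the Prais-Winsten transformation. The sketch below assumes the autocorrelation `rho` is already known (in practice it would be estimated, and the study may have used other ARMA structures); the function names and the toy series are hypothetical and are not IMR data.

```python
import math


def ols_two_columns(x1, x2, y):
    """Least squares for y = b1*x1 + b2*x2 using the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


def prais_winsten(t, y, rho):
    """GLS fit of y = a + b*t with AR(1) errors and a known rho.

    The first observation is scaled by sqrt(1 - rho^2); later ones are
    quasi-differenced. OLS on the transformed data equals GLS.
    """
    k = math.sqrt(1.0 - rho * rho)
    const = [k] + [1.0 - rho] * (len(y) - 1)
    trend = [k * t[0]] + [t[i] - rho * t[i - 1] for i in range(1, len(t))]
    resp = [k * y[0]] + [y[i] - rho * y[i - 1] for i in range(1, len(y))]
    return ols_two_columns(const, trend, resp)


# Hypothetical, noise-free series: a rate falling by 2 units per period from 100.
years = [0, 1, 2, 3, 4]
rate = [100.0, 98.0, 96.0, 94.0, 92.0]
intercept, slope = prais_winsten(years, rate, rho=0.5)
print(intercept, slope)
```

Setting `rho=0` reduces the transformation to plain OLS, so running the same function with both values gives the kind of OLS-versus-GLS comparison the abstract describes.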


Life expectancy has consistently increased over the last 150 years due to improvements in nutrition, medicine, and public health. Several studies found that in many developed countries, life expectancy continued to rise following a nearly linear trend, contrary to the common belief that the rate of improvement would decelerate and follow an S-shaped curve. Using samples of countries that exhibited a wide range of economic development levels, we explored the change in life expectancy over time by employing both nonlinear and linear models. We then examined whether the estimates from the linear models differed significantly when an autocorrelated error structure was assumed. When the data did not have a sigmoidal shape, nonlinear growth models sometimes failed to provide meaningful parameter estimates. The inflection point and asymptotes built into the growth models made them inflexible for life expectancy data. In the linear models, there was no significant difference in the life expectancy growth rate or future estimates between ordinary least squares (OLS) and generalized least squares (GLS). However, the GLS model was more robust because the data involved time-series variables and the residuals were positively correlated.


There are two practical challenges in the conduct of phase I clinical trials: lack of transparency to physicians and late-onset toxicity. In my dissertation, Bayesian approaches are used to address these two problems in clinical trial design. The proposed simple optimal designs cast the dose-finding problem as a decision-making process for dose escalation and de-escalation. The proposed designs minimize the rate of incorrect decisions in finding the maximum tolerated dose (MTD). For the late-onset toxicity problem, a Bayesian adaptive dose-finding design for drug combinations is proposed. The dose-toxicity relationship is modeled using the Finney model. The unobserved delayed toxicity outcomes are treated as missing data, and Bayesian data augmentation is employed to handle the resulting missing data. Extensive simulation studies were conducted to examine the operating characteristics of the proposed designs and demonstrated their good performance in various practical scenarios.


Early-phase clinical trial designs have long been a focus of interest for clinicians and statisticians working in oncology. Several standard phase I and phase II designs have been widely implemented in medical practice. For phase I, the most commonly used methods are the 3+3 design and the continual reassessment method (CRM). A newly developed Bayesian model-based design, mTPI, is now used by an increasing number of hospitals and pharmaceutical companies. The advantages and disadvantages of these three leading phase I designs are discussed in this work, and their performance was compared using simulated data. The mTPI design exhibited superior performance in most scenarios in comparison with the 3+3 and CRM designs. The next major part of my work proposes an innovative seamless phase I/II design that allows clinicians to conduct phase I and phase II clinical trials simultaneously. A Bayesian framework was used throughout the design. The phase I portion of the design adopts the mTPI method, with the addition of a futility rule that monitors the efficacy of the tested drugs. Dose graduation rules were proposed to allow doses to move forward from the phase I portion of the study to the phase II portion without interrupting the ongoing phase I dose-finding schema. Once a dose graduated to phase II, adaptive randomization was used to allocate patients to the different treatment arms, with the intention of assigning more patients to the more promising dose(s). Simulations were again performed to compare the performance of this phase I/II design with a recently published phase I/II design and with the conventional phase I and phase II designs. The simulation results indicated that the seamless phase I/II design outperforms the other two competing methods in most scenarios, with superior trial power and a smaller required sample size. It also significantly reduces the overall study time. Similar to other early-phase clinical trial designs, the proposed seamless phase I/II design requires that the efficacy and safety outcomes be observable within a short time frame. This limitation can be overcome by using validated surrogate markers for the efficacy and safety endpoints.
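Of the three phase I designs compared in the abstract above, the 3+3 design is simple enough to write down as an explicit rule set. The sketch below encodes one common textbook version of the rule (variants exist) and walks a hypothetical sequence of cohorts through it; it does not implement CRM, mTPI, or the proposed seamless design, and the function names and cohort outcomes are assumptions.

```python
def three_plus_three_decision(n_treated, n_dlt):
    """Decision at the current dose under a common form of the 3+3 rule.

    n_treated : patients treated at this dose so far (3 or 6)
    n_dlt     : how many of them had a dose-limiting toxicity (DLT)
    """
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"
        if n_dlt == 1:
            return "expand"  # treat three more patients at the same dose
        return "stop"
    if n_treated == 6:
        return "escalate" if n_dlt <= 1 else "stop"
    raise ValueError("3+3 decisions are made after 3 or 6 patients")


def find_mtd(dlt_by_cohort, n_doses):
    """Walk cohorts of three through the rule; return the MTD dose index.

    Returns -1 if even the lowest dose is too toxic, and None if the
    cohort list runs out before the trial reaches a conclusion.
    """
    dose, treated, dlts = 0, 0, 0
    for cohort_dlt in dlt_by_cohort:
        treated += 3
        dlts += cohort_dlt
        decision = three_plus_three_decision(treated, dlts)
        if decision == "stop":
            return dose - 1  # MTD is the dose below the one that failed
        if decision == "escalate":
            if dose == n_doses - 1:
                return dose  # highest dose level was tolerated
            dose, treated, dlts = dose + 1, 0, 0
    return None


# Hypothetical DLT counts for successive cohorts of three patients.
mtd = find_mtd([0, 1, 0, 2], n_doses=4)
print("MTD dose index:", mtd)
```

Wrapping `find_mtd` in a loop that draws DLT counts from assumed true toxicity probabilities would give a basic operating-characteristics simulation of the kind used to compare designs.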


Mixture modeling is commonly used to model categorical latent variables that represent subpopulations in which membership is unknown but can be inferred from the data. In recent years, finite mixture models have been applied to time-to-event data. However, the commonly used survival mixture model assumes that the effects of covariates on failure times differ across latent classes but that the covariate distribution is homogeneous. The aim of this dissertation is to develop a method for examining time-to-event data in the presence of unobserved heterogeneity under a mixture modeling framework. A joint model is developed to incorporate the latent survival trajectory along with the observed information for the joint analysis of a time-to-event variable, its discrete and continuous covariates, and a latent class variable. It is assumed that both the effects of covariates on survival times and the distribution of covariates vary across latent classes. The unobservable survival trajectories are identified by estimating the probability that a subject belongs to a particular class based on observed information. We applied this method to a Hodgkin lymphoma study with long-term follow-up and observed four distinct latent classes in terms of long-term survival and distributions of prognostic factors. Our results from simulation studies and from the Hodgkin lymphoma study demonstrated the superiority of our joint model over the conventional survival model. This flexible inference method provides more accurate estimation and accommodates unobservable heterogeneity among individuals while taking the interactions between covariates into consideration.


Schizophrenia (SZ) is a complex disorder with high heritability and variable phenotypes, and efforts to find causal genes associated with its development have had limited success. Pathway-based analysis is an effective approach for investigating the molecular mechanisms of susceptibility genes associated with complex diseases. The etiology of a complex disease may involve a network of genetic factors, and interactions may occur among the genes. In this work we argue that some genes have small effects that by themselves are neither sufficient nor necessary to cause the disease, yet may induce slight changes in gene expression or affect protein function. Analyzing the gene-gene interaction mechanism within the disease pathway would therefore play a crucial role in dissecting the genetic architecture of complex diseases, making pathway-based analysis a complementary approach to GWAS. In this study, we implemented three novel linkage disequilibrium-based statistics, the linear combination, the quadratic, and the decorrelation test statistics, to investigate the interaction between linked and unlinked genes in two independent case-control GWAS datasets for SZ, including participants of European (EA) and African (AA) ancestries. The EA population included 1,173 cases and 1,378 controls with 729,454 genotyped SNPs, while the AA population included 219 cases and 288 controls with 845,814 genotyped SNPs. Using the gene-gene interaction method, we identified 17,186 significantly interacting gene sets in the EA dataset and 12,691 in the AA dataset. We also identified 18,846 genes in the EA dataset and 19,431 genes in the AA dataset that were in the disease pathways. However, few genes were reported to be significantly associated with SZ. Our research characterized the pathways for schizophrenia through gene-gene interaction and gene-pathway based approaches. Our findings suggest that these methods can yield insight into the molecular mechanisms of common complex diseases.


Background: Little is known about the effects on patient adherence when the same study drug is administered at the same dose in two populations with two different diseases in two different clinical trials. The Minocycline in Rheumatoid Arthritis (MIRA) trial and the NIH Exploratory Trials in Parkinson's disease (NET-PD) Futility Study I provide a unique opportunity to make this comparison and to compare methods of measuring adherence. This study may increase understanding of the influence of disease and adverse events on patient adherence and will provide insights to investigators selecting adherence assessment methods in future clinical trials of minocycline and other drugs. Methods: Minocycline adherence by pill count and the effect of adverse events were compared in the MIRA and NET-PD FS1 trials using multivariable linear regression. Within the MIRA trial, agreement between assay and pill count was assessed. The association of adverse events with assay adherence was examined using multivariable logistic regression. Results: Adherence derived from pill count did not differ significantly between the MIRA and NET-PD FS1 trials. Adverse events potentially related to minocycline did not appear useful for predicting minocycline adherence. In the MIRA trial, adherence measured by pill count appeared higher than adherence measured by assay. Agreement between pill count and assay was poor (kappa statistic = 0.25). Limitations: Trial and disease are completely confounded, and hence the independent effect of disease on adherence to minocycline treatment cannot be studied. Conclusion: Simple pill count may be preferred over assay for measuring adherence in minocycline clinical trials. Assays may be less sensitive in a clinical setting where appointments are not scheduled in relation to medication administration time, given that assays depend on many pharmacokinetic and instrument-related factors. However, pill count can be manipulated by the patient. Another study suggested that the self-report method is more sensitive than the pill count method in differentiating adherence from non-adherence. An effect of medication-related adverse events on adherence could not be detected.
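The kappa statistic reported in the abstract above corrects raw agreement between two measurement methods for the agreement expected by chance. The sketch below computes Cohen's kappa for two binary adherence classifications; the function name `cohens_kappa` and the ten hypothetical patients are assumptions and do not reproduce the MIRA data or its kappa of 0.25.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long lists of categorical ratings."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Observed agreement: share of subjects rated the same by both methods.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of marginal proportions, summed over categories.
    p_expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical classifications: 1 = adherent, 0 = non-adherent.
pill_count = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
assay = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
kappa = cohens_kappa(pill_count, assay)
print(round(kappa, 3))
```

In this toy example the two methods agree on 6 of 10 patients, but half of that agreement is expected by chance, so kappa is low even though raw agreement looks moderate.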


It is well accepted that tumorigenesis is a multi-step process involving aberrant functioning of genes that regulate cell proliferation, differentiation, apoptosis, genome stability, angiogenesis, and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high-throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA, and SNP microarrays as well as next-generation sequencing assays interrogating somatic mutations, insertions, deletions, translocations, and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach that identifies biomarkers predictive of therapy response or survival by integrating multiple assays, including gene expression, methylation, and copy number data, through penalized regression. To identify key cancer-specific genes while accounting for multiple mechanisms of regulation, we have developed the integIRTy software, which provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using item response theory. To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next-generation RNAseq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models with gene expression measurements dichotomized according to their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the way for more accurate and interpretable cancer classification by integrating signals from multiple sources.


Next-generation sequencing (NGS) technology has become a prominent tool in biological and biomedical research. However, NGS data analysis, such as de novo assembly, mapping, and variant detection, is far from mature, and the high sequencing error rate is one of the major problems. To minimize the impact of sequencing errors, we developed a highly robust and efficient method, MTM, to correct errors in NGS reads. We demonstrated the effectiveness of MTM on both single-cell data with highly non-uniform coverage and normal data with uniformly high coverage, showing that MTM's performance does not rely on the coverage of the sequencing reads. MTM was also compared with Hammer and Quake, the best methods for correcting non-uniform and uniform data, respectively. For non-uniform data, MTM outperformed both Hammer and Quake. For uniform data, MTM performed better than Quake and comparably to Hammer. Better error correction with MTM improved the quality of downstream analyses such as mapping and SNP detection. SNP calling is a major application of NGS technologies. However, the existence of sequencing errors complicates this process, especially for the low coverage (


Cardiovascular disease (CVD) is a threat to public health and has been reported to be the leading cause of death in the United States. The invention of next-generation sequencing (NGS) technology has revolutionized biomedical research. Investigating NGS data on CVD-related quantitative traits could help address the unknown etiology and disease mechanisms of CVD. NHLBI's Exome Sequencing Project (ESP) contains CVD-related phenotypes and their associated NGS exome sequence data. Initially, a subset of the NGS data covering 13 CVD-related quantitative traits was investigated. Only 6 traits, systolic blood pressure (SBP), diastolic blood pressure (DBP), height, platelet count, waist circumference, and weight, were analyzed using the functional linear model (FLM) and 7 existing methods. FLM outperformed all existing methods by identifying the highest number of significant genes: 96, 139, 756, 1162, 1106, and 298 genes associated with SBP, DBP, height, platelet count, waist circumference, and weight, respectively.