168 results for Biology, Biostatistics|Hydrology


Relevance: 80.00%

Publisher:

Abstract:

Many phase II clinical studies in oncology use a two-stage frequentist design such as Simon's optimal design. However, these designs share a logistical problem regarding patient accrual at the interim analysis. Strictly speaking, patient accrual may have to be suspended at the end of the first stage until all enrolled patients have had their outcomes, success or failure, observed. For example, when the study endpoint is six-month progression-free survival, accrual has to be suspended until all outcomes from stage I are observed. Study investigators are often concerned about such a suspension because of the loss of accrual momentum during this hiatus. We propose a two-stage phase II design that resolves the patient accrual problem caused by the interim analysis and can be used as an alternative to frequentist two-stage phase II designs in oncology.
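The accrual trade-off can be made concrete by computing a design's probability of early termination (PET) and expected sample size. The sketch below uses hypothetical design parameters, not values from this study.

```python
# Illustrative sketch: operating characteristics of a Simon two-stage design.
# All design parameters below are hypothetical examples, not from this study.
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def simon_two_stage(n1, r1, n, p):
    """PET and expected sample size at true response rate p.

    Stop at stage I if at most r1 responses are seen among the first n1
    patients; otherwise accrue to n patients in total.
    """
    pet = binom_cdf(r1, n1, p)          # probability of early termination
    en = n1 + (1 - pet) * (n - n1)      # expected total sample size
    return pet, en

pet, en = simon_two_stage(n1=13, r1=3, n=43, p=0.20)
```

Under a low true response rate, a large PET keeps the expected sample size well below the maximum, which is exactly what makes the interim suspension worthwhile.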

Abstract:

The joint modeling of longitudinal and survival data is a relatively new approach with many applications, such as HIV studies, cancer vaccine trials, and quality-of-life studies. There have been recent methodological developments for each component of the joint model, as well as for the statistical processes that link them together. Among these, second-order polynomial random effects models and linear mixed effects models are the most commonly used for the longitudinal trajectory function. In this study, we first relax the parametric constraints of polynomial random effects models by using Dirichlet process priors, and we consider three longitudinal markers, rather than only one marker, in one joint model. Second, we use a linear mixed effects model for the longitudinal process in a joint model analyzing the three markers. These methods were applied to the primary biliary cirrhosis (PBC) sequential data, collected from a clinical trial of PBC of the liver conducted between 1974 and 1984 at the Mayo Clinic. The effects of three longitudinal markers, (1) total serum bilirubin, (2) serum albumin, and (3) serum glutamic-oxaloacetic transaminase (SGOT), on patients' survival were investigated. The proportion of treatment effect was also studied using the proposed joint modeling approaches.

Based on the results, we conclude that the proposed modeling approaches yield a better fit to the data and give less biased parameter estimates for the trajectory functions than previous methods. Model fit is further improved by considering three longitudinal markers instead of only one. The analysis of the proportion of treatment effect from these joint models supports the same conclusion as the final model of Fleming and Harrington (1991): bilirubin and albumin together have a stronger impact in predicting patients' survival and can serve as surrogate endpoints for treatment.
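The linkage between the two submodels can be sketched by simulation: a subject-level random effect enters both the longitudinal trajectory and the event-time hazard. All parameter values below are illustrative assumptions, not estimates from the PBC analysis.

```python
# Illustrative sketch of a shared-random-effects joint model: the same
# subject-level effect b drives the longitudinal marker and the event hazard.
# All parameter values are made up for illustration.
import math
import random

random.seed(1)

def simulate_subject(beta0=-1.0, beta1=0.15, alpha=0.8, sigma_b=0.5,
                     base_hazard=0.05):
    b = random.gauss(0, sigma_b)                        # shared random effect
    marker = [beta0 + b + beta1 * t for t in range(5)]  # longitudinal submodel
    hazard = base_hazard * math.exp(alpha * b)          # survival submodel
    event_time = random.expovariate(hazard)             # exponential event time
    return marker, event_time

marker, event_time = simulate_subject()
```

Because b appears in both submodels, subjects with elevated marker trajectories also tend to have shorter event times, which is the dependence structure a joint model is built to capture.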

Abstract:

Ordinal logistic regression models are used to analyze dependent variables with multiple outcomes that can be ranked, but they have been underutilized. In this study, we describe four logistic regression models for analyzing an ordinal response variable.

In this methodological study, four regression models are considered: the multinomial logistic model, the adjacent-category logit model, the proportional odds model, and the continuation-ratio model. We illustrate and compare the fit of these models using data from the survey designed by the University of Texas School of Public Health research project PCCaSO (Promoting Colon Cancer Screening in people 50 and Over), which studied patients' confidence in completing colorectal cancer screening (CRCS).

The purpose of this study is twofold: first, to provide a synthesized review of models for analyzing data with an ordinal response, and second, to evaluate their usefulness in epidemiological research, with particular emphasis on model formulation, interpretation of model coefficients, and their implications. The four ordinal logistic models used in this study are (1) the multinomial logistic model, (2) the adjacent-category logistic model [9], (3) the continuation-ratio logistic model [10], and (4) the proportional odds logistic model [11]. We recommend that the analyst perform (1) goodness-of-fit tests and (2) sensitivity analyses by fitting and comparing different models.
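As a small illustration of the proportional odds formulation, the model applies one common slope to every cumulative logit, with only the cutpoints varying. The cutpoints and coefficient below are hypothetical, not fitted to the PCCaSO data.

```python
# Illustrative sketch of the proportional odds (cumulative logit) model.
# Cutpoints and slope are hypothetical values for illustration only.
import math

def category_probs(x, cutpoints=(-1.0, 0.5, 2.0), beta=0.8):
    """Category probabilities for an ordinal response with K = 4 levels:
    logit P(Y <= j | x) = cutpoint_j - beta * x, same beta for every j."""
    logistic = lambda z: 1.0 / (1.0 + math.exp(-z))
    cum = [logistic(c - beta * x) for c in cutpoints]  # P(Y <= j | x)
    # category probabilities from successive differences of the cumulatives
    return [cum[0]] + [b - a for a, b in zip(cum, cum[1:])] + [1 - cum[-1]]

probs = category_probs(1.0)
```

Increasing cutpoints guarantee increasing cumulative probabilities, so the successive differences are valid category probabilities that sum to one.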

Abstract:

Random Forests™ is reported to be one of the most accurate classification algorithms for complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study, a novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the method. Throughout this study we showed that Random Forests can handle a large number of weak input variables without overfitting and can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities.

Random Forests can deal with large-scale data sets without rigorous data preprocessing, and it has a robust variable-importance ranking measure. We propose a novel variable selection method, in the context of Random Forests, that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhances the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments.

When the data set had a high variable-to-observation ratio, Random Forests complemented the established logistic regression approach. This study suggests that Random Forests is recommended for such high-dimensional data: one can use Random Forests to select the important variables and then use logistic regression, or Random Forests itself, to estimate the effect sizes of the predictors and to classify new observations.

We also found that the mean decrease in accuracy is a more reliable variable-ranking measure than the mean decrease in Gini impurity.
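The noise-level cut-off idea can be illustrated with permutation importance (mean decrease in accuracy): permuting a truly informative variable degrades accuracy, while permuting a pure-noise variable does not, so the noise variable's importance provides a natural cut-off. The toy classifier below merely stands in for a fitted forest.

```python
# Illustrative sketch: noise-based cut-off for variable importance.
# A toy rule stands in for a fitted Random Forest; data are synthetic.
import random

random.seed(0)

def permutation_importance(predict, X, y, col):
    """Mean decrease in accuracy after permuting column `col`."""
    base = sum(predict(row) == yi for row, yi in zip(X, y)) / len(y)
    Xp = [row[:] for row in X]
    shuffled = [row[col] for row in Xp]
    random.shuffle(shuffled)
    for row, v in zip(Xp, shuffled):
        row[col] = v
    perm = sum(predict(row) == yi for row, yi in zip(Xp, y)) / len(y)
    return base - perm

# toy data: y depends on column 0 only; column 1 is pure noise
X = [[random.random(), random.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]
model = lambda row: int(row[0] > 0.5)   # stand-in for a fitted forest

imp_signal = permutation_importance(model, X, y, 0)
imp_noise = permutation_importance(model, X, y, 1)
```

Any real predictor whose permutation importance falls at or below the noise variable's level would be dropped under such a cut-off rule.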

Abstract:

Anticancer drugs typically are administered in the clinic in the form of mixtures, sometimes called combinations. Only in rare cases, however, are mixtures approved as drugs; rather, research on mixtures tends to occur after single drugs have been approved. The goal of this research project was to develop modeling approaches that would encourage rational preclinical mixture design. To this end, a series of models was developed. First, several QSAR classification models were constructed to predict the cytotoxicity, oral clearance, and acute systemic toxicity of drugs. The QSAR models were applied to a set of over 115,000 natural compounds in order to identify promising ones for testing in mixtures. Second, an improved method was developed to assess synergistic, antagonistic, and additive effects between drugs in a mixture. This method, dubbed the MixLow method, is similar to the Median-Effect method, the de facto standard for assessing drug interactions. The primary difference between the two is that the MixLow method uses a nonlinear mixed-effects model, rather than an ordinary least squares procedure, to estimate the parameters of concentration-effect curves. Parameter estimators produced by the MixLow method were more precise than those produced by the Median-Effect method, and coverage of Loewe index confidence intervals was superior. Third, a model was developed to predict drug interactions based on scores obtained from virtual docking experiments. This represents a novel approach for modeling drug mixtures and was more useful for the data modeled here than competing approaches. The model was applied to cytotoxicity data for 45 mixtures, each composed of up to 10 selected drugs. One drug, doxorubicin, was a standard chemotherapy agent; the others were well-known natural compounds, including curcumin, EGCG, quercetin, and rhein. Predictions of synergism/antagonism were made for all possible fixed-ratio mixtures, the cytotoxicities of the 10 best-scoring mixtures were tested, and drug interactions were assessed. Predicted and observed responses were highly correlated (r² = 0.83). The results suggested that some mixtures allowed up to an 11-fold reduction of doxorubicin concentrations without sacrificing efficacy. Taken together, the models developed in this project present a general approach to the rational design of mixtures during preclinical drug development.
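The Loewe index underlying these interaction assessments has a simple closed form once each drug's concentration-effect curve is known. The sketch below uses median-effect (Hill) curves with made-up parameters; a MixLow-style analysis would instead estimate these parameters with a nonlinear mixed-effects fit.

```python
# Illustrative sketch: Loewe (combination) index for a two-drug mixture.
# Hill parameters are made-up values, not MixLow fits from this project.

def conc_for_effect(effect, ic50, hill):
    """Invert a Hill concentration-effect curve: dose producing `effect`
    (a fraction in (0, 1)) when the drug is given alone."""
    return ic50 * (effect / (1 - effect)) ** (1 / hill)

def loewe_index(d1, d2, effect, ic50_1, hill1, ic50_2, hill2):
    """< 1 suggests synergy, = 1 additivity, > 1 antagonism at `effect`."""
    D1 = conc_for_effect(effect, ic50_1, hill1)  # drug 1 alone
    D2 = conc_for_effect(effect, ic50_2, hill2)  # drug 2 alone
    return d1 / D1 + d2 / D2

# two identical drugs, each at half its single-agent IC50, at 50% effect
ci = loewe_index(d1=1.0, d2=1.0, effect=0.5,
                 ic50_1=2.0, hill1=1.0, ic50_2=2.0, hill2=1.0)
```

For the symmetric case shown, the index works out to exactly one, the Loewe additivity benchmark.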

Abstract:

Drinking water-related exposures among populations living in the United States-Mexico border region, particularly among Hispanics, are largely unknown. Specifically, the perceptions that may affect water source selection have not been fully addressed. This study evaluates drinking water quality perceptions in a mostly Hispanic community living along the United States-Mexico border, a community also facing water scarcity issues. Using a survey administered during two seasons (winter and summer), data were collected from a total of 608 participants, of whom 303 were living in the United States and 305 in Mexico. A convenience sampling technique was used to select households, and those interviewed were over 18 years of age. Statistically significant differences were observed by country of residence (p = 0.002): those living in Mexico reported a higher use of bottled water than those living in the United States. Perception factors, especially taste, were cited as the main reasons for not selecting unfiltered tap water as a primary drinking water source. Understanding what influences drinking water source preference can aid the development of risk communication strategies regarding water quality.

Abstract:

This study investigates a theoretical model in which a longitudinal process, a stationary Markov chain, and a Weibull survival process share a bivariate random effect. A quality-of-life-adjusted survival is then calculated as the weighted sum of survival time. Theoretical values of the population mean adjusted survival under the described model are computed numerically, and the parameters of the bivariate random effect significantly affect these theoretical values. Maximum likelihood and Bayesian methods are applied to simulated data to estimate the model parameters. Based on the parameter estimates, the predicted population mean adjusted survival can then be calculated numerically and compared with the theoretical values. The Bayesian and maximum likelihood methods provide parameter estimates and population mean predictions with comparable accuracy; however, the Bayesian method suffers from poor convergence due to autocorrelation and between-parameter correlation.
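A minimal simulation of the quality-adjusted survival quantity, assuming a three-state stationary Markov chain with illustrative transition probabilities and utility weights (not the paper's bivariate random-effect model), looks like this:

```python
# Illustrative sketch: quality-adjusted survival as a utility-weighted sum of
# time spent in the states of a stationary Markov chain. The transition
# matrix and utilities are invented for illustration.
import random

random.seed(2)

P = [[0.80, 0.15, 0.05],   # rows: healthy, ill, dead (dead is absorbing)
     [0.10, 0.70, 0.20],
     [0.00, 0.00, 1.00]]
utility = [1.0, 0.5, 0.0]  # quality-of-life weight per state, per cycle

def qaly_one_path(max_cycles=100):
    state, total = 0, 0.0
    for _ in range(max_cycles):
        if state == 2:                    # absorbed: no further utility
            break
        total += utility[state]
        state = random.choices(range(3), weights=P[state])[0]
    return total

mean_qaly = sum(qaly_one_path() for _ in range(2000)) / 2000
```

The Monte Carlo mean plays the role of the population mean adjusted survival that the study computes numerically and then predicts from estimated parameters.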

Abstract:

Hereditary nonpolyposis colorectal cancer (HNPCC) is an autosomal dominant disease caused by germline mutations in DNA mismatch repair (MMR) genes. The nucleotide excision repair (NER) pathway also plays a very important role in cancer development. We systematically studied interactions between NER and MMR genes to identify NER gene single nucleotide polymorphism (SNP) risk factors that modify the effect of MMR mutations on cancer risk in HNPCC. We analyzed data on polymorphisms in 10 NER genes that had been genotyped in HNPCC patients carrying MSH2 and MLH1 gene mutations. The influence of the NER gene SNPs on time to onset of colorectal cancer (CRC) was assessed using survival analysis and a semiparametric proportional hazards model. We found that the median age of onset of CRC among MMR mutation carriers with the ERCC1 variant was 3.9 years earlier than in patients with wild-type ERCC1 (median 47.7 vs. 51.6; log-rank test p = 0.035). The influence of the Rad23B A249V SNP on age of onset of HNPCC is age dependent (likelihood ratio test p = 0.0056). Interestingly, using the likelihood ratio test, we also found evidence of genetic interactions between the MMR gene mutations and SNPs in the ERCC1 gene (C8092A) and the XPG/ERCC5 gene (D1104H), with p-values of 0.004 and 0.042, respectively. An assessment using tree-structured survival analysis (TSSA) showed distinct gene interactions in MLH1 mutation carriers and MSH2 mutation carriers: ERCC1 SNP genotypes greatly modified the age of onset of HNPCC in MSH2 mutation carriers, while no effect was detected in MLH1 mutation carriers. Given that the NER genes in this study play different roles in the NER pathway, they may have distinct influences on the development of HNPCC. The findings of this study are important for elucidating the molecular mechanism of colon cancer development and for understanding why some MSH2 and MLH1 mutation carriers develop CRC early while others never develop CRC. Overall, the findings also have important implications for the development of early detection and prevention strategies, as well as for understanding the mechanism of colorectal carcinogenesis in HNPCC.
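The genotype comparisons above rest on the two-group log-rank test, which can be written out directly. The event times below are toy values, not study data.

```python
# Illustrative sketch: a two-group log-rank test, as used to compare age of
# CRC onset by genotype. Times and censoring indicators are toy values.
from itertools import groupby

def logrank(times1, events1, times2, events2):
    """Chi-square (1 df) log-rank statistic; event = 1, censored = 0."""
    data = sorted([(t, e, 0) for t, e in zip(times1, events1)] +
                  [(t, e, 1) for t, e in zip(times2, events2)])
    at1, at2 = len(times1), len(times2)
    o_minus_e, var = 0.0, 0.0
    for t, grp in groupby(data, key=lambda d: d[0]):
        rows = list(grp)
        d1 = sum(e for _, e, g in rows if g == 0)   # group-1 events at t
        d = sum(e for _, e, _ in rows)              # total events at t
        n = at1 + at2
        if d > 0 and n > 1:
            o_minus_e += d1 - d * at1 / n
            var += d * (at1 / n) * (at2 / n) * (n - d) / (n - 1)
        at1 -= sum(1 for _, _, g in rows if g == 0)
        at2 -= sum(1 for _, _, g in rows if g == 1)
    return o_minus_e ** 2 / var

stat = logrank([2, 4, 6, 8], [1, 1, 1, 1], [5, 7, 9, 11], [1, 1, 1, 0])
```

The statistic is referred to a chi-square distribution with one degree of freedom to obtain a p-value such as the 0.035 reported above.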

Abstract:

With the recognition of the importance of evidence-based medicine, there is an emerging need for methods to systematically synthesize available data. Specifically, methods that provide accurate estimates of test characteristics for diagnostic tests are needed to help physicians make better clinical decisions. To provide more flexible approaches for meta-analysis of diagnostic tests, we developed three Bayesian generalized linear models. Two of these models, a bivariate normal model and a bivariate binomial model, analyze pairs of sensitivity and specificity values while incorporating the correlation between these two outcome variables. Noninformative independent uniform priors were used for the variances of sensitivity and specificity and for the correlation; we also applied an inverse Wishart prior to check the sensitivity of the results. The third model was a multinomial model in which the test results were modeled as multinomial random variables. All three models can include specific imaging techniques as covariates in order to compare performance, with vague normal priors assigned to the covariate coefficients. The computations were carried out using the BUGS ('Bayesian inference Using Gibbs Sampling') implementation of Markov chain Monte Carlo techniques. We investigated the properties of the three proposed models through extensive simulation studies, and we also applied these models to a previously published meta-analysis dataset on cervical cancer as well as to an unpublished melanoma dataset. In general, our findings show that the point estimates of sensitivity and specificity were consistent between the Bayesian and frequentist bivariate normal and binomial models. However, in the simulation studies, the estimates of the correlation coefficient from the Bayesian bivariate models were not as good as those obtained from frequentist estimation, regardless of which prior distribution was used for the covariance matrix. The Bayesian multinomial model consistently underestimated sensitivity and specificity, regardless of sample size and correlation coefficient. In conclusion, the Bayesian bivariate binomial model provides the most flexible framework for future applications because of the following strengths: (1) it facilitates direct comparison between different tests; (2) it captures the variability in both sensitivity and specificity simultaneously, as well as the intercorrelation between the two; and (3) it can be applied directly to sparse data without ad hoc corrections.
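The bivariate idea behind the first two models can be motivated by the simpler fixed-effect calculation below, which pools study-level sensitivity and specificity on the logit scale from invented 2x2 counts. The full Bayesian models additionally estimate the correlation between the two outcomes.

```python
# Illustrative sketch: pooling study-level (sensitivity, specificity) pairs
# on the logit scale. The per-study 2x2 counts are invented; this is the
# fixed-effect motivation, not the full bivariate Bayesian model.
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

# per-study (TP, FN, TN, FP) counts, illustrative only
studies = [(45, 5, 80, 20), (30, 10, 60, 15), (50, 8, 90, 10)]

sens_logits = [logit(tp / (tp + fn)) for tp, fn, _, _ in studies]
spec_logits = [logit(tn / (tn + fp)) for _, _, tn, fp in studies]

pooled_sens = inv_logit(sum(sens_logits) / len(studies))
pooled_spec = inv_logit(sum(spec_logits) / len(studies))
```

Working on the logit scale keeps the pooled estimates inside (0, 1); the bivariate models extend this by placing a joint distribution over the two logits.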

Abstract:

Several studies have examined the association between high glycemic index (GI) and glycemic load (GL) diets and the risk of coronary heart disease (CHD). However, most of these studies were conducted primarily in white populations. The primary aim of this study was to examine whether high GI and GL diets are associated with increased risk of developing CHD in whites and African Americans, in non-diabetics and diabetics, and within strata of body mass index (BMI) and hypertension (HTN). Baseline and 17-year follow-up data from the ARIC (Atherosclerosis Risk in Communities) study were used. The study population (13,051) consisted of 74% whites, 26% African Americans, 89% non-diabetics, 11% diabetics, 43% male, and 57% female, aged 44 to 66 years at baseline. Data from the ARIC food frequency questionnaire at baseline were analyzed to provide GI and GL indices for each subject. Increases of 25 and 30 units for GI and GL, respectively, were used to describe relationships with incident CHD risk. Hazard ratios adjusted for propensity score, with 95% confidence intervals (CI), were used to assess associations. During 17 years of follow-up (1987 to 2004), 1,683 cases of CHD were recorded. Glycemic index was associated with a 2.12-fold (95% CI: 1.05, 4.30) increased incident CHD risk for all African Americans, and GL was associated with a 1.14-fold (95% CI: 1.04, 1.25) increased CHD risk for all whites. In addition, GL was also an important CHD risk factor for white non-diabetics (HR = 1.59; 95% CI: 1.33, 1.90). Furthermore, within the BMI stratum of 23.0 to 29.9 in non-diabetics, GI was associated with an increased hazard ratio of 11.99 (95% CI: 2.31, 62.18) for CHD in African Americans, and GL was associated with a 1.23-fold (95% CI: 1.08, 1.39) increased CHD risk in whites. Body mass index modified the effect of GI and GL on CHD risk in all whites and in white non-diabetics. For HTN, both systolic and diastolic blood pressure modified the effect of GI and GL on CHD risk in all whites and African Americans, in white and African American non-diabetics, and in white diabetics. Further studies should examine other factors that could influence the effects of GI and GL on CHD risk, including dietary factors, physical activity, and diet-gene interactions.
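The per-25-unit (GI) and per-30-unit (GL) reporting convention is a simple rescaling of the Cox model's per-unit coefficient. The slope and standard error below are hypothetical values chosen for illustration, not the ARIC estimates.

```python
# Illustrative sketch: converting a per-unit Cox coefficient into the hazard
# ratio for a 25-unit increase in GI. beta and se are hypothetical values.
import math

beta = 0.0301   # hypothetical log-hazard increase per one GI unit
se = 0.0145     # hypothetical standard error of beta

hr_25 = math.exp(25 * beta)                                  # HR per 25 units
ci_25 = tuple(math.exp(25 * (beta + z * se)) for z in (-1.96, 1.96))
```

Because the hazard ratio scales exponentially with the chosen increment, the same fitted model yields very different-looking effect sizes depending on the reporting unit, which is why the abstract states its 25- and 30-unit conventions explicitly.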

Abstract:

A Monte Carlo simulation was conducted to investigate parameter estimation and hypothesis testing in some well-known adaptive randomization procedures. The four urn models studied are the Randomized Play-the-Winner (RPW), Randomized Pólya Urn (RPU), Birth and Death Urn with Immigration (BDUI), and Drop-the-Loser (DL) urn. Two sequential estimation methods, sequential maximum likelihood estimation (SMLE) and the doubly adaptive biased coin design (DABC), are simulated at three optimal allocation targets that minimize the expected number of failures under the assumption of constant variance of the simple difference (RSIHR), the relative risk (ORR), and the odds ratio (OOR), respectively. The log likelihood ratio test and three Wald-type tests (simple difference, log relative risk, and log odds ratio) are compared across the adaptive procedures.

Simulation results indicate that although RPW is slightly better at assigning more patients to the superior treatment, the DL method is considerably less variable and its test statistics have better normality. Compared with SMLE, DABC has a slightly higher overall response rate with lower variance, but larger bias and variance in parameter estimation. Additionally, the test statistics under SMLE have better normality and a lower type I error rate, and the power of hypothesis testing is more comparable with that of equal randomization. Usually, RSIHR has the highest power among the three optimal allocation ratios; however, the ORR allocation has better power and a lower type I error rate when the log relative risk is the test statistic, and the expected number of failures under ORR is smaller than under RSIHR. It is also shown that the simple difference of response rates has the worst normality among the four test statistics, and the power of the hypothesis test is always inflated when the simple difference is used. On the other hand, the normality of the log likelihood ratio test statistic is robust to changes in the adaptive randomization procedure.
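The RPW urn rule itself is easy to simulate: draw a ball to assign the arm, then add a ball for the arm that just succeeded, or for the opposite arm on failure. The response rates below are illustrative, not the settings used in the thesis.

```python
# Illustrative sketch: the Randomized Play-the-Winner (RPW) urn rule.
# Response rates p_a and p_b are made-up values for illustration.
import random

random.seed(3)

def rpw_trial(n, p_a=0.7, p_b=0.4):
    urn = ['A', 'B']                 # one ball per arm to start
    assigned = {'A': 0, 'B': 0}
    for _ in range(n):
        arm = random.choice(urn)     # draw with replacement
        assigned[arm] += 1
        success = random.random() < (p_a if arm == 'A' else p_b)
        # success adds a ball of the same arm; failure adds the other arm
        urn.append(arm if success else ('B' if arm == 'A' else 'A'))
    return assigned

alloc = rpw_trial(500)
```

With these rates the urn drifts toward the better arm, so arm A ends up with the larger share of the 500 patients, which is the skewed-allocation behavior the simulation study quantifies.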

Abstract:

Colorectal cancer ranks second among causes of cancer death in the United States, and obesity is an important risk factor for colorectal cancer (1). Early detection of colorectal cancer while it is localized can effectively reduce colorectal cancer mortality and increase patients' survival time when they are treated. Previous studies have also shown that obese women were more likely to delay breast cancer screening and cervical cancer screening than normal-weight women (2-5). However, results from prior studies examining the relationship between obesity and colorectal cancer screening are not consistent. This research conducted a meta-analysis of previous cross-sectional studies selected from the Medline database to evaluate the association between obesity and colorectal cancer screening. While the odds ratio was not statistically different from one, the results of this meta-analysis under the random effects model showed that obese people are slightly less likely to undergo colorectal cancer screening than normal-weight individuals (OR, 0.93; 95% CI, 0.75-1.15). The meta-analysis was particularly sensitive to one individual study (6): after removing Heo's study, the effect of obesity on colorectal cancer screening was statistically significant (OR, 0.87; 95% CI, 0.81-0.92). Further systematic studies focused on whether the effect of obesity on colorectal cancer screening is limited to women are suggested.
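Random-effects pooling of this kind is typically done with the DerSimonian-Laird estimator; a sketch with invented study-level log odds ratios follows (these are not the studies in the meta-analysis).

```python
# Illustrative sketch: DerSimonian-Laird random-effects pooling of log odds
# ratios. Study estimates and variances are invented for illustration.
import math

# (log OR, variance of log OR) per study
studies = [(-0.15, 0.02), (-0.05, 0.01), (-0.25, 0.04), (0.05, 0.03)]

w = [1 / v for _, v in studies]
fixed = sum(wi * y for wi, (y, _) in zip(w, studies)) / sum(w)

# Cochran's Q and the method-of-moments between-study variance tau^2
q = sum(wi * (y - fixed) ** 2 for wi, (y, _) in zip(w, studies))
df = len(studies) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# random-effects weights shrink toward equality as tau^2 grows
w_re = [1 / (v + tau2) for _, v in studies]
pooled_or = math.exp(sum(wi * y for wi, (y, _) in zip(w_re, studies))
                     / sum(w_re))
```

Sensitivity to a single study, as reported above for Heo's study, can be checked by re-running this pooling with each study removed in turn (leave-one-out).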

Abstract:

Standard methods for testing safety data are needed to ensure the safe conduct of clinical trials. In particular, objective rules for reliably identifying unsafe treatments need to be put into place to help protect patients from unnecessary harm. Data monitoring committees (DMCs) are uniquely qualified to evaluate accumulating unblinded data and make recommendations about the continuing safe conduct of a trial. However, it is the trial leadership who must make the difficult ethical decision about stopping a trial, and they could benefit from objective statistical rules that help them judge the strength of evidence contained in the blinded data. We design early stopping rules for harm that act as continuous safety screens for randomized controlled clinical trials with blinded treatment information, and which could be used by anyone, including trial investigators and trial leadership. A Bayesian framework, with emphasis on the likelihood function, is used to allow continuous monitoring without adjusting for multiple comparisons. Close collaboration between the statistician and the clinical investigators will be needed to design safety screens with good operating characteristics. Though the mathematics underlying this procedure may be computationally intensive, the statistical rules will be easy to implement, and the continuous screening will give suitably early warning should real problems emerge. Trial investigators and trial leadership need these safety screens to help them effectively monitor the ongoing safe conduct of clinical trials with blinded data.
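One concrete form such a blinded screen could take, assuming a Beta(1, 1) prior on the pooled adverse-event rate and illustrative tolerability and cut-off values (these specifics are assumptions, not taken from the study), is:

```python
# Illustrative sketch: a Bayesian continuous safety screen on blinded, pooled
# adverse-event counts. The tolerable pooled rate (0.10), the Beta(1, 1)
# prior, and the 0.90 posterior cut-off are assumptions for illustration.
from math import comb

def beta_cdf(x, a, b):
    """P(Beta(a, b) <= x) for integer a, b, via the binomial identity."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def flag_harm(events, n, rate0=0.10, cutoff=0.90):
    """Flag the trial when P(pooled event rate > rate0 | data) > cutoff."""
    a, b = 1 + events, 1 + n - events   # Beta(1, 1) prior, binomial update
    return 1 - beta_cdf(rate0, a, b) > cutoff

high = flag_harm(events=8, n=30)   # many events early: screen fires
low = flag_harm(events=1, n=30)    # consistent with the tolerable rate
```

Because the posterior is updated after every event, the rule can be checked continuously as data accumulate, without the multiplicity adjustment a frequentist group-sequential boundary would require.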

Abstract:

Functional gastrointestinal disorders (FGIDs) are defined as ailments of the mid or lower gastrointestinal tract that are not attributable to any discernible anatomic or biochemical defects.1 FGIDs include functional bowel disorders, also known as persisting abdominal symptoms (PAS). Irritable bowel syndrome (IBS) is one of the most common illnesses classified under PAS.2,3 This is the first prospective study to examine the etiology and pathogenesis of post-infectious PAS in the context of environmental exposure and genetic susceptibility in a cohort of US travelers to Mexico. Our objective was to identify infectious, genetic, and environmental factors that predispose to post-infectious PAS.

Methods. This is a secondary data analysis of a prospective study of a cohort of 704 healthy North American tourists to Cuernavaca, Morelos and Guadalajara, Jalisco, in Mexico. The subjects, who were at risk for travelers' diarrhea (TD), were assessed for chronic abdominal symptoms at enrollment and six months after their return to the US.

Outcomes. PAS was defined as disturbances of the mid and lower gastrointestinal system without any known pathological or radiological abnormalities or infectious or metabolic causes. It refers to functional bowel disease, category C of the functional gastrointestinal diseases as defined by the Rome II criteria. PAS was subclassified into irritable bowel syndrome (IBS) and functional abdominal disease (FAD). IBS is defined as recurrent abdominal pain or discomfort present at least 25% of the time and associated with improvement with defecation or a change in frequency or form of stool. FAD encompasses other chronic abdominal symptoms that do not meet the criteria for IBS; it includes functional diarrhea, functional constipation, functional bloating, and unspecified bowel symptoms.

Results. Among the 704 travelers studied, there were 202 cases of PAS, comprising 175 cases of FAD and 27 cases of IBS. PAS was more frequent among subjects who developed travelers' diarrhea in Mexico than among travelers who remained healthy during the short-term visit (52 vs. 38; OR = 1.8; CI, 1.3-2.5; P < 0.001). A statistically significant difference was noted in the mean age of subjects with PAS compared to healthy controls (28 vs. 34 yrs; OR = 0.97; CI, 0.95-0.98; P < 0.001). Travelers who experienced multiple episodes, had a later onset of diarrhea in Mexico, or passed greater numbers of unformed stools were more likely to be in the PAS group at six months. Participants who developed TD caused by enterotoxigenic E. coli in Mexico had a 2.6-times-higher risk of developing FAD (P = 0.003), and infection with Providencia spp. also conferred a greater risk of developing PAS. Subjects who sought treatment for diarrhea while in Mexico displayed a significantly lower frequency of IBS at six months of follow-up (OR = 0.30; CI, 0.10-0.80; P = 0.02). Forty-six SNPs belonging to 14 genes were studied, and seven SNPs were associated with PAS at six months. These included four SNPs from the caspase recruitment domain-containing protein 15 gene (CARD15), two SNPs from the surfactant pulmonary-associated protein D gene (SFTPD), and one from the decay-accelerating factor for complement gene (CD55). A genetic risk score (GRS) was constructed based on the seven SNPs that showed significant association with PAS. A 20% greater risk of PAS was noted for every unit increase in GRS; the risk increased by 30% for IBS. The mean GRS was higher for IBS (2.2) and PAS (1.1) than for healthy controls (0.51). These data suggest a role for these genetic polymorphisms in defining susceptibility to PAS.

Conclusions. The study allows us to identify individuals at risk of developing post-infectious IBS (PI-IBS) and persisting abdominal symptoms after an episode of TD. The observations in this study will be of use in developing measures to prevent and treat post-infectious irritable bowel syndrome among travelers, including pre-travel counseling, the use of vaccines, antibiotic prophylaxis, or the initiation of early antimicrobial therapy. This study also provides insights into the pathogenesis of post-infectious PAS and IBS. (Abstract shortened by UMI.)
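The genetic risk score construction can be sketched as an allele count over the associated SNPs, with the reported 20% per-unit odds increase applied multiplicatively. The SNP labels and the example genotypes below are hypothetical.

```python
# Illustrative sketch of an unweighted genetic risk score (GRS). The SNP
# labels and genotypes are hypothetical; only the counts of genes (4 CARD15,
# 2 SFTPD, 1 CD55) and the 20% per-unit figure come from the abstract.
risk_snps = ['CARD15_1', 'CARD15_2', 'CARD15_3', 'CARD15_4',
             'SFTPD_1', 'SFTPD_2', 'CD55_1']

def grs(genotypes):
    """genotypes: dict mapping SNP label -> risk-allele count (0, 1, or 2)."""
    return sum(genotypes.get(snp, 0) for snp in risk_snps)

def relative_odds(score, per_unit=1.20):
    """Relative odds of PAS, assuming 20% greater risk per GRS unit."""
    return per_unit ** score

subject = {'CARD15_1': 1, 'SFTPD_2': 1, 'CD55_1': 0}  # hypothetical subject
score = grs(subject)            # two risk alleles counted
odds = relative_odds(score)     # about a 1.44-fold relative odds
```

The same construction explains the reported group means: a mean GRS of 2.2 among IBS cases versus 0.51 among controls implies a substantially higher cumulative allele burden in cases.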

Abstract:

In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences in genotype frequencies between disease and healthy groups or among different population groups. However, testing a great number of SNPs simultaneously raises a multiple testing problem and yields false-positive results. Although this problem can be dealt with effectively through approaches such as Bonferroni correction, permutation testing, and false discovery rates, patterns of joint effects involving several genes, each with a weak effect, may not be detectable in this way. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. Exhaustive search of all SNP subsets is computationally infeasible for the millions of SNPs in a genome-wide study, so several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset in data sets where the number of feature SNPs far exceeds the number of observations. In this study, we took two steps to achieve this goal. First, we selected 1,000 SNPs through an effective filter method; then we performed feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. We also developed a novel classification method, the sequential information bottleneck (sIB) method, wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with classical linear discriminant analysis (LDA) in terms of classification performance. Finally, we performed chi-square tests to examine the relationship between each SNP and disease from another point of view.

In general, our results show that filtering features using the harmonic mean of sensitivity and specificity (HMSS) through LDA is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that an exhaustive search of a small subset (one SNP, two SNPs, or a 3-SNP subset based on the best 100 composite 2-SNPs) can find an optimal subset, and that further inclusion of more SNPs through a heuristic algorithm does not always increase the performance of SNP subsets. Although sequential forward floating selection can be applied to prevent the nesting effect of forward selection, it does not always outperform the latter, due to overfitting from observing more complex subset states.

Our results also indicate that HMSS, as a criterion for evaluating the classification ability of a function, can be used on imbalanced data without modifying the original dataset, unlike classification accuracy. Our four studies suggest that the sequential information bottleneck, a new unsupervised technique, can be adopted to predict the outcome, and its ability to detect the target status is superior to that of traditional LDA in this study.

From our results, the best test HMSS for predicting CVD, stroke, CAD, and psoriasis through sIB is 0.59406, 0.641815, 0.645315, and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls can reach 0.708999, 0.863216, 0.639918, and 0.850275, respectively, in the four studies if the test accuracy among cases is required to be at least 0.4. On the other hand, the highest test accuracy of sIB for diagnosing disease among cases can reach 0.748644, 0.789916, 0.705701, and 0.749436, respectively, in the four studies if the test accuracy among controls is required to be at least 0.4.

A further genome-wide association analysis using the chi-square test shows that no significant SNPs are detected at the cut-off level 9.09451E-08 in the Framingham Heart Study of CVD. The WTCCC results detect only two significant SNPs associated with CAD. In the genome-wide study of psoriasis, most of the top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease by chi-square test at the cut-off value 1.11E-07.

Although our classification methods can achieve high accuracy in this study, complete descriptions of the classification results (95% confidence intervals or statistical tests of differences) require more cost-effective methods or a more efficient computing system, neither of which is currently available for our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high prediction ability; SNPs with good discriminant power are not necessarily causal markers for the disease.
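The HMSS criterion used for filtering can be computed directly from a confusion table; it rewards balanced sensitivity and specificity on imbalanced data, unlike raw accuracy. The counts below are toy values.

```python
# Illustrative sketch: harmonic mean of sensitivity and specificity (HMSS)
# as a filter criterion. Confusion counts are toy values.

def hmss(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity from a 2x2 table."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return 2 * sens * spec / (sens + spec) if sens + spec else 0.0

balanced = hmss(tp=70, fn=30, tn=70, fp=30)   # sens = spec = 0.7
skewed = hmss(tp=95, fn=5, tn=10, fp=90)      # high sens, poor spec
```

The skewed classifier has the higher raw accuracy on a case-heavy dataset, yet its HMSS is far lower than the balanced classifier's, which is exactly why HMSS is preferred over accuracy for imbalanced data.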