985 resultados para logistic models
Resumo:
Generalized linear mixed models (GLMMs) provide an elegant framework for the analysis of correlated data. Due to the non-closed form of the likelihood, GLMMs are often fit by computational procedures like penalized quasi-likelihood (PQL). Special cases of these models are generalized linear models (GLMs), which are often fit using algorithms like iterative weighted least squares (IWLS). High computational costs and memory space constraints often make it difficult to apply these iterative procedures to data sets with very large number of cases. This paper proposes a computationally efficient strategy based on the Gauss-Seidel algorithm that iteratively fits sub-models of the GLMM to subsetted versions of the data. Additional gains in efficiency are achieved for Poisson models, commonly used in disease mapping problems, because of their special collapsibility property which allows data reduction through summaries. Convergence of the proposed iterative procedure is guaranteed for canonical link functions. The strategy is applied to investigate the relationship between ischemic heart disease, socioeconomic status and age/gender category in New South Wales, Australia, based on outcome data consisting of approximately 33 million records. A simulation study demonstrates the algorithm's reliability in analyzing a data set with 12 million records for a (non-collapsible) logistic regression model.
Resumo:
A patient-specific surface model of the proximal femur plays an important role in planning and supporting various computer-assisted surgical procedures including total hip replacement, hip resurfacing, and osteotomy of the proximal femur. The common approach to derive 3D models of the proximal femur is to use imaging techniques such as computed tomography (CT) or magnetic resonance imaging (MRI). However, the high logistic effort, the extra radiation (CT-imaging), and the large quantity of data to be acquired and processed make them less functional. In this paper, we present an integrated approach using a multi-level point distribution model (ML-PDM) to reconstruct a patient-specific model of the proximal femur from intra-operatively available sparse data. Results of experiments performed on dry cadaveric bones using dozens of 3D points are presented, as well as experiments using a limited number of 2D X-ray images, which demonstrate promising accuracy of the present approach.
Resumo:
OBJECTIVES This study sought to validate the Logistic Clinical SYNTAX (Synergy Between Percutaneous Coronary Intervention With Taxus and Cardiac Surgery) score in patients with non-ST-segment elevation acute coronary syndromes (ACS), in order to further legitimize its clinical application. BACKGROUND The Logistic Clinical SYNTAX score allows for an individualized prediction of 1-year mortality in patients undergoing contemporary percutaneous coronary intervention. It is composed of a "Core" Model (anatomical SYNTAX score, age, creatinine clearance, and left ventricular ejection fraction), and "Extended" Model (composed of an additional 6 clinical variables), and has previously been cross validated in 7 contemporary stent trials (>6,000 patients). METHODS One-year all-cause death was analyzed in 2,627 patients undergoing percutaneous coronary intervention from the ACUITY (Acute Catheterization and Urgent Intervention Triage Strategy) trial. Mortality predictions from the Core and Extended Models were studied with respect to discrimination, that is, separation of those with and without 1-year all-cause death (assessed by the concordance [C] statistic), and calibration, that is, agreement between observed and predicted outcomes (assessed with validation plots). Decision curve analyses, which weight the harms (false positives) against benefits (true positives) of using a risk score to make mortality predictions, were undertaken to assess clinical usefulness. RESULTS In the ACUITY trial, the median SYNTAX score was 9.0 (interquartile range 5.0 to 16.0); approximately 40% of patients had 3-vessel disease, 29% diabetes, and 85% underwent drug-eluting stent implantation. Validation plots confirmed agreement between observed and predicted mortality. The Core and Extended Models demonstrated substantial improvements in the discriminative ability for 1-year all-cause death compared with the anatomical SYNTAX score in isolation (C-statistics: SYNTAX score: 0.64, 95% confidence interval [CI]: 0.56 to 0.71; Core Model: 0.74, 95% CI: 0.66 to 0.79; Extended Model: 0.77, 95% CI: 0.70 to 0.83). Decision curve analyses confirmed the increasing ability to correctly identify patients who would die at 1 year with the Extended Model versus the Core Model versus the anatomical SYNTAX score, over a wide range of thresholds for mortality risk predictions. CONCLUSIONS Compared to the anatomical SYNTAX score alone, the Core and Extended Models of the Logistic Clinical SYNTAX score more accurately predicted individual 1-year mortality in patients presenting with non-ST-segment elevation acute coronary syndromes undergoing percutaneous coronary intervention. These findings support the clinical application of the Logistic Clinical SYNTAX score.
Resumo:
In 2011, there will be an estimated 1,596,670 new cancer cases and 571,950 cancer-related deaths in the US. With the ever-increasing applications of cancer genetics in epidemiology, there is great potential to identify genetic risk factors that would help identify individuals with increased genetic susceptibility to cancer, which could be used to develop interventions or targeted therapies that could hopefully reduce cancer risk and mortality. In this dissertation, I propose to develop a new statistical method to evaluate the role of haplotypes in cancer susceptibility and development. This model will be flexible enough to handle not only haplotypes of any size, but also a variety of covariates. I will then apply this method to three cancer-related data sets (Hodgkin Disease, Glioma, and Lung Cancer). I hypothesize that there is substantial improvement in the estimation of association between haplotypes and disease, with the use of a Bayesian mathematical method to infer haplotypes that uses prior information from known genetics sources. Analysis based on haplotypes using information from publically available genetic sources generally show increased odds ratios and smaller p-values in both the Hodgkin, Glioma, and Lung data sets. For instance, the Bayesian Joint Logistic Model (BJLM) inferred haplotype TC had a substantially higher estimated effect size (OR=12.16, 95% CI = 2.47-90.1 vs. 9.24, 95% CI = 1.81-47.2) and more significant p-value (0.00044 vs. 0.008) for Hodgkin Disease compared to a traditional logistic regression approach. Also, the effect sizes of haplotypes modeled with recessive genetic effects were higher (and had more significant p-values) when analyzed with the BJLM. Full genetic models with haplotype information developed with the BJLM resulted in significantly higher discriminatory power and a significantly higher Net Reclassification Index compared to those developed with haplo.stats for lung cancer. Future analysis for this work could be to incorporate the 1000 Genomes project, which offers a larger selection of SNPs can be incorporated into the information from known genetic sources as well. Other future analysis include testing non-binary outcomes, like the levels of biomarkers that are present in lung cancer (NNK), and extending this analysis to full GWAS studies.
Resumo:
This paper reports a comparison of three modeling strategies for the analysis of hospital mortality in a sample of general medicine inpatients in a Department of Veterans Affairs medical center. Logistic regression, a Markov chain model, and longitudinal logistic regression were evaluated on predictive performance as measured by the c-index and on accuracy of expected numbers of deaths compared to observed. The logistic regression used patient information collected at admission; the Markov model was comprised of two absorbing states for discharge and death and three transient states reflecting increasing severity of illness as measured by laboratory data collected during the hospital stay; longitudinal regression employed Generalized Estimating Equations (GEE) to model covariance structure for the repeated binary outcome. Results showed that the logistic regression predicted hospital mortality as well as the alternative methods but was limited in scope of application. The Markov chain provides insights into how day to day changes of illness severity lead to discharge or death. The longitudinal logistic regression showed that increasing illness trajectory is associated with hospital mortality. The conclusion is reached that for standard applications in modeling hospital mortality, logistic regression is adequate, but for new challenges facing health services research today, alternative methods are equally predictive, practical, and can provide new insights. ^
Resumo:
OBJECTIVES This study aimed to update the Logistic Clinical SYNTAX score to predict 3-year survival after percutaneous coronary intervention (PCI) and compare the performance with the SYNTAX score alone. BACKGROUND The SYNTAX score is a well-established angiographic tool to predict long-term outcomes after PCI. The Logistic Clinical SYNTAX score, developed by combining clinical variables with the anatomic SYNTAX score, has been shown to perform better than the SYNTAX score alone in predicting 1-year outcomes after PCI. However, the ability of this score to predict long-term survival is unknown. METHODS Patient-level data (N = 6,304, 399 deaths within 3 years) from 7 contemporary PCI trials were analyzed. We revised the overall risk and the predictor effects in the core model (SYNTAX score, age, creatinine clearance, and left ventricular ejection fraction) using Cox regression analysis to predict mortality at 3 years. We also updated the extended model by combining the core model with additional independent predictors of 3-year mortality (i.e., diabetes mellitus, peripheral vascular disease, and body mass index). RESULTS The revised Logistic Clinical SYNTAX models showed better discriminative ability than the anatomic SYNTAX score for the prediction of 3-year mortality after PCI (c-index: SYNTAX score, 0.61; core model, 0.71; and extended model, 0.73 in a cross-validation procedure). The extended model in particular performed better in differentiating low- and intermediate-risk groups. CONCLUSIONS Risk scores combining clinical characteristics with the anatomic SYNTAX score substantially better predict 3-year mortality than the SYNTAX score alone and should be used for long-term risk stratification of patients undergoing PCI.
Resumo:
Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed. Also, synthetic positive control was used to validate Random Forests method. Throughout this study we showed that Random Forests can deal with large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. ^ Random Forests can deal with the large-scale data sets without rigorous data preprocessing. It has robust variable importance ranking measure. Proposed is a novel variable selection method in context of Random Forests that uses the data noise level as the cut-off value to determine the subset of the important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors for complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. ^ When the data set had high variables to observations ratio, Random Forests complemented the established logistic regression. This study suggested that Random Forests is recommended for such high dimensionality data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations. ^ We also found that the mean decrease of accuracy is a more reliable variable ranking measurement than mean decrease of Gini. ^
Resumo:
My dissertation focuses on developing methods for gene-gene/environment interactions and imprinting effect detections for human complex diseases and quantitative traits. It includes three sections: (1) generalizing the Natural and Orthogonal interaction (NOIA) model for the coding technique originally developed for gene-gene (GxG) interaction and also to reduced models; (2) developing a novel statistical approach that allows for modeling gene-environment (GxE) interactions influencing disease risk, and (3) developing a statistical approach for modeling genetic variants displaying parent-of-origin effects (POEs), such as imprinting. In the past decade, genetic researchers have identified a large number of causal variants for human genetic diseases and traits by single-locus analysis, and interaction has now become a hot topic in the effort to search for the complex network between multiple genes or environmental exposures contributing to the outcome. Epistasis, also known as gene-gene interaction is the departure from additive genetic effects from several genes to a trait, which means that the same alleles of one gene could display different genetic effects under different genetic backgrounds. In this study, we propose to implement the NOIA model for association studies along with interaction for human complex traits and diseases. We compare the performance of the new statistical models we developed and the usual functional model by both simulation study and real data analysis. Both simulation and real data analysis revealed higher power of the NOIA GxG interaction model for detecting both main genetic effects and interaction effects. Through application on a melanoma dataset, we confirmed the previously identified significant regions for melanoma risk at 15q13.1, 16q24.3 and 9p21.3. We also identified potential interactions with these significant regions that contribute to melanoma risk. Based on the NOIA model, we developed a novel statistical approach that allows us to model effects from a genetic factor and binary environmental exposure that are jointly influencing disease risk. Both simulation and real data analyses revealed higher power of the NOIA model for detecting both main genetic effects and interaction effects for both quantitative and binary traits. We also found that estimates of the parameters from logistic regression for binary traits are no longer statistically uncorrelated under the alternative model when there is an association. Applying our novel approach to a lung cancer dataset, we confirmed four SNPs in 5p15 and 15q25 region to be significantly associated with lung cancer risk in Caucasians population: rs2736100, rs402710, rs16969968 and rs8034191. We also validated that rs16969968 and rs8034191 in 15q25 region are significantly interacting with smoking in Caucasian population. Our approach identified the potential interactions of SNP rs2256543 in 6p21 with smoking on contributing to lung cancer risk. Genetic imprinting is the most well-known cause for parent-of-origin effect (POE) whereby a gene is differentially expressed depending on the parental origin of the same alleles. Genetic imprinting affects several human disorders, including diabetes, breast cancer, alcoholism, and obesity. This phenomenon has been shown to be important for normal embryonic development in mammals. Traditional association approaches ignore this important genetic phenomenon. In this study, we propose a NOIA framework for a single locus association study that estimates both main allelic effects and POEs. We develop statistical (Stat-POE) and functional (Func-POE) models, and demonstrate conditions for orthogonality of the Stat-POE model. We conducted simulations for both quantitative and qualitative traits to evaluate the performance of the statistical and functional models with different levels of POEs. Our results showed that the newly proposed Stat-POE model, which ensures orthogonality of variance components if Hardy-Weinberg Equilibrium (HWE) or equal minor and major allele frequencies is satisfied, had greater power for detecting the main allelic additive effect than a Func-POE model, which codes according to allelic substitutions, for both quantitative and qualitative traits. The power for detecting the POE was the same for the Stat-POE and Func-POE models under HWE for quantitative traits.
Resumo:
Species selection for forest restoration is often supported by expert knowledge on local distribution patterns of native tree species. This approach is not applicable to largely deforested regions unless enough data on pre-human tree species distribution is available. In such regions, ecological niche models may provide essential information to support species selection in the framework of forest restoration planning. In this study we used ecological niche models to predict habitat suitability for native tree species in "Tierra de Campos" region, an almost totally deforested area of the Duero Basin (Spain). Previously available models provide habitat suitability predictions for dominant native tree species, but including non-dominant tree species in the forest restoration planning may be desirable to promote biodiversity, specially in largely deforested areas were near seed sources are not expected. We used the Forest Map of Spain as species occurrence data source to maximize the number of modeled tree species. Penalized logistic regression was used to train models using climate and lithological predictors. Using model predictions a set of tools were developed to support species selection in forest restoration planning. Model predictions were used to build ordered lists of suitable species for each cell of the study area. The suitable species lists were summarized drawing maps that showed the two most suitable species for each cell. Additionally, potential distribution maps of the suitable species for the study area were drawn. For a scenario with two dominant species, the models predicted a mixed forest (Quercus ilex and a coniferous tree species) for almost one half of the study area. According to the models, 22 non-dominant native tree species are suitable for the study area, with up to six suitable species per cell. The model predictions pointed to Crataegus monogyna, Juniperus communis, J.oxycedrus and J.phoenicea as the most suitable non-dominant native tree species in the study area. Our results encourage further use of ecological niche models for forest restoration planning in largely deforested regions.
Resumo:
Predicting failures in a distributed system based on previous events through logistic regression is a standard approach in literature. This technique is not reliable, though, in two situations: in the prediction of rare events, which do not appear in enough proportion for the algorithm to capture, and in environments where there are too many variables, as logistic regression tends to overfit on this situations; while manually selecting a subset of variables to create the model is error- prone. On this paper, we solve an industrial research case that presented this situation with a combination of elastic net logistic regression, a method that allows us to automatically select useful variables, a process of cross-validation on top of it and the application of a rare events prediction technique to reduce computation time. This process provides two layers of cross- validation that automatically obtain the optimal model complexity and the optimal mode l parameters values, while ensuring even rare events will be correctly predicted with a low amount of training instances. We tested this method against real industrial data, obtaining a total of 60 out of 80 possible models with a 90% average model accuracy.
Resumo:
Background and Objective: To examine if commonly recommended assumptions for multivariable logistic regression are addressed in two major epidemiological journals. Methods: Ninety-nine articles from the Journal of Clinical Epidemiology and the American Journal of Epidemiology were surveyed for 10 criteria: six dealing with computation and four with reporting multivariable logistic regression results. Results: Three of the 10 criteria were addressed in 50% or more of the articles. Statistical significance testing or confidence intervals were reported in all articles. Methods for selecting independent variables were described in 82%, and specific procedures used to generate the models were discussed in 65%. Fewer than 50% of the articles indicated if interactions were tested or met the recommended events per independent variable ratio of 10: 1. Fewer than 20% of the articles described conformity to a linear gradient, examined collinearity, reported information on validation procedures, goodness-of-fit, discrimination statistics, or provided complete information on variable coding. There was no significant difference (P >.05) in the proportion of articles meeting the criteria across the two journals. Conclusion: Articles reviewed frequently did not report commonly recommended assumptions for using multivariable logistic regression. (C) 2004 Elsevier Inc. All rights reserved.
Resumo:
Standard factorial designs sometimes may be inadequate for experiments that aim to estimate a generalized linear model, for example, for describing a binary response in terms of several variables. A method is proposed for finding exact designs for such experiments that uses a criterion allowing for uncertainty in the link function, the linear predictor, or the model parameters, together with a design search. Designs are assessed and compared by simulation of the distribution of efficiencies relative to locally optimal designs over a space of possible models. Exact designs are investigated for two applications, and their advantages over factorial and central composite designs are demonstrated.
Resumo:
This paper has three primary aims: to establish an effective means for modelling mainland-island metapopulations inhabiting a dynamic landscape: to investigate the effect of immigration and dynamic changes in habitat on metapopulation patch occupancy dynamics; and to illustrate the implications of our results for decision-making and population management. We first extend the mainland-island metapopulation model of Alonso and McKane [Bull. Math. Biol. 64:913-958,2002] to incorporate a dynamic landscape. It is shown, for both the static and the dynamic landscape models, that a suitably scaled version of the process converges to a unique deterministic model as the size of the system becomes large. We also establish that. under quite general conditions, the density of occupied patches, and the densities of suitable and occupied patches, for the respective models, have approximate normal distributions. Our results not only provide us with estimates for the means and variances that are valid at all stages in the evolution of the population, but also provide a tool for fitting the models to real metapopulations. We discuss the effect of immigration and habitat dynamics on metapopulations, showing that mainland-like patches heavily influence metapopulation persistence, and we argue for adopting measures to increase connectivity between this large patch and the other island-like patches. We illustrate our results with specific reference to examples of populations of butterfly and the grasshopper Bryodema tuberculata.
Resumo:
We describe methods for estimating the parameters of Markovian population processes in continuous time, thus increasing their utility in modelling real biological systems. A general approach, applicable to any finite-state continuous-time Markovian model, is presented, and this is specialised to a computationally more efficient method applicable to a class of models called density-dependent Markov population processes. We illustrate the versatility of both approaches by estimating the parameters of the stochastic SIS logistic model from simulated data. This model is also fitted to data from a population of Bay checkerspot butterfly (Euphydryas editha bayensis), allowing us to assess the viability of this population. (c) 2006 Elsevier Inc. All rights reserved.