9 results for Two-stage stochastic model
in Collection Of Biostatistics Research Archive
Abstract:
Advances in computational biology have made simultaneous monitoring of thousands of features possible. High-throughput technologies not only provide a much richer information context in which to study various aspects of gene function, but they also present the challenge of analyzing data with a large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. In this paper, we address the question of classification in this setting by extending partial least squares (PLS), a popular dimension reduction tool in chemometrics, to the context of generalized linear regression, building on a previous approach, Iteratively ReWeighted Partial Least Squares, i.e. IRWPLS (Marx, 1996). We compare our results with two-stage PLS (Nguyen and Rocke, 2002A; Nguyen and Rocke, 2002B) and other classifiers. We show that by phrasing the problem in a generalized linear model setting and by applying bias correction to the likelihood to avoid (quasi)separation, we often get lower classification error rates.
Abstract:
Multi-site time series studies of air pollution and mortality and morbidity have figured prominently in the literature as comprehensive approaches for estimating acute effects of air pollution on health. Hierarchical models are generally used to combine site-specific information and estimate pooled air pollution effects, taking into account both within-site statistical uncertainty and across-site heterogeneity. Within a site, characteristics of time series data of air pollution and health (small pollution effects, missing data, highly correlated predictors, nonlinear confounding, etc.) make modelling all sources of uncertainty challenging. One potential consequence is underestimation of the statistical variance of the site-specific effects to be combined. In this paper we investigate the impact of variance underestimation on the pooled relative rate estimate. We focus on two-stage normal-normal hierarchical models and on underestimation of the statistical variance at the first stage. By mathematical considerations and simulation studies, we found that variance underestimation does not affect the pooled estimate substantially. However, some sensitivity of the pooled estimate to variance underestimation is observed when the number of sites is small and underestimation is severe. These simulation results are applicable to any two-stage normal-normal hierarchical model for combining site-specific results, and they can be easily extended to more general hierarchical formulations. We also examined the impact of variance underestimation on the national average relative rate estimate from the National Morbidity Mortality Air Pollution Study, and we found that variance underestimation of as much as 40% has little effect on the national average.
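The effect of first-stage variance underestimation on the pooled estimate can be illustrated with a minimal sketch. It assumes the usual two-stage normal-normal form (site estimate ~ N(site effect, v), site effect ~ N(mu, tau2)) with the between-site variance tau2 treated as known, so the pooled estimate is a precision-weighted mean. The site estimates, variances, and tau2 below are made-up illustrative numbers, not values from the paper, and the function names are hypothetical.

```python
def pooled_estimate(betas, variances, tau2):
    """Precision-weighted pooled mean, assuming a known between-site variance tau2."""
    weights = [1.0 / (v + tau2) for v in variances]
    return sum(w * b for w, b in zip(weights, betas)) / sum(weights)


def underestimate(variances, fraction):
    """Shrink every first-stage variance by the given fraction (0.4 means 40% too small)."""
    return [v * (1.0 - fraction) for v in variances]


# Hypothetical site-specific estimates and first-stage variances (illustration only).
betas = [0.2, 0.5, 0.9, 0.4]
variances = [0.04, 0.09, 0.25, 0.01]
tau2 = 0.05  # assumed between-site variance

baseline = pooled_estimate(betas, variances, tau2)
shrunk = pooled_estimate(betas, underestimate(variances, 0.4), tau2)

print("pooled estimate, correct variances:       ", baseline)
print("pooled estimate, variances 40% too small: ", shrunk)
```

Under these assumptions, underestimation only changes the relative weights of the sites. When tau2 is zero and all variances are shrunk by the same fraction, the weights are rescaled by a common factor and the pooled estimate does not move at all.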
Abstract:
In this paper, we develop Bayesian hierarchical distributed lag models for estimating associations between daily variations in summer ozone levels and daily variations in cardiovascular and respiratory (CVDRESP) mortality counts for 19 large U.S. cities included in the National Morbidity Mortality Air Pollution Study (NMMAPS) for the period 1987 - 1994. At the first stage, we define a semi-parametric distributed lag Poisson regression model to estimate city-specific relative rates of CVDRESP mortality associated with short-term exposure to summer ozone. At the second stage, we specify a class of distributions for the true city-specific relative rates to estimate an overall effect, taking into account the variability within and across cities. We perform the calculations with respect to several random effects distributions (normal, Student's t, and mixture of normals), thus relaxing the common assumption of a two-stage normal-normal hierarchical model. We assess the sensitivity of the results to: 1) lag structure for ozone exposure; 2) degree of adjustment for long-term trends; 3) inclusion of other pollutants in the model; 4) heat waves; 5) random effects distributions; and 6) prior hyperparameters. On average across cities, we found that a 10 ppb increase in summer ozone level for every day in the previous week is associated with a 1.25 percent increase in CVDRESP mortality (95% posterior region: 0.47, 2.03). The relative rate estimates are also positive and statistically significant at lags 0, 1, and 2. We found that associations between summer ozone and CVDRESP mortality are sensitive to the confounding adjustment for PM_10, but are robust to: 1) the adjustment for long-term trends and for other gaseous pollutants (NO_2, SO_2, and CO); 2) the distributional assumptions at the second stage of the hierarchical model; and 3) the prior distributions on all unknown parameters.
Bayesian hierarchical distributed lag models and their application to the NMMAPS data allow us to estimate an acute health effect associated with exposure to ambient air pollution in the last few days, on average across several locations. The application of these methods and the systematic assessment of the sensitivity of findings to model assumptions provide important epidemiological evidence for future air quality regulations.
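The reported "percent increase per 10 ppb for every day in the previous week" can be related to distributed lag coefficients with a short sketch. It assumes a log-linear Poisson model, so that the cumulative effect of a sustained increase is the sum of the lag coefficients, and it assumes seven lags (0 to 6) for "the previous week". The equal split of the total across lags is purely hypothetical, and the function names are illustrative, not from the paper.

```python
import math


def percent_increase(lag_coefficients, delta):
    """Percent change in the mortality rate for a sustained increase of `delta`
    at every lag, assuming a log-linear (Poisson) distributed lag model."""
    total = sum(lag_coefficients)
    return 100.0 * (math.exp(total * delta) - 1.0)


def implied_total_coefficient(percent, delta):
    """Invert percent_increase: the summed log relative rate per unit exposure."""
    return math.log(1.0 + percent / 100.0) / delta


# Summed coefficient per ppb implied by the reported 1.25% per 10 ppb.
total = implied_total_coefficient(1.25, 10.0)

# Hypothetical lag shape: the total spread evenly over lags 0..6 (an assumption).
hypothetical_lags = [total / 7.0] * 7

print("implied summed coefficient per ppb:", total)
print("percent increase per 10 ppb:", percent_increase(hypothetical_lags, 10.0))
```

The same conversion applies to the endpoints of the posterior region, since the mapping from the summed coefficient to a percent increase is monotone.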
Abstract:
The AEGISS (Ascertainment and Enhancement of Gastrointestinal Infection Surveillance and Statistics) project aims to use spatio-temporal statistical methods to identify anomalies in the space-time distribution of non-specific gastrointestinal infections in the UK, using the Southampton area in southern England as a test case. In this paper, we use the AEGISS project to illustrate how spatio-temporal point process methodology can be used in the development of a rapid-response, spatial surveillance system. Current surveillance of gastroenteric disease in the UK relies on general practitioners reporting cases of suspected food poisoning through a statutory notification scheme, voluntary laboratory reports of the isolation of gastrointestinal pathogens, and standard reports of general outbreaks of infectious intestinal disease by public health and environmental health authorities. However, most statutory notifications are made only after a laboratory reports the isolation of a gastrointestinal pathogen. As a result, detection is delayed and the ability to react to an emerging outbreak is reduced. For more detailed discussion, see Diggle et al. (2003). A new and potentially valuable source of data on the incidence of non-specific gastroenteric infections in the UK is NHS Direct, a 24-hour phone-in clinical advice service. NHS Direct data are less likely than reports by general practitioners to suffer from spatially and temporally localized inconsistencies in reporting rates. Also, reporting delays by patients are likely to be reduced, as no appointments are needed. Against this, NHS Direct data sacrifice specificity. Each call to NHS Direct is classified only according to the general pattern of reported symptoms (Cooper et al., 2003). The current paper focuses on the use of spatio-temporal statistical analysis for early detection of unexplained variation in the spatio-temporal incidence of non-specific gastroenteric symptoms, as reported to NHS Direct.
Section 2 describes our statistical formulation of this problem, the nature of the available data and our approach to predictive inference. Section 3 describes the stochastic model. Section 4 gives the results of fitting the model to NHS Direct data. Section 5 shows how the model is used for spatio-temporal prediction. The paper concludes with a short discussion.
Abstract:
The etiology of complex diseases is heterogeneous. The presence of risk alleles in one or more genetic loci affects the function of a variety of intermediate biological pathways, resulting in the overt expression of disease. Hence, there is an increasing focus on identifying the genetic basis of disease by systematically studying phenotypic traits pertaining to the underlying biological functions. In this paper we focus on identifying genetic loci linked to quantitative phenotypic traits in experimental crosses. Such genetic mapping methods often use a one-stage design by genotyping all the markers of interest on the available subjects. A genome scan based on single-locus or multi-locus models is used to identify the putative loci. Since the number of quantitative trait loci (QTLs) is very likely to be small relative to the number of markers genotyped, a one-stage selective genotyping approach is commonly used to reduce the genotyping burden, whereby markers are genotyped solely on individuals with extreme trait values. This approach is powerful in the presence of a single QTL but may result in substantial loss of information in the presence of multiple QTLs. Here we investigate the efficiency of sequential two-stage designs to identify QTLs in experimental populations. Our investigations for backcross and F2 crosses suggest that genotyping all the markers on 60% of the subjects in Stage 1, genotyping the chromosomes significant at the 20% level using additional subjects in Stage 2, and testing using all the subjects provides an efficient approach to identify the QTLs and utilizes only 70% of the genotyping burden relative to a one-stage design, regardless of the heritability and genotyping density. Complex traits are a consequence of multiple QTLs conferring main effects as well as epistatic interactions.
We propose a two-stage analytic approach where a single-locus genome scan is conducted in Stage 1 to identify promising chromosomes, and interactions are examined using the loci on these chromosomes in Stage 2. We examine settings under which the two-stage analytic approach provides sufficient power to detect the putative QTLs.
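The genotyping-burden arithmetic for the two-stage design can be sketched as follows. This is only a back-of-the-envelope reading of the abstract, under assumptions it does not state: burden is counted as marker-genotypes relative to genotyping every marker on every subject, markers are spread evenly across chromosomes, and the Stage 2 subjects (the remaining 40%) are genotyped only on the chromosomes carried forward. The function names are hypothetical.

```python
def two_stage_burden(stage1_fraction, selected_fraction):
    """Relative genotyping burden of a two-stage design versus a one-stage design.

    stage1_fraction:   share of subjects genotyped on all markers in Stage 1.
    selected_fraction: share of markers (chromosomes) carried into Stage 2,
                       genotyped on the remaining subjects (an assumption).
    """
    return stage1_fraction + (1.0 - stage1_fraction) * selected_fraction


def implied_selected_fraction(stage1_fraction, burden):
    """Share of markers carried into Stage 2 that would yield the given burden."""
    return (burden - stage1_fraction) / (1.0 - stage1_fraction)


# Figures quoted in the abstract: 60% of subjects in Stage 1, 70% overall burden.
fraction = implied_selected_fraction(0.6, 0.7)
print("implied share of markers genotyped in Stage 2:", fraction)
print("burden check:", two_stage_burden(0.6, fraction))
```

Under this reading, the quoted 70% burden would correspond to carrying roughly a quarter of the markers into Stage 2; the paper's own accounting may differ.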
Abstract:
Under a two-level hierarchical model, suppose that the distribution of the random parameter is known or can be estimated well. Data are generated via a fixed but unobservable realization of this parameter. In this paper, we derive the smallest confidence region of the random parameter under a joint Bayesian/frequentist paradigm. On average, this optimal region can be much smaller than the corresponding Bayesian highest posterior density region. The new estimation procedure is appealing when one deals with data generated under a highly parallel structure, for example, data from a trial with a large number of clinical centers involved or genome-wide gene-expression data for estimating individual gene- or center-specific parameters simultaneously. The new proposal is illustrated with a typical microarray data set and its performance is examined via a small simulation study.
Nonparametric Inference Procedure For Percentiles of the Random Effect Distribution in Meta Analysis
Abstract:
Multiple outcomes data are commonly used to characterize treatment effects in medical research, for instance, multiple symptoms to characterize potential remission of a psychiatric disorder. Often a global, i.e. symptom-invariant, treatment effect is evaluated. Such a treatment effect may overgeneralize the effect across the outcomes. On the other hand, individual treatment effects, varying across all outcomes, are complicated to interpret, and their estimation may lose precision relative to a global summary. An effective compromise may be to summarize the treatment effect through patterns of the treatment effects, i.e. "differentiated effects." In this paper we propose a two-category model to differentiate treatment effects into two groups. A model fitting algorithm and simulation study are presented, and several methods are developed to analyze heterogeneity present in the treatment effects. The method is illustrated using an analysis of schizophrenia symptom data.