863 resultados para random regression model
Resumo:
The aim of this thesis is to apply multilevel regression model in context of household surveys. Hierarchical structure in this type of data is characterized by many small groups. In last years comparative and multilevel analysis in the field of perceived health have grown in size. The purpose of this thesis is to develop a multilevel analysis with three level of hierarchy for Physical Component Summary outcome to: evaluate magnitude of within and between variance at each level (individual, household and municipality); explore which covariates affect on perceived physical health at each level; compare model-based and design-based approach in order to establish informativeness of sampling design; estimate a quantile regression for hierarchical data. The target population are the Italian residents aged 18 years and older. Our study shows a high degree of homogeneity within level 1 units belonging from the same group, with an intraclass correlation of 27% in a level-2 null model. Almost all variance is explained by level 1 covariates. In fact, in our model the explanatory variables having more impact on the outcome are disability, unable to work, age and chronic diseases (18 pathologies). An additional analysis are performed by using novel procedure of analysis :"Linear Quantile Mixed Model", named "Multilevel Linear Quantile Regression", estimate. This give us the possibility to describe more generally the conditional distribution of the response through the estimation of its quantiles, while accounting for the dependence among the observations. This has represented a great advantage of our models with respect to classic multilevel regression. The median regression with random effects reveals to be more efficient than the mean regression in representation of the outcome central tendency. A more detailed analysis of the conditional distribution of the response on other quantiles highlighted a differential effect of some covariate along the distribution.
Resumo:
Indoor radon is regularly measured in Switzerland. However, a nationwide model to predict residential radon levels has not been developed. The aim of this study was to develop a prediction model to assess indoor radon concentrations in Switzerland. The model was based on 44,631 measurements from the nationwide Swiss radon database collected between 1994 and 2004. Of these, 80% randomly selected measurements were used for model development and the remaining 20% for an independent model validation. A multivariable log-linear regression model was fitted and relevant predictors selected according to evidence from the literature, the adjusted R², the Akaike's information criterion (AIC), and the Bayesian information criterion (BIC). The prediction model was evaluated by calculating Spearman rank correlation between measured and predicted values. Additionally, the predicted values were categorised into three categories (50th, 50th-90th and 90th percentile) and compared with measured categories using a weighted Kappa statistic. The most relevant predictors for indoor radon levels were tectonic units and year of construction of the building, followed by soil texture, degree of urbanisation, floor of the building where the measurement was taken and housing type (P-values <0.001 for all). Mean predicted radon values (geometric mean) were 66 Bq/m³ (interquartile range 40-111 Bq/m³) in the lowest exposure category, 126 Bq/m³ (69-215 Bq/m³) in the medium category, and 219 Bq/m³ (108-427 Bq/m³) in the highest category. Spearman correlation between predictions and measurements was 0.45 (95%-CI: 0.44; 0.46) for the development dataset and 0.44 (95%-CI: 0.42; 0.46) for the validation dataset. Kappa coefficients were 0.31 for the development and 0.30 for the validation dataset, respectively. The model explained 20% overall variability (adjusted R²). In conclusion, this residential radon prediction model, based on a large number of measurements, was demonstrated to be robust through validation with an independent dataset. The model is appropriate for predicting radon level exposure of the Swiss population in epidemiological research. Nevertheless, some exposure misclassification and regression to the mean is unavoidable and should be taken into account in future applications of the model.
Resumo:
Parents and children, starting at very young ages, discuss religious and spiritual issues¿where we come from, what happens to us after we die, is there a God, and so on. Unfortunately, few studies have analyzed the content and structure of parent-child conversation about religion and spirituality (Boyatzis & Janicki, 2003; Dollahite & Thatcher, 2009), and most studies have relied on self-report with no direct observation. The current study examined mother-child (M-C) spiritual discourse to learn about its content, structure, and frequency through a survey inventory in combination with direct video observation using a novel structured task. We also analyzed how mothers¿ religiosity along several major dimensions related to their communication behaviors within both methods. Mothers (N = 39, M age = 40) of children aged 3-12 completed a survey packet on M-C spiritual discourse and standard measures of mothers¿ religious fundamentalism, intrinsic religiosity, sanctification of parenting (how much the mother saw herself as doing God¿s work as a parent), and a new measure of parental openness to children¿s spirituality. Then, in a structured task in our lab, mothers (N = 33) and children (M age = 7.33) watched a short film or read a short book that explored death in an age-appropriate manner and then engaged in a videotaped conversation about the movie or book and their religious or spiritual beliefs. Frequency of M-C spiritual discourse was positively related to mothers¿ religious fundamentalism (r = .71, p = .00), intrinsic religiosity (r = .77, p = .00), and sanctification of parenting (r = .79, p = .00), but, surprisingly, was inversely related to mothers¿ v openness to child¿s spirituality (r = -.52, p = .00). Survey data showed that the two most common topics discussed were God (once a week) and religion as it relates to moral issues (once a week). According to mothers their children¿s most common method of initiating spiritual discourse was to repeat what he or she has heard parents or family say about religious issues (M = 2.97; once a week); mothers¿ most common method was to describe their own religious/spiritual beliefs (M = 2.92). Spiritual discourse most commonly occurred either at bedtime or mealtime as reported by 26% of mothers, with the most common triggers reported as daily routine/random thoughts (once a week) and observations of nature (once a week). Mothers¿ most important goals for spiritual discourse were to let their children know that they love them (M = 3.72; very important) and to help them become a good and moral person (M = 3.67; very important). A regression model showed that significant variance in frequency of mother-child spiritual discourse (R2 = .84, p = .00) was predicted by the mothers¿ importance of goals during discourse (ß = 0.46, p = .00), frequency that the mother¿s spirituality was deepened through spiritual discourse (ß = 0.39, p = .00), and the mother¿s fundamentalism (ß = 0.20, p = .05). In a separate regression, the mother¿s comfort in the structured task (ß = 0.70, p = .00), and the number of open-ended questions she asked (ß = -0.26, p = .03) predicted the reciprocity between mother and child (R2 = .62, p = .00). In addition, the mother¿s age (ß = 0.22, p = .059) and comfort during the task (ß = 0.73, p = .00) predicted the child¿s engagement within the structured task. Other findings and theoretical and methodological implications will be discussed.
Resumo:
The construction of a reliable, practically useful prediction rule for future response is heavily dependent on the "adequacy" of the fitted regression model. In this article, we consider the absolute prediction error, the expected value of the absolute difference between the future and predicted responses, as the model evaluation criterion. This prediction error is easier to interpret than the average squared error and is equivalent to the mis-classification error for the binary outcome. We show that the distributions of the apparent error and its cross-validation counterparts are approximately normal even under a misspecified fitted model. When the prediction rule is "unsmooth", the variance of the above normal distribution can be estimated well via a perturbation-resampling method. We also show how to approximate the distribution of the difference of the estimated prediction errors from two competing models. With two real examples, we demonstrate that the resulting interval estimates for prediction errors provide much more information about model adequacy than the point estimates alone.
Resumo:
Latent class regression models are useful tools for assessing associations between covariates and latent variables. However, evaluation of key model assumptions cannot be performed using methods from standard regression models due to the unobserved nature of latent outcome variables. This paper presents graphical diagnostic tools to evaluate whether or not latent class regression models adhere to standard assumptions of the model: conditional independence and non-differential measurement. An integral part of these methods is the use of a Markov Chain Monte Carlo estimation procedure. Unlike standard maximum likelihood implementations for latent class regression model estimation, the MCMC approach allows us to calculate posterior distributions and point estimates of any functions of parameters. It is this convenience that allows us to provide the diagnostic methods that we introduce. As a motivating example we present an analysis focusing on the association between depression and socioeconomic status, using data from the Epidemiologic Catchment Area study. We consider a latent class regression analysis investigating the association between depression and socioeconomic status measures, where the latent variable depression is regressed on education and income indicators, in addition to age, gender, and marital status variables. While the fitted latent class regression model yields interesting results, the model parameters are found to be invalid due to the violation of model assumptions. The violation of these assumptions is clearly identified by the presented diagnostic plots. These methods can be applied to standard latent class and latent class regression models, and the general principle can be extended to evaluate model assumptions in other types of models.
Resumo:
In this paper, we develop Bayesian hierarchical distributed lag models for estimating associations between daily variations in summer ozone levels and daily variations in cardiovascular and respiratory (CVDRESP) mortality counts for 19 U.S. large cities included in the National Morbidity Mortality Air Pollution Study (NMMAPS) for the period 1987 - 1994. At the first stage, we define a semi-parametric distributed lag Poisson regression model to estimate city-specific relative rates of CVDRESP associated with short-term exposure to summer ozone. At the second stage, we specify a class of distributions for the true city-specific relative rates to estimate an overall effect by taking into account the variability within and across cities. We perform the calculations with respect to several random effects distributions (normal, t-student, and mixture of normal), thus relaxing the common assumption of a two-stage normal-normal hierarchical model. We assess the sensitivity of the results to: 1) lag structure for ozone exposure; 2) degree of adjustment for long-term trends; 3) inclusion of other pollutants in the model;4) heat waves; 5) random effects distributions; and 6) prior hyperparameters. On average across cities, we found that a 10ppb increase in summer ozone level for every day in the previous week is associated with 1.25 percent increase in CVDRESP mortality (95% posterior regions: 0.47, 2.03). The relative rate estimates are also positive and statistically significant at lags 0, 1, and 2. We found that associations between summer ozone and CVDRESP mortality are sensitive to the confounding adjustment for PM_10, but are robust to: 1) the adjustment for long-term trends, other gaseous pollutants (NO_2, SO_2, and CO); 2) the distributional assumptions at the second stage of the hierarchical model; and 3) the prior distributions on all unknown parameters. Bayesian hierarchical distributed lag models and their application to the NMMAPS data allow us estimation of an acute health effect associated with exposure to ambient air pollution in the last few days on average across several locations. The application of these methods and the systematic assessment of the sensitivity of findings to model assumptions provide important epidemiological evidence for future air quality regulations.
Resumo:
This paper proposes Poisson log-linear multilevel models to investigate population variability in sleep state transition rates. We specifically propose a Bayesian Poisson regression model that is more flexible, scalable to larger studies, and easily fit than other attempts in the literature. We further use hierarchical random effects to account for pairings of individuals and repeated measures within those individuals, as comparing diseased to non-diseased subjects while minimizing bias is of epidemiologic importance. We estimate essentially non-parametric piecewise constant hazards and smooth them, and allow for time varying covariates and segment of the night comparisons. The Bayesian Poisson regression is justified through a re-derivation of a classical algebraic likelihood equivalence of Poisson regression with a log(time) offset and survival regression assuming piecewise constant hazards. This relationship allows us to synthesize two methods currently used to analyze sleep transition phenomena: stratified multi-state proportional hazards models and log-linear models with GEE for transition counts. An example data set from the Sleep Heart Health Study is analyzed.
Resumo:
Background mortality is an essential component of any forest growth and yield model. Forecasts of mortality contribute largely to the variability and accuracy of model predictions at the tree, stand and forest level. In the present study, I implement and evaluate state-of-the-art techniques to increase the accuracy of individual tree mortality models, similar to those used in many of the current variants of the Forest Vegetation Simulator, using data from North Idaho and Montana. The first technique addresses methods to correct for bias induced by measurement error typically present in competition variables. The second implements survival regression and evaluates its performance against the traditional logistic regression approach. I selected the regression calibration (RC) algorithm as a good candidate for addressing the measurement error problem. Two logistic regression models for each species were fitted, one ignoring the measurement error, which is the “naïve” approach, and the other applying RC. The models fitted with RC outperformed the naïve models in terms of discrimination when the competition variable was found to be statistically significant. The effect of RC was more obvious where measurement error variance was large and for more shade-intolerant species. The process of model fitting and variable selection revealed that past emphasis on DBH as a predictor variable for mortality, while producing models with strong metrics of fit, may make models less generalizable. The evaluation of the error variance estimator developed by Stage and Wykoff (1998), and core to the implementation of RC, in different spatial patterns and diameter distributions, revealed that the Stage and Wykoff estimate notably overestimated the true variance in all simulated stands, but those that are clustered. Results show a systematic bias even when all the assumptions made by the authors are guaranteed. I argue that this is the result of the Poisson-based estimate ignoring the overlapping area of potential plots around a tree. Effects, especially in the application phase, of the variance estimate justify suggested future efforts of improving the accuracy of the variance estimate. The second technique implemented and evaluated is a survival regression model that accounts for the time dependent nature of variables, such as diameter and competition variables, and the interval-censored nature of data collected from remeasured plots. The performance of the model is compared with the traditional logistic regression model as a tool to predict individual tree mortality. Validation of both approaches shows that the survival regression approach discriminates better between dead and alive trees for all species. In conclusion, I showed that the proposed techniques do increase the accuracy of individual tree mortality models, and are a promising first step towards the next generation of background mortality models. I have also identified the next steps to undertake in order to advance mortality models further.
Resumo:
Monte Carlo simulation was used to evaluate properties of a simple Bayesian MCMC analysis of the random effects model for single group Cormack-Jolly-Seber capture-recapture data. The MCMC method is applied to the model via a logit link, so parameters p, S are on a logit scale, where logit(S) is assumed to have, and is generated from, a normal distribution with mean μ and variance σ2 . Marginal prior distributions on logit(p) and μ were independent normal with mean zero and standard deviation 1.75 for logit(p) and 100 for μ ; hence minimally informative. Marginal prior distribution on σ2 was placed on τ2=1/σ2 as a gamma distribution with α=β=0.001 . The study design has 432 points spread over 5 factors: occasions (t) , new releases per occasion (u), p, μ , and σ . At each design point 100 independent trials were completed (hence 43,200 trials in total), each with sample size n=10,000 from the parameter posterior distribution. At 128 of these design points comparisons are made to previously reported results from a method of moments procedure. We looked at properties of point and interval inference on μ , and σ based on the posterior mean, median, and mode and equal-tailed 95% credibility interval. Bayesian inference did very well for the parameter μ , but under the conditions used here, MCMC inference performance for σ was mixed: poor for sparse data (i.e., only 7 occasions) or σ=0 , but good when there were sufficient data and not small σ .
Resumo:
BACKGROUND Several treatment strategies are available for adults with advanced-stage Hodgkin's lymphoma, but studies assessing two alternative standards of care-increased dose bleomycin, etoposide, doxorubicin, cyclophosphamide, vincristine, procarbazine, and prednisone (BEACOPPescalated), and doxorubicin, bleomycin, vinblastine, and dacarbazine (ABVD)-were not powered to test differences in overall survival. To guide treatment decisions in this population of patients, we did a systematic review and network meta-analysis to identify the best initial treatment strategy. METHODS We searched the Cochrane Library, Medline, and conference proceedings for randomised controlled trials published between January, 1980, and June, 2013, that assessed overall survival in patients with advanced-stage Hodgkin's lymphoma given BEACOPPbaseline, BEACOPPescalated, BEACOPP variants, ABVD, cyclophosphamide (mechlorethamine), vincristine, procarbazine, and prednisone (C[M]OPP), hybrid or alternating chemotherapy regimens with ABVD as the backbone (eg, COPP/ABVD, MOPP/ABVD), or doxorubicin, vinblastine, mechlorethamine, vincristine, bleomycin, etoposide, and prednisone combined with radiation therapy (the Stanford V regimen). We assessed studies for eligibility, extracted data, and assessed their quality. We then pooled the data and used a Bayesian random-effects model to combine direct comparisons with indirect evidence. We also reconstructed individual patient survival data from published Kaplan-Meier curves and did standard random-effects Poisson regression. Results are reported relative to ABVD. The primary outcome was overall survival. FINDINGS We screened 2055 records and identified 75 papers covering 14 eligible trials that assessed 11 different regimens in 9993 patients, providing 59 651 patient-years of follow-up. 1189 patients died, and the median follow-up was 5·9 years (IQR 4·9-6·7). Included studies were of high methodological quality, and between-trial heterogeneity was negligible (τ(2)=0·01). Overall survival was highest in patients who received six cycles of BEACOPPescalated (HR 0·38, 95% credibility interval [CrI] 0·20-0·75). Compared with a 5 year survival of 88% for ABVD, the survival benefit for six cycles of BEACOPPescalated is 7% (95% CrI 3-10)-ie, a 5 year survival of 95%. Reconstructed individual survival data showed that, at 5 years, BEACOPPescalated has a 10% (95% CI 3-15) advantage over ABVD in overall survival. INTERPRETATION Six cycles of BEACOPPescalated significantly improves overall survival compared with ABVD and other regimens, and thus we recommend this treatment strategy as standard of care for patients with access to the appropriate supportive care.
Resumo:
OBJECTIVES Although the use of an adjudication committee (AC) for outcomes is recommended in randomized controlled trials, there are limited data on the process of adjudication. We therefore aimed to assess whether the reporting of the adjudication process in venous thromboembolism (VTE) trials meets existing quality standards and which characteristics of trials influence the use of an AC. STUDY DESIGN AND SETTING We systematically searched MEDLINE and the Cochrane Library from January 1, 2003, to June 1, 2012, for randomized controlled trials on VTE. We abstracted information about characteristics and quality of trials and reporting of adjudication processes. We used stepwise backward logistic regression model to identify trial characteristics independently associated with the use of an AC. RESULTS We included 161 trials. Of these, 68.9% (111 of 161) reported the use of an AC. Overall, 99.1% (110 of 111) of trials with an AC used independent or blinded ACs, 14.4% (16 of 111) reported how the adjudication decision was reached within the AC, and 4.5% (5 of 111) reported on whether the reliability of adjudication was assessed. In multivariate analyses, multicenter trials [odds ratio (OR), 8.6; 95% confidence interval (CI): 2.7, 27.8], use of a data safety-monitoring board (OR, 3.7; 95% CI: 1.2, 11.6), and VTE as the primary outcome (OR, 5.7; 95% CI: 1.7, 19.4) were associated with the use of an AC. Trials without random allocation concealment (OR, 0.3; 95% CI: 0.1, 0.8) and open-label trials (OR, 0.3; 95% CI: 0.1, 1.0) were less likely to report an AC. CONCLUSION Recommended processes of adjudication are underreported and lack standardization in VTE-related clinical trials. The use of an AC varies substantially by trial characteristics.
Resumo:
Brain tumor is one of the most aggressive types of cancer in humans, with an estimated median survival time of 12 months and only 4% of the patients surviving more than 5 years after disease diagnosis. Until recently, brain tumor prognosis has been based only on clinical information such as tumor grade and patient age, but there are reports indicating that molecular profiling of gliomas can reveal subgroups of patients with distinct survival rates. We hypothesize that coupling molecular profiling of brain tumors with clinical information might improve predictions of patient survival time and, consequently, better guide future treatment decisions. In order to evaluate this hypothesis, the general goal of this research is to build models for survival prediction of glioma patients using DNA molecular profiles (U133 Affymetrix gene expression microarrays) along with clinical information. First, a predictive Random Forest model is built for binary outcomes (i.e. short vs. long-term survival) and a small subset of genes whose expression values can be used to predict survival time is selected. Following, a new statistical methodology is developed for predicting time-to-death outcomes using Bayesian ensemble trees. Due to a large heterogeneity observed within prognostic classes obtained by the Random Forest model, prediction can be improved by relating time-to-death with gene expression profile directly. We propose a Bayesian ensemble model for survival prediction which is appropriate for high-dimensional data such as gene expression data. Our approach is based on the ensemble "sum-of-trees" model which is flexible to incorporate additive and interaction effects between genes. We specify a fully Bayesian hierarchical approach and illustrate our methodology for the CPH, Weibull, and AFT survival models. We overcome the lack of conjugacy using a latent variable formulation to model the covariate effects which decreases computation time for model fitting. Also, our proposed models provides a model-free way to select important predictive prognostic markers based on controlling false discovery rates. We compare the performance of our methods with baseline reference survival methods and apply our methodology to an unpublished data set of brain tumor survival times and gene expression data, selecting genes potentially related to the development of the disease under study. A closing discussion compares results obtained by Random Forest and Bayesian ensemble methods under the biological/clinical perspectives and highlights the statistical advantages and disadvantages of the new methodology in the context of DNA microarray data analysis.
Resumo:
Ecosystems are faced with high rates of species loss which has consequences for their functions and services. To assess the effects of plant species diversity on the nitrogen (N) cycle, we developed a model for monthly mean nitrate (NO3-N) concentrations in soil solution in 0-30 cm mineral soil depth using plant species and functional group richness and functional composition as drivers and assessing the effects of conversion of arable land to grassland, spatially heterogeneous soil properties, and climate. We used monthly mean NO3-N concentrations from 62 plots of a grassland plant diversity experiment from 2003 to 2006. Plant species richness (1-60) and functional group composition (1-4 functional groups: legumes, grasses, non-leguminous tall herbs, non-leguminous small herbs) were manipulated in a factorial design. Plant community composition, time since conversion from arable land to grassland, soil texture, and climate data (precipitation, soil moisture, air and soil temperature) were used to develop one general Bayesian multiple regression model for the 62 plots to allow an in-depth evaluation using the experimental design. The model simulated NO3-N concentrations with an overall Bayesian coefficient of determination of 0.48. The temporal course of NO3-N concentrations was simulated differently well for the individual plots with a maximum plot-specific Nash-Sutcliffe Efficiency of 0.57. The model shows that NO3-N concentrations decrease with species richness, but this relation reverses if more than approx. 25 % of legume species are included in the mixture. Presence of legumes increases and presence of grasses decreases NO3-N concentrations compared to mixtures containing only small and tall herbs. Altogether, our model shows that there is a strong influence of plant community composition on NO3-N concentrations.
Resumo:
Consider a nonparametric regression model Y=mu*(X) + e, where the explanatory variables X are endogenous and e satisfies the conditional moment restriction E[e|W]=0 w.p.1 for instrumental variables W. It is well known that in these models the structural parameter mu* is 'ill-posed' in the sense that the function mapping the data to mu* is not continuous. In this paper, we derive the efficiency bounds for estimating linear functionals E[p(X)mu*(X)] and int_{supp(X)}p(x)mu*(x)dx, where p is a known weight function and supp(X) the support of X, without assuming mu* to be well-posed or even identified.
Resumo:
The persistence of low birth weight and intrauterine growth retardation (IUGR) in the United States has puzzled researchers for decades. Much of the work that has been conducted on adverse birth outcomes has focused on low birth weight in general and not on IUGR. Studies that have examined IUGR specifically thus far have focused primarily on individual-level maternal risk factors. These risk factors have only been able to explain a small portion of the variance in IUGR. Therefore, recent work has begun to focus on community-level risk factors in addition to the individual-level maternal characteristics. This study uses Social Ecology to examine the relationship of individual and community-level risk factors and IUGR. Logistic regression was used to establish an individual-level model based on 155, 856 births recorded in Harris County, TX during 1999-2001. IUGR was characterized using a fetal growth ratio method with race/ethnic and sex specific mean birth weights calculated from national vital records. The spatial distributions of 114,460 birth records spatially located within the City of Houston were examined using choropleth, probability and density maps. Census tracts with higher than expected rates of IUGR and high levels of neighborhood disadvantage were highlighted. Neighborhood disadvantage was constructed using socioeconomic variables from the 2000 U.S. Census. Factor analysis was used to create a unified single measure. Lastly, a random coefficients model was used to examine the relationship between varying levels of community disadvantage, given the set of individual-level risk factors for 152,997 birth records spatially located within Harris County, TX. Neighborhood disadvantage was measured using three different indices adapted from previous work. The findings show that pregnancy-induced hypertension, previous preterm infant, tobacco use and insufficient weight gain have the highest association with IUGR. Neighborhood disadvantage only slightly further increases the risk of IUGR (OR 1.12 to 1.23). Although community level disadvantage only helped to explain a small proportion of the variance of IUGR, it did have a significant impact. This finding suggests that community level risk factors should be included in future work with IUGR and that more work needs to be conducted. ^