878 results for regression discrete models
Abstract:
Purpose: Recently, multiple clinical trials have demonstrated improved outcomes in patients with metastatic colorectal cancer. This study investigated whether the improved survival is race dependent. Patients and Methods: Overall and cancer-specific survival of 77,490 White and Black patients with metastatic colorectal cancer from the 1988–2008 Surveillance, Epidemiology, and End Results (SEER) registry were compared using unadjusted and multivariable-adjusted Cox proportional hazards regression as well as competing-risk analyses. Results: Median age was 69 years, 47.4% of patients were female, and 86.0% were White. Median survival was 11 months overall, with an increase from 8 to 14 months between 1988 and 2008. Overall survival increased from 8 to 14 months for White patients and from 6 to 13 months for Black patients. After multivariable adjustment, the following characteristics were associated with better survival: White race, female sex, younger age, higher education, being married, higher income, urban residence, cancer of the rectosigmoid junction or rectum, cancer-directed surgery, and well/moderately differentiated, node-negative (N0) tumors (p<0.05 for all covariates). Discrepancies in overall survival based on race did not change significantly over time; however, there was a significant decrease in the cancer-specific survival discrepancy between White and Black patients over time, with a hazard ratio of 0.995 (95% confidence interval 0.991–1.000) per year (p=0.03). Conclusion: A clinically relevant increase in overall survival from 1988 to 2008 was found in this population-based analysis for both White and Black patients with metastatic colorectal cancer. Although both groups benefited from this improvement, a slight discrepancy between them remained.
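As an illustration only, a minimal Stata sketch of the survival methods named above (not the authors' code; the variable names survmonths, died, race, sex, age, and cod are hypothetical):

    . stset survmonths, failure(died)
    . stcox i.race                                  // unadjusted model
    . stcox i.race i.sex c.age                      // multivariable-adjusted model
    . stcrreg i.race i.sex c.age, compete(cod==2)   // competing-risk regression

Cancer-specific versus overall survival would be obtained by varying the event definition in stset.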
Abstract:
Graphical display of regression results has become increasingly popular in presentations and in scientific literature because graphs are often much easier to read than tables. Such plots can be produced in Stata by the marginsplot command (see [R] marginsplot). However, while marginsplot is versatile and flexible, it has two major limitations: it can only process results left behind by margins (see [R] margins), and it can handle only one set of results at a time. In this article, I introduce a new command called coefplot that overcomes these limitations. It plots results from any estimation command and combines results from several models into one graph. The default behavior of coefplot is to plot markers for coefficients and horizontal spikes for confidence intervals. However, coefplot can also produce other types of graphs. I illustrate the capabilities of coefplot by using a series of examples.
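As a quick illustration (my own minimal sketch using Stata's auto dataset; the stored-estimate names m1 and m2 are arbitrary):

    . sysuse auto, clear
    . regress price mpg weight
    . estimates store m1
    . regress price mpg weight foreign
    . estimates store m2
    . coefplot m1 m2, drop(_cons) xline(0)

The last line overlays both models in a single plot, with markers for the coefficients and horizontal spikes for the confidence intervals.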
Abstract:
Organizing and archiving statistical results and processing a subset of those results for publication are important and often underestimated issues in conducting statistical analyses. Because automation of these tasks is often poor, processing results produced by statistical packages is quite laborious and vulnerable to error. I will therefore present a new package called estout that facilitates and automates some of these tasks. This new command can be used to produce regression tables for use with spreadsheets, LaTeX, HTML, or word processors. For example, the results for multiple models can be organized in spreadsheets and can thus be archived in an orderly manner. Alternatively, the results can be directly saved as a publication-ready table for inclusion in, for example, a LaTeX document. estout is implemented as a wrapper for estimates table but has many additional features, such as support for mfx. However, despite its flexibility, estout is—I believe—still very straightforward and easy to use. Furthermore, estout can be customized via so-called defaults files. A tool to make available supplementary statistics called estadd is also provided.
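A minimal usage sketch (mine, not from the article; the file name is arbitrary):

    . sysuse auto, clear
    . eststo m1: regress price mpg
    . eststo m2: regress price mpg weight
    . estout m1 m2, cells(b(star fmt(3)) se(par fmt(3)))   // table in the results window
    . esttab m1 m2 using table.tex, tex replace            // publication-ready LaTeX table

estout arranges the stored results side by side for spreadsheets or further processing; esttab, a wrapper shipped with the package, formats the same results for LaTeX, HTML, RTF, or plain text.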
Abstract:
robreg provides a number of robust estimators for linear regression models. Among them are the high-breakdown-point and high-efficiency MM-estimator, the Huber and bisquare M-estimators, and the S-estimator, each supporting classic or robust standard errors. Furthermore, basic versions of the LMS/LQS (least median/quantile of squares) and LTS (least trimmed squares) estimators are provided. Note that the moremata package, also available from SSC, is required.
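Basic usage is of the form below (a sketch based on the estimators listed above; y, x1, and x2 are placeholder variable names):

    . ssc install moremata        // required helper package
    . ssc install robreg
    . robreg mm y x1 x2           // MM-estimator
    . robreg m y x1 x2            // Huber M-estimator
    . robreg s y x1 x2            // S-estimator
    . robreg lts y x1 x2          // least trimmed squares

Each subcommand selects one of the estimators described above; see the package help for tuning options such as the efficiency of the MM-estimator.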
Abstract:
Parameter estimates from commonly used multivariable parametric survival regression models do not directly quantify differences in years of life expectancy. Gaussian linear regression models give results in terms of absolute mean differences but are not appropriate for modeling life expectancy, because in many situations time to death has a negatively skewed distribution. A regression approach using a skew-normal distribution is an alternative to parametric survival models for modeling life expectancy, because parameter estimates can be interpreted in terms of survival time differences while allowing for skewness of the distribution. In this paper we show how to use skew-normal regression so that censored and left-truncated observations are accounted for. We then model differences in life expectancy using data from the Swiss National Cohort Study and from official life expectancy estimates, and we compare the results with those derived from commonly used survival regression models. We conclude that a censored skew-normal survival regression approach for left-truncated observations can be used to model differences in life expectancy across covariates of interest.
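For orientation, a sketch in standard skew-normal notation (Azzalini's parameterization; this is my summary, not the paper's derivation): the density with location \xi, scale \omega, and shape \alpha is

    f(t) = (2/\omega) \, \phi((t-\xi)/\omega) \, \Phi(\alpha (t-\xi)/\omega),

where \phi and \Phi are the standard normal density and distribution function, and \alpha = 0 recovers the Gaussian model. With right censoring and left truncation at t_0, a death observed at t contributes f(t)/S(t_0) to the likelihood and a subject censored at c contributes S(c)/S(t_0), where S = 1 - F is the skew-normal survival function.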
Abstract:
Effects of conspecific neighbours on the survival and growth of trees have been found to be related to species abundance. Both positive and negative relationships may explain observed abundance patterns. Surprisingly, it is rarely tested whether such relationships could be biased, or even spurious, owing to transformations of neighbourhood variables or to the influences of spatial aggregation, distance decay of neighbour effects, and standardization of effect sizes. To investigate potential biases, communities of 20 identical species were simulated with log-series abundances but without species-specific interactions, so that no relationship between conspecific neighbour effects on survival or growth and species abundance was expected. Survival and growth of individuals were simulated in random and aggregated spatial patterns using no, linear, or squared distance decay of neighbour effects. Regression coefficients of statistical neighbourhood models were unbiased and unrelated to species abundance. However, variation in the number of conspecific neighbours was positively or negatively related to species abundance, depending on the transformation of neighbourhood variables, the spatial pattern, and the distance decay. Consequently, effect sizes and standardized regression coefficients, which are often used in model fitting across large numbers of species, were also positively or negatively related to species abundance depending on these same factors. Tests using randomized tree positions and identities provide the best benchmarks by which to critically evaluate relationships of effect sizes or standardized regression coefficients with tree species abundance, and they will better guard against potential misinterpretations.
Abstract:
Chironomid-temperature inference models based on North American, European, and combined surface-sediment training sets were compared to assess the overall reliability of their predictions. Between 67% and 76% of the major chironomid taxa in each data set showed a unimodal response to July temperature, whereas between 5% and 22% of the common taxa showed a sigmoidal response. July temperature optima were highly correlated among the training sets, but the correlations for other taxon parameters, such as tolerances and weighted-averaging partial least squares (WA-PLS) and partial least squares (PLS) regression coefficients, were much weaker. PLS, weighted averaging, WA-PLS, and the Modern Analogue Technique all provided useful and reliable temperature inferences. Although jack-knifed error statistics suggested that two-component WA-PLS models had the highest predictive power, intercontinental tests suggested that other inference models performed better. The various models were able to provide good July temperature inferences even where no good or close modern analogues for the fossil chironomid assemblages existed. When the models were applied to fossil Lateglacial assemblages from North America and Europe, the inferred rates and magnitudes of July temperature change varied among models. All models, however, revealed similar patterns of Lateglacial temperature change. Depending on the model used, the inferred Younger Dryas July temperature decrease ranged between 2.5 and 6°C.
Abstract:
In applied work, economists often seek to relate a given response variable y to some causal parameter mu* associated with it. This parameter usually represents a summarization, based on some explanatory variables, of the distribution of y, such as a regression function, and treating it as a conditional expectation is central to its identification and estimation. However, the interpretation of mu* as a conditional expectation breaks down if some or all of the explanatory variables are endogenous. This is not a problem when mu* is modelled as a parametric function of explanatory variables, because it is well known how instrumental variables techniques can be used to identify and estimate mu*. In contrast, handling endogenous regressors in nonparametric models, where mu* is regarded as fully unknown, presents difficult theoretical and practical challenges. In this paper we consider an endogenous nonparametric model based on a conditional moment restriction. We investigate identification-related properties of this model when the unknown function mu* belongs to a linear space. We also investigate underidentification of mu* along with the identification of its linear functionals. Several examples are provided in order to develop intuition about identification and estimation for endogenous nonparametric regression and related models.
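In such models the conditional moment restriction takes the form

    E[ y - \mu^*(x) \mid w ] = 0,

where w denotes the instruments. Identification of \mu^* then amounts to uniqueness of the solution of this integral equation within the linear space to which \mu^* is assumed to belong, and a linear functional of \mu^* may be identified even when \mu^* itself is underidentified (this display is a standard formulation, added here for orientation).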
Abstract:
Random Forests™ is reported to be one of the most accurate classification algorithms for complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can deal with a large number of weak input variables without overfitting and can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities.

Random Forests can handle large-scale data sets without rigorous data preprocessing and has a robust variable-importance ranking measure. We propose a novel variable selection method, in the context of Random Forests, that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments.

When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression approach. This study suggests that Random Forests is recommended for such high-dimensional data: one can use Random Forests to select the important variables and then use logistic regression, or Random Forests itself, to estimate the effect sizes of the predictors and to classify new observations.

We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
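A sketch of the noise-based cut-off idea (my illustration, assuming the community-contributed rforest command from SSC; the variable names and options are placeholders, not the thesis's code):

    . ssc install rforest
    . set seed 12345
    . generate noise = runiform()                // synthetic noise predictor
    . rforest case x1-x50 noise, type(class)     // fit a classification forest
    . matrix list e(importance)                  // inspect the importance ranking

Predictors whose importance falls at or below that of the synthetic noise variable would be discarded; the retained subset can then be passed to logistic regression, as the thesis suggests.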
Abstract:
The discrete-time Markov chain is commonly used to describe changes of health states for chronic diseases in a longitudinal study. Statistical inferences for comparing treatment effects or finding determinants of disease progression usually require estimation of transition probabilities. In many situations, when the outcome data have missing observations or the variable of interest (a latent variable) cannot be measured directly, the estimation of transition probabilities becomes more complicated. In the latter case, a surrogate variable that is easier to access and can gauge the characteristics of the latent one is usually used for data analysis.

This dissertation research proposes methods to analyze longitudinal data (1) that have a categorical outcome with missing observations or (2) that use complete or incomplete surrogate observations to analyze a categorical latent outcome. For (1), different missing-data mechanisms were considered in empirical studies using methods that include the EM algorithm, Monte Carlo EM, and a procedure that is not a data augmentation method. For (2), the hidden Markov model with the forward-backward procedure was applied for parameter estimation; this method was also extended to cover the computation of standard errors. The proposed methods were demonstrated on a schizophrenia example. The relevance to public health, the strengths and limitations, and possible future research are also discussed.
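For reference, the forward-backward recursions used in hidden Markov model estimation are standard (notation mine, not the dissertation's): with transition probabilities p_{ij} and emission probabilities b_j(s_t) for surrogate observation s_t,

    \alpha_t(j) = [ \sum_i \alpha_{t-1}(i) p_{ij} ] b_j(s_t),
    \beta_t(i)  = \sum_j p_{ij} b_j(s_{t+1}) \beta_{t+1}(j),

so the likelihood of an observation sequence of length T is \sum_j \alpha_T(j); maximizing it (for example, via EM/Baum-Welch) yields estimates of the transition probabilities p_{ij}.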
Abstract:
Objectives. This paper seeks to assess the effect of regression model misspecification on statistical power in a variety of situations.

Methods and results. The effect of misspecification in regression can be approximated by evaluating the correlation between the correct specification and the misspecification of the outcome variable (Harris 2010). In this paper, three misspecified models (linear, categorical, and fractional polynomial) were considered. In the first section, the mathematical method of calculating the correlation between correct and misspecified models with simple mathematical forms is derived and demonstrated. In the second section, data from the National Health and Nutrition Examination Survey (NHANES 2007-2008) were used to examine such correlations. Our study shows that, compared with linear or categorical models, the fractional polynomial models, with their higher correlations, provided a better approximation of the true relationship, as illustrated by LOESS regression. In the third section, we present the results of simulation studies demonstrating that misspecification in regression can produce marked decreases in power with small sample sizes. The categorical model had the greatest power, ranging from 0.877 to 0.936 depending on the sample size and outcome variable used. The power of the fractional polynomial model was close to that of the linear model, ranging from 0.69 to 0.83, and appeared to be affected by the increased degrees of freedom of this model.

Conclusion. Correlations between alternative model specifications can be used to provide a good approximation of the effect of misspecification on statistical power when the sample size is large. When model specifications have known simple mathematical forms, such correlations can be calculated mathematically. Actual public health data from NHANES 2007-2008 were used as examples to demonstrate situations with unknown or complex correct model specifications. Simulation of power for misspecified models confirmed the results based on the correlation methods and also illustrated the effect of model degrees of freedom on power.
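A hedged sketch of the logic behind the correlation approximation (my formulation, not necessarily Harris's): if the misspecified predictor g(x) and the correct specification f(x) are standardized and correlate at \rho, then the fitted slope and the noncentrality parameter \lambda of the test are attenuated approximately as

    \hat{\beta}_{mis} \approx \rho \beta,    \lambda_{mis} \approx \rho^2 \lambda,

so the power of the misspecified model can be approximated by plugging the reduced noncentrality parameter into the usual power calculation.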
Abstract:
The standard analyses of survival data involve the assumption that survival and censoring are independent. When censoring and survival are related, the phenomenon is known as informative censoring. This paper examines the effects of an informative censoring assumption on the hazard function and on the estimated hazard ratio provided by the Cox model.

The limiting factor in all analyses of informative censoring is the problem of non-identifiability. Non-identifiability implies that it is impossible to distinguish a situation in which censoring and death are independent from one in which there is dependence; nevertheless, it is possible that informative censoring occurs. Examination of the literature indicates how others have approached the problem and covers the relevant theoretical background.

Three models are examined in detail. The first model uses conditionally independent marginal hazards to obtain the unconditional survival function and hazards. The second model is based on the Gumbel Type A method for combining independent marginal distributions into bivariate distributions using a dependency parameter. Finally, a formulation based on a compartmental model is presented and its results described. For the latter two approaches, the resulting hazard is used in the Cox model in a simulation study.

The unconditional survival distribution formed from the first model involves dependency, but the crude hazard resulting from this unconditional distribution is identical to the marginal hazard, and inferences based on the hazard are valid. The hazard ratios formed from two distributions following the Gumbel Type A model are biased by a factor that depends on the amount of censoring in the two populations and on the strength of the dependency between death and censoring in the two populations; the Cox model estimates this biased hazard ratio. In general, the hazard resulting from the compartmental model is not constant, even if the individual marginal hazards are constant, unless censoring is non-informative, and the hazard ratio tends to a specific limit.

Methods of evaluating situations in which informative censoring is present are described, and the relative utility of the three models examined is discussed.
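For orientation, one standard way of writing a Gumbel Type A-style construction for survival time t_1 and censoring time t_2 (my reading of the terminology, not necessarily the dissertation's notation) is

    S(t_1, t_2) = S_1(t_1) S_2(t_2) [ 1 + \alpha (1 - S_1(t_1)) (1 - S_2(t_2)) ],    -1 <= \alpha <= 1,

where S_1 and S_2 are the marginal survival functions of death and censoring and \alpha is the dependency parameter; \alpha = 0 recovers independence.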
Abstract:
A Bayesian approach to estimating the regression coefficients of a multinomial logit model with ordinal-scale response categories is presented. A Monte Carlo method is used to construct the posterior distribution of the link function, which is treated as an arbitrary scalar function. The Gauss-Markov theorem is then used to determine a function of the link that produces a random vector of coefficients, and the posterior distribution of this random vector is used to estimate the regression coefficients. The method described is referred to as a Bayesian generalized least squares (BGLS) analysis. Two cases involving multinomial logit models are described: Case I involves a cumulative logit model and Case II involves a proportional-odds model. All inferences about the coefficients for both cases are described in terms of the posterior distribution of the regression coefficients. The results from the BGLS method are compared with maximum likelihood estimates of the regression coefficients. The BGLS method avoids the nonlinear problems encountered when estimating the regression coefficients of a generalized linear model; it is neither complex nor computationally intensive, and it offers several advantages over other Bayesian approaches.
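For reference, the standard form of the cumulative logit (proportional-odds) model for an ordinal response Y with categories 1, ..., J and covariate vector x is

    logit P(Y <= j | x) = \alpha_j + x'\beta,    j = 1, ..., J-1,

with category-specific intercepts \alpha_j and a common slope vector \beta; the proportional-odds assumption is precisely that \beta does not depend on j (this display is standard notation, added for orientation).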
Abstract:
Four basic medical decision-making models are commonly discussed in the literature in reference to physician-patient interactions. All fall short in their attempt to capture the nuances of physician-patient interactions, and none satisfactorily addresses patients' preferences for communication and other attributes of care. Prostate cancer consultations are one setting where preferences matter and are likely to vary among patients. Fortunately, discrete choice experiments (DCEs) are capable of casting light on patients' preferences for communication and the other valued attributes that make up a consultation before the consultation occurs, which is crucial if patients are to derive the most utility from the process of reaching a decision as well as from the decision itself. The results of my dissertation provide strong support for the notion that patients, at least in the hypothetical setting of a DCE, have identifiable preferences for the attributes of a prostate cancer consultation and that those preferences can be elicited before a consultation takes place. Further, patients' willingness-to-pay for the non-cost attributes of the consultation is surprisingly robust to a variety of individual-level variables of interest.
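The willingness-to-pay (WTP) estimates mentioned follow the usual discrete-choice logic (standard notation, added for orientation): if the utility of a consultation profile is linear in its attributes x_1, ..., x_K and in cost,

    V = \beta_1 x_1 + ... + \beta_K x_K + \beta_c cost,

then the marginal willingness-to-pay for attribute k is the ratio of coefficients

    WTP_k = -\beta_k / \beta_c.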
Abstract:
The infant mortality rate (IMR) is considered one of the most important indices of a country's well-being. Countries around the world and health organizations such as the World Health Organization are dedicating their resources, knowledge, and energy to reducing infant mortality rates. The well-known Millennium Development Goal 4 (MDG 4), which aims to achieve a two-thirds reduction in the under-five mortality rate between 1990 and 2015, is an example of this commitment.

In this study, our goal is to model the trends in IMR from the 1950s to the 2010s for selected countries. We would like to know how the IMR changes over time and how it differs across countries.

IMR data collected over time form a time series, and the repeated observations in an IMR time series are not statistically independent, so in modeling the trend of IMR it is necessary to account for these correlations. We propose to use the generalized least squares method in the general linear model setting to deal with the variance-covariance structure of the data. To estimate the variance-covariance matrix, we turn to time-series models, especially autoregressive and moving-average models. Furthermore, we compare results from the general linear model with a correlation structure to those from the ordinary least squares method, which ignores the correlation structure, to check how much the estimates change.
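Concretely (standard notation, not the thesis's own display): the generalized least squares estimator is

    \hat{\beta}_{GLS} = (X' \Omega^{-1} X)^{-1} X' \Omega^{-1} y,

where, for an AR(1) error process with autocorrelation \rho, the covariance matrix has entries \Omega_{st} = \sigma^2 \rho^{|s-t|}. Setting \Omega = \sigma^2 I recovers ordinary least squares, which ignores the serial correlation.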