29 resultados para Regression imputation
em DigitalCommons@The Texas Medical Center
Resumo:
The purpose of this study is to investigate the effects of predictor variable correlations and patterns of missingness with dichotomous and/or continuous data in small samples when missing data is multiply imputed. Missing data of predictor variables is multiply imputed under three different multivariate models: the multivariate normal model for continuous data, the multinomial model for dichotomous data and the general location model for mixed dichotomous and continuous data. Subsequent to the multiple imputation process, Type I error rates of the regression coefficients obtained with logistic regression analysis are estimated under various conditions of correlation structure, sample size, type of data and patterns of missing data. The distributional properties of average mean, variance and correlations among the predictor variables are assessed after the multiple imputation process. ^ For continuous predictor data under the multivariate normal model, Type I error rates are generally within the nominal values with samples of size n = 100. Smaller samples of size n = 50 resulted in more conservative estimates (i.e., lower than the nominal value). Correlation and variance estimates of the original data are retained after multiple imputation with less than 50% missing continuous predictor data. For dichotomous predictor data under the multinomial model, Type I error rates are generally conservative, which in part is due to the sparseness of the data. The correlation structure for the predictor variables is not well retained on multiply-imputed data from small samples with more than 50% missing data with this model. For mixed continuous and dichotomous predictor data, the results are similar to those found under the multivariate normal model for continuous data and under the multinomial model for dichotomous data. With all data types, a fully-observed variable included with variables subject to missingness in the multiple imputation process and subsequent statistical analysis provided liberal (larger than nominal values) Type I error rates under a specific pattern of missing data. It is suggested that future studies focus on the effects of multiple imputation in multivariate settings with more realistic data characteristics and a variety of multivariate analyses, assessing both Type I error and power. ^
Resumo:
Logistic regression is one of the most important tools in the analysis of epidemiological and clinical data. Such data often contain missing values for one or more variables. Common practice is to eliminate all individuals for whom any information is missing. This deletion approach does not make efficient use of available information and often introduces bias.^ Two methods were developed to estimate logistic regression coefficients for mixed dichotomous and continuous covariates including partially observed binary covariates. The data were assumed missing at random (MAR). One method (PD) used predictive distribution as weight to calculate the average of the logistic regressions performing on all possible values of missing observations, and the second method (RS) used a variant of resampling technique. Additional seven methods were compared with these two approaches in a simulation study. They are: (1) Analysis based on only the complete cases, (2) Substituting the mean of the observed values for the missing value, (3) An imputation technique based on the proportions of observed data, (4) Regressing the partially observed covariates on the remaining continuous covariates, (5) Regressing the partially observed covariates on the remaining continuous covariates conditional on response variable, (6) Regressing the partially observed covariates on the remaining continuous covariates and response variable, and (7) EM algorithm. Both proposed methods showed smaller standard errors (s.e.) for the coefficient involving the partially observed covariate and for the other coefficients as well. However, both methods, especially PD, are computationally demanding; thus for analysis of large data sets with partially observed covariates, further refinement of these approaches is needed. ^
Resumo:
This study investigates the degree to which gender, ethnicity, relationship to perpetrator, and geomapped socio-economic factors significantly predict the incidence of childhood sexual abuse, physical abuse and non- abuse. These variables are then linked to geographic identifiers using geographic information system (GIS) technology to develop a geo-mapping framework for child sexual and physical abuse prevention.
Resumo:
BACKGROUND: Obesity is a systemic disorder associated with an increase in left ventricular mass and premature death and disability from cardiovascular disease. Although bariatric surgery reverses many of the hormonal and hemodynamic derangements, the long-term collective effects on body composition and left ventricular mass have not been considered before. We hypothesized that the decrease in fat mass and lean mass after weight loss surgery is associated with a decrease in left ventricular mass. METHODS: Fifteen severely obese women (mean body mass index [BMI]: 46.7+/-1.7 kg/m(2)) with medically controlled hypertension underwent bariatric surgery. Left ventricular mass and plasma markers of systemic metabolism, together with body mass index (BMI), waist and hip circumferences, body composition (fat mass and lean mass), and resting energy expenditure were measured at 0, 3, 9, 12, and 24 months. RESULTS: Left ventricular mass continued to decrease linearly over the entire period of observation, while rates of weight loss, loss of lean mass, loss of fat mass, and resting energy expenditure all plateaued at 9 [corrected] months (P <.001 for all). Parameters of systemic metabolism normalized by 9 months, and showed no further change at 24 months after surgery. CONCLUSIONS: Even though parameters of obesity, including BMI and body composition, plateau, the benefits of bariatric surgery on systemic metabolism and left ventricular mass are sustained. We propose that the progressive decrease of left ventricular mass after weight loss surgery is regulated by neurohumoral factors, and may contribute to improved long-term survival.
Resumo:
The adult male golden hamster, when exposed to blinding (BL), short photoperiod (SP), or daily melatonin injections (MEL) demonstrates dramatic reproductive collapse. This collapse can be blocked by removal of the pineal gland prior to treatment. Reproductive collapse is characterized by a dramatic decrease in both testicular weight and serum gonadotropin titers. The present study was designed to examine the interactions of the hypothalamus and pituitary gland during testicular regression, and to specifically compare and contrast changes caused by the three commonly employed methods of inducing testicular regression (BL,SP,MEL). Hypothalamic LHRH content was altered by all three treatments. There was an initial increase in content of LHRH that occurred concomitantly with the decreased serum gonadotropin titers, followed by a precipitous decline in LHRH content which reflected the rapid increases in both serum LH and FSH which occur during spontaneous testicular recrudescence. In vitro pituitary responsiveness was altered by all three treatments: there was a decline in basal and maximally stimulatable release of both LH and FSH which paralleled the fall of serum gonadotropins. During recrudescence both basal and maximal release dramatically increased in a manner comparable to serum hormone levels. While all three treatments were equally effective in their ability to induce changes at all levels of the endocrine system, there were important temporal differences in the effects of the various treatments. Melatonin injections induced the most rapid changes in endocrine parameters, followed by exposure to short photoperiod. Blinding required the most time to induce the same changes. This study has demonstrated that pineal-mediated testicular regression is a process which involves dynamic changes in multiply-dependent endocrine relationships, and proper evaluation of these changes must be performed with specific temporal events in mind. ^
Resumo:
The ordinal logistic regression models are used to analyze the dependant variable with multiple outcomes that can be ranked, but have been underutilized. In this study, we describe four logistic regression models for analyzing the ordinal response variable. ^ In this methodological study, the four regression models are proposed. The first model uses the multinomial logistic model. The second is adjacent-category logit model. The third is the proportional odds model and the fourth model is the continuation-ratio model. We illustrate and compare the fit of these models using data from the survey designed by the University of Texas, School of Public Health research project PCCaSO (Promoting Colon Cancer Screening in people 50 and Over), to study the patient’s confidence in the completion colorectal cancer screening (CRCS). ^ The purpose of this study is two fold: first, to provide a synthesized review of models for analyzing data with ordinal response, and second, to evaluate their usefulness in epidemiological research, with particular emphasis on model formulation, interpretation of model coefficients, and their implications. Four ordinal logistic models that are used in this study include (1) Multinomial logistic model, (2) Adjacent-category logistic model [9], (3) Continuation-ratio logistic model [10], (4) Proportional logistic model [11]. We recommend that the analyst performs (1) goodness-of-fit tests, (2) sensitivity analysis by fitting and comparing different models.^
Resumo:
Ordinal outcomes are frequently employed in diagnosis and clinical trials. Clinical trials of Alzheimer's disease (AD) treatments are a case in point using the status of mild, moderate or severe disease as outcome measures. As in many other outcome oriented studies, the disease status may be misclassified. This study estimates the extent of misclassification in an ordinal outcome such as disease status. Also, this study estimates the extent of misclassification of a predictor variable such as genotype status. An ordinal logistic regression model is commonly used to model the relationship between disease status, the effect of treatment, and other predictive factors. A simulation study was done. First, data based on a set of hypothetical parameters and hypothetical rates of misclassification was created. Next, the maximum likelihood method was employed to generate likelihood equations accounting for misclassification. The Nelder-Mead Simplex method was used to solve for the misclassification and model parameters. Finally, this method was applied to an AD dataset to detect the amount of misclassification present. The estimates of the ordinal regression model parameters were close to the hypothetical parameters. β1 was hypothesized at 0.50 and the mean estimate was 0.488, β2 was hypothesized at 0.04 and the mean of the estimates was 0.04. Although the estimates for the rates of misclassification of X1 were not as close as β1 and β2, they validate this method. X 1 0-1 misclassification was hypothesized as 2.98% and the mean of the simulated estimates was 1.54% and, in the best case, the misclassification of k from high to medium was hypothesized at 4.87% and had a sample mean of 3.62%. In the AD dataset, the estimate for the odds ratio of X 1 of having both copies of the APOE 4 allele changed from an estimate of 1.377 to an estimate 1.418, demonstrating that the estimates of the odds ratio changed when the analysis includes adjustment for misclassification. ^
Resumo:
The purpose of this dissertation was to estimate HIV incidence among the individuals who had HIV tests performed at the Houston Department of Health and Human Services (HDHHS) public health laboratory, and to examine the prevalence of HIV and AIDS concurrent diagnoses among HIV cases reported between 2000 and 2007 in Houston/Harris County. ^ The first study in this dissertation estimated the cumulative HIV incidence among the individuals testing at Houston public health laboratory using Serologic Testing Algorithms for Recent HIV Seroconversion (STARHS) during the two year study period (June 1, 2005 to May 31, 2007). The HIV incidence was estimated using two independently developed statistical imputation methods, one developed by the Centers for Disease Control and Prevention (CDC), and the other developed by HDHHS. Among the 54,394 persons who tested for HIV during the study period, 942 tested HIV positive (positivity rate=1.7%). Of these HIV positives, 448 (48%) were newly reported to the Houston HIV/AIDS Reporting System (HARS) and 417 of these 448 blood specimens (93%) were available for STARHS testing. The STARHS results showed 139 (33%) out of the 417 specimens were newly infected with HIV. Using both the CDC and HDHHS methods, the estimated cumulative HIV incidences over the two-year study period were similar: 862 per 100,000 persons (95% CI: 655-1,070) by CDC method, and 925 per 100,000 persons (95% CI: 908-943) by HDHHS method. Consistent with the national finding, this study found African Americans, and men who have sex with men (MSM) accounted for most of the new HIV infections among the individuals testing at Houston public health laboratory. Using CDC statistical method, this study also found the highest cumulative HIV incidence (2,176 per 100,000 persons [95%CI: 1,536-2,798]) was among those who tested in the HIV counseling and testing sites, compared to the sexually transmitted disease clinics (1,242 per 100,000 persons [95%CI: 871-1,608]) and city health clinics (215 per 100,000 persons [95%CI: 80-353]. This finding suggested the HIV counseling and testing sites in Houston were successful in reaching high risk populations and testing them early for HIV. In addition, older age groups had higher cumulative HIV incidence, but accounted for smaller proportions of new HIV infections. The incidence in the 30-39 age group (994 per 100,000 persons [95%CI: 625-1,363]) was 1.5 times the incidence in 13-29 age group (645 per 100,000 persons [95%CI: 447-840]); the incidences in 40-49 age group (1,371 per 100,000 persons [95%CI: 765-1,977]) and 50 or above age groups (1,369 per 100,000 persons [95%CI: 318-2,415]) were 2.1 times compared to the youngest 13-29 age group. The increased HIV incidence in older age groups suggested that persons 40 or above were still at risk to contract HIV infections. HIV prevention programs should encourage more people who are age 40 and above to test for HIV. ^ The second study investigated concurrent diagnoses of HIV and AIDS in Houston. Concurrent HIV/AIDS diagnosis is defined as AIDS diagnosis within three months of HIV diagnosis. This study found about one-third of the HIV cases were diagnosed with HIV and AIDS concurrently (within three months) in Houston/Harris County. Using multivariable logistic regression analysis, this study found being male, Hispanic, older, and diagnosed in the private sector of care were positively associated with concurrent HIV and AIDS diagnoses. By contrast, men who had sex with men and also used injection drugs (MSM/IDU) were 0.64 times (95% CI: 0.44-0.93) less likely to have concurrent HIV and AIDS diagnoses. A sensitivity analysis comparing difference durations of elapsed time for concurrent HIV and AIDS diagnosis definitions (1-month, 3-month, and 12-month cut-offs) affected the effect size of the odds ratios, but not the direction. ^ The results of these two studies, one describing characteristics of the individuals who were newly infected with HIV, and the other study describing persons who were diagnosed with HIV and AIDS concurrently, can be used as a reference for HIV prevention program planning in Houston/Harris County. ^
Resumo:
Objectives. This paper seeks to assess the effect on statistical power of regression model misspecification in a variety of situations. ^ Methods and results. The effect of misspecification in regression can be approximated by evaluating the correlation between the correct specification and the misspecification of the outcome variable (Harris 2010).In this paper, three misspecified models (linear, categorical and fractional polynomial) were considered. In the first section, the mathematical method of calculating the correlation between correct and misspecified models with simple mathematical forms was derived and demonstrated. In the second section, data from the National Health and Nutrition Examination Survey (NHANES 2007-2008) were used to examine such correlations. Our study shows that comparing to linear or categorical models, the fractional polynomial models, with the higher correlations, provided a better approximation of the true relationship, which was illustrated by LOESS regression. In the third section, we present the results of simulation studies that demonstrate overall misspecification in regression can produce marked decreases in power with small sample sizes. However, the categorical model had greatest power, ranging from 0.877 to 0.936 depending on sample size and outcome variable used. The power of fractional polynomial model was close to that of linear model, which ranged from 0.69 to 0.83, and appeared to be affected by the increased degrees of freedom of this model.^ Conclusion. Correlations between alternative model specifications can be used to provide a good approximation of the effect on statistical power of misspecification when the sample size is large. When model specifications have known simple mathematical forms, such correlations can be calculated mathematically. Actual public health data from NHANES 2007-2008 were used as examples to demonstrate the situations with unknown or complex correct model specification. Simulation of power for misspecified models confirmed the results based on correlation methods but also illustrated the effect of model degrees of freedom on power.^
Resumo:
The standard analyses of survival data involve the assumption that survival and censoring are independent. When censoring and survival are related, the phenomenon is known as informative censoring. This paper examines the effects of an informative censoring assumption on the hazard function and the estimated hazard ratio provided by the Cox model.^ The limiting factor in all analyses of informative censoring is the problem of non-identifiability. Non-identifiability implies that it is impossible to distinguish a situation in which censoring and death are independent from one in which there is dependence. However, it is possible that informative censoring occurs. Examination of the literature indicates how others have approached the problem and covers the relevant theoretical background.^ Three models are examined in detail. The first model uses conditionally independent marginal hazards to obtain the unconditional survival function and hazards. The second model is based on the Gumbel Type A method for combining independent marginal distributions into bivariate distributions using a dependency parameter. Finally, a formulation based on a compartmental model is presented and its results described. For the latter two approaches, the resulting hazard is used in the Cox model in a simulation study.^ The unconditional survival distribution formed from the first model involves dependency, but the crude hazard resulting from this unconditional distribution is identical to the marginal hazard, and inferences based on the hazard are valid. The hazard ratios formed from two distributions following the Gumbel Type A model are biased by a factor dependent on the amount of censoring in the two populations and the strength of the dependency of death and censoring in the two populations. The Cox model estimates this biased hazard ratio. In general, the hazard resulting from the compartmental model is not constant, even if the individual marginal hazards are constant, unless censoring is non-informative. The hazard ratio tends to a specific limit.^ Methods of evaluating situations in which informative censoring is present are described, and the relative utility of the three models examined is discussed. ^
Resumo:
Strategies are compared for the development of a linear regression model with stochastic (multivariate normal) regressor variables and the subsequent assessment of its predictive ability. Bias and mean squared error of four estimators of predictive performance are evaluated in simulated samples of 32 population correlation matrices. Models including all of the available predictors are compared with those obtained using selected subsets. The subset selection procedures investigated include two stopping rules, C$\sb{\rm p}$ and S$\sb{\rm p}$, each combined with an 'all possible subsets' or 'forward selection' of variables. The estimators of performance utilized include parametric (MSEP$\sb{\rm m}$) and non-parametric (PRESS) assessments in the entire sample, and two data splitting estimates restricted to a random or balanced (Snee's DUPLEX) 'validation' half sample. The simulations were performed as a designed experiment, with population correlation matrices representing a broad range of data structures.^ The techniques examined for subset selection do not generally result in improved predictions relative to the full model. Approaches using 'forward selection' result in slightly smaller prediction errors and less biased estimators of predictive accuracy than 'all possible subsets' approaches but no differences are detected between the performances of C$\sb{\rm p}$ and S$\sb{\rm p}$. In every case, prediction errors of models obtained by subset selection in either of the half splits exceed those obtained using all predictors and the entire sample.^ Only the random split estimator is conditionally (on $\\beta$) unbiased, however MSEP$\sb{\rm m}$ is unbiased on average and PRESS is nearly so in unselected (fixed form) models. When subset selection techniques are used, MSEP$\sb{\rm m}$ and PRESS always underestimate prediction errors, by as much as 27 percent (on average) in small samples. Despite their bias, the mean squared errors (MSE) of these estimators are at least 30 percent less than that of the unbiased random split estimator. The DUPLEX split estimator suffers from large MSE as well as bias, and seems of little value within the context of stochastic regressor variables.^ To maximize predictive accuracy while retaining a reliable estimate of that accuracy, it is recommended that the entire sample be used for model development, and a leave-one-out statistic (e.g. PRESS) be used for assessment. ^
Resumo:
This dissertation develops and explores the methodology for the use of cubic spline functions in assessing time-by-covariate interactions in Cox proportional hazards regression models. These interactions indicate violations of the proportional hazards assumption of the Cox model. Use of cubic spline functions allows for the investigation of the shape of a possible covariate time-dependence without having to specify a particular functional form. Cubic spline functions yield both a graphical method and a formal test for the proportional hazards assumption as well as a test of the nonlinearity of the time-by-covariate interaction. Five existing methods for assessing violations of the proportional hazards assumption are reviewed and applied along with cubic splines to three well known two-sample datasets. An additional dataset with three covariates is used to explore the use of cubic spline functions in a more general setting. ^
Resumo:
A Bayesian approach to estimation of the regression coefficients of a multinominal logit model with ordinal scale response categories is presented. A Monte Carlo method is used to construct the posterior distribution of the link function. The link function is treated as an arbitrary scalar function. Then the Gauss-Markov theorem is used to determine a function of the link which produces a random vector of coefficients. The posterior distribution of the random vector of coefficients is used to estimate the regression coefficients. The method described is referred to as a Bayesian generalized least square (BGLS) analysis. Two cases involving multinominal logit models are described. Case I involves a cumulative logit model and Case II involves a proportional-odds model. All inferences about the coefficients for both cases are described in terms of the posterior distribution of the regression coefficients. The results from the BGLS method are compared to maximum likelihood estimates of the regression coefficients. The BGLS method avoids the nonlinear problems encountered when estimating the regression coefficients of a generalized linear model. The method is not complex or computationally intensive. The BGLS method offers several advantages over Bayesian approaches. ^
Resumo:
A large number of ridge regression estimators have been proposed and used with little knowledge of their true distributions. Because of this lack of knowledge, these estimators cannot be used to test hypotheses or to form confidence intervals.^ This paper presents a basic technique for deriving the exact distribution functions for a class of generalized ridge estimators. The technique is applied to five prominent generalized ridge estimators. Graphs of the resulting distribution functions are presented. The actual behavior of these estimators is found to be considerably different than the behavior which is generally assumed for ridge estimators.^ This paper also uses the derived distributions to examine the mean squared error properties of the estimators. A technique for developing confidence intervals based on the generalized ridge estimators is also presented. ^
Resumo:
The history of the logistic function since its introduction in 1838 is reviewed, and the logistic model for a polychotomous response variable is presented with a discussion of the assumptions involved in its derivation and use. Following this, the maximum likelihood estimators for the model parameters are derived along with a Newton-Raphson iterative procedure for evaluation. A rigorous mathematical derivation of the limiting distribution of the maximum likelihood estimators is then presented using a characteristic function approach. An appendix with theorems on the asymptotic normality of sample sums when the observations are not identically distributed, with proofs, supports the presentation on asymptotic properties of the maximum likelihood estimators. Finally, two applications of the model are presented using data from the Hypertension Detection and Follow-up Program, a prospective, population-based, randomized trial of treatment for hypertension. The first application compares the risk of five-year mortality from cardiovascular causes with that from noncardiovascular causes; the second application compares risk factors for fatal or nonfatal coronary heart disease with those for fatal or nonfatal stroke. ^