8 results for imputation
in DigitalCommons@The Texas Medical Center
Abstract:
The purpose of this study is to investigate the effects of predictor variable correlations and patterns of missingness with dichotomous and/or continuous data in small samples when missing data are multiply imputed. Missing data on predictor variables are multiply imputed under three different multivariate models: the multivariate normal model for continuous data, the multinomial model for dichotomous data, and the general location model for mixed dichotomous and continuous data. Subsequent to the multiple imputation process, Type I error rates of the regression coefficients obtained with logistic regression analysis are estimated under various conditions of correlation structure, sample size, type of data, and pattern of missing data. The distributional properties of the average mean, variance, and correlations among the predictor variables are assessed after the multiple imputation process.

For continuous predictor data under the multivariate normal model, Type I error rates are generally within the nominal values with samples of size n = 100. Smaller samples of size n = 50 result in more conservative estimates (i.e., lower than the nominal value). Correlation and variance estimates of the original data are retained after multiple imputation with less than 50% missing continuous predictor data. For dichotomous predictor data under the multinomial model, Type I error rates are generally conservative, which is due in part to the sparseness of the data. The correlation structure of the predictor variables is not well retained in multiply imputed data from small samples with more than 50% missing data under this model. For mixed continuous and dichotomous predictor data, the results are similar to those found under the multivariate normal model for continuous data and under the multinomial model for dichotomous data. With all data types, a fully observed variable included with variables subject to missingness in the multiple imputation process and subsequent statistical analysis produced liberal (larger than nominal) Type I error rates under a specific pattern of missing data. It is suggested that future studies focus on the effects of multiple imputation in multivariate settings with more realistic data characteristics and a variety of multivariate analyses, assessing both Type I error and power.
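To make the simulation procedure described above concrete, the following is a minimal sketch of a single simulation cell in Python, assuming a multivariate normal imputation model approximated by chained equations (scikit-learn's IterativeImputer with posterior sampling), MCAR missingness in one continuous predictor, and Rubin's rules for pooling. The sample size, missingness rate, correlation, and number of imputations are illustrative rather than the study's actual settings.

# Sketch: Type I error of a logistic-regression coefficient after multiple
# imputation of a continuous predictor with MCAR missingness (illustrative).
import numpy as np
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n, n_sims, m, miss_rate = 100, 200, 5, 0.3   # illustrative settings

def one_replicate():
    # Two correlated continuous predictors; x1 has no true effect (beta1 = 0).
    cov = [[1.0, 0.4], [0.4, 1.0]]
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.0 * X[:, 0] + 0.5 * X[:, 1]))))
    X_miss = X.copy()
    X_miss[rng.random(n) < miss_rate, 0] = np.nan        # MCAR in x1 only

    ests, variances = [], []
    for k in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=k)
        # The outcome is included in the imputation model, as is standard.
        X_imp = imp.fit_transform(np.column_stack([X_miss, y]))[:, :2]
        fit = sm.Logit(y, sm.add_constant(X_imp)).fit(disp=0)
        ests.append(fit.params[1])
        variances.append(fit.bse[1] ** 2)

    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    qbar, ubar = np.mean(ests), np.mean(variances)
    b = np.var(ests, ddof=1)
    t = ubar + (1.0 + 1.0 / m) * b
    return abs(qbar / np.sqrt(t)) > 1.96                 # reject H0: beta1 = 0?

type1 = np.mean([one_replicate() for _ in range(n_sims)])
print(f"Estimated Type I error: {type1:.3f}")

The rejection rate of the pooled Wald test for the null coefficient estimates the Type I error for that cell; using the normal critical value 1.96 is a simplification of the usual t reference with Barnard-Rubin degrees of freedom.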
Abstract:
Objective: In this secondary data analysis, three statistical methodologies were implemented to handle cases with missing data in a motivational interviewing and feedback study. The aim was to evaluate the impact that these methodologies have on the data analysis.

Methods: We first evaluated whether the assumption of missing completely at random held for this study. We then conducted a secondary data analysis using a mixed linear model to handle missing data with three methodologies: (a) complete case analysis, (b) multiple imputation with an explicit model containing the outcome variables, time, and the interaction of time and treatment, and (c) multiple imputation with an explicit model containing the outcome variables, time, the interaction of time and treatment, and additional covariates (e.g., age, gender, smoking, years in school, marital status, housing, race/ethnicity, and whether participants played on an athletic team). Several comparisons were conducted, including: 1) the motivational interviewing with feedback group (MIF) vs. the assessment only group (AO), the motivational interviewing only group (MIO) vs. AO, and the feedback only group (FBO) vs. AO; 2) MIF vs. FBO; and 3) MIF vs. MIO.

Results: We first evaluated the patterns of missingness in this study: about 13% of participants showed monotone missing patterns and about 3.5% showed non-monotone missing patterns. We then evaluated the assumption of missing completely at random with Little's missing completely at random (MCAR) test; the chi-square test statistic was 167.8 with 125 degrees of freedom (p=0.006), indicating that the data could not be assumed to be missing completely at random. We then compared whether the three strategies reached the same results. For the comparison between MIF and AO, as well as the comparison between MIF and FBO, only the multiple imputation with additional covariates, under uncongenial and congenial models, reached different results. For the comparison between MIF and MIO, all the methodologies for handling missing values reached different results.

Discussion: The study indicated, first, that missingness was crucial in this study. Second, understanding the model assumptions was important, since we could not determine whether the data were missing at random or missing not at random. Therefore, future research should focus on sensitivity analyses under the missing not at random assumption.
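The sketch below illustrates the multiple imputation plus mixed linear model workflow described above, assuming hypothetical, numerically coded column names (outcome, time, treatment, subject, and additional covariates): chained-equations imputation via statsmodels' MICEData, a MixedLM fit per imputed dataset, and Rubin's-rules pooling of one coefficient. Neither the study's dataset nor its exact imputation models are reproduced here.

# Sketch: multiple imputation + mixed linear model with Rubin's-rules pooling.
import numpy as np
import statsmodels.api as sm
from statsmodels.imputation.mice import MICEData

def pooled_mixed_model(df, formula, group_col, term, m=10):
    """Impute m times, fit a mixed linear model each time, pool one coefficient."""
    ests, variances = [], []
    # Covariates assumed numerically coded; group ids assumed fully observed.
    imp = MICEData(df.drop(columns=[group_col]))
    for _ in range(m):
        imp.update_all()                        # one chained-equations cycle
        data = imp.data.copy()                  # (add burn-in cycles in practice)
        data[group_col] = df[group_col].values
        fit = sm.MixedLM.from_formula(formula, data, groups=data[group_col]).fit()
        ests.append(fit.params[term])
        variances.append(fit.bse[term] ** 2)
    qbar, ubar = np.mean(ests), np.mean(variances)
    b = np.var(ests, ddof=1)
    return qbar, np.sqrt(ubar + (1.0 + 1.0 / m) * b)   # pooled estimate, pooled SE

# Hypothetical usage for the congenial model (outcome, time, treatment only):
# est, se = pooled_mixed_model(study_df, "outcome ~ time * treatment",
#                              group_col="subject", term="time:treatment")

Running the same pooling with and without the additional covariates in the imputation data frame is one way to contrast the congenial and uncongenial specifications compared in the study.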
Abstract:
Most statistical analysis, in both theory and practice, is concerned with static models: models with a proposed set of parameters whose values are fixed across observational units. Static models implicitly assume that the quantified relationships remain the same across the design space of the data. While this is reasonable under many circumstances, it can be a dangerous assumption when dealing with sequentially ordered data. The mere passage of time always brings fresh considerations, and the interrelationships among parameters, or subsets of parameters, may need to be continually revised.

When data are gathered sequentially, dynamic interim monitoring may be useful as new subject-specific parameters are introduced with each new observational unit. Sequential imputation via dynamic hierarchical models is an efficient strategy for handling missing data and analyzing longitudinal studies. Dynamic conditional independence models offer a flexible framework that exploits the Bayesian updating scheme for capturing the evolution of both population and individual effects over time. While static models often describe aggregate information well, they frequently do not reflect conflicts in the information at the individual level. Dynamic models prove advantageous over static models in capturing both individual and aggregate trends. Computations for such models can be carried out via the Gibbs sampler. An application to small-sample, repeated-measures, normally distributed growth curve data is presented.
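As an illustration of the kind of Gibbs sampler such dynamic models rely on, here is a minimal sketch for a single subject's growth curve under a local-level (random-walk) model with conjugate inverse-gamma priors on the observation and evolution variances. The hierarchical, multi-subject structure of the dissertation is not reproduced; priors and data are illustrative.

# Sketch: Gibbs sampler for y_t = theta_t + v_t, theta_t = theta_{t-1} + w_t.
import numpy as np

rng = np.random.default_rng(1)

def gibbs_local_level(y, n_iter=2000, a=2.0, b=1.0):
    """Single-site Gibbs updates of the states plus conjugate variance updates."""
    T = len(y)
    theta = y.astype(float).copy()          # initialize states at the observations
    sig2, tau2 = 1.0, 1.0
    draws = []
    for _ in range(n_iter):
        # Each state from its normal full conditional (diffuse prior on theta_0).
        for t in range(T):
            prec = 1.0 / sig2
            num = y[t] / sig2
            if t > 0:
                prec += 1.0 / tau2
                num += theta[t - 1] / tau2
            if t < T - 1:
                prec += 1.0 / tau2
                num += theta[t + 1] / tau2
            theta[t] = rng.normal(num / prec, np.sqrt(1.0 / prec))
        # Inverse-gamma full conditionals for the two variances.
        resid = y - theta
        sig2 = 1.0 / rng.gamma(a + T / 2.0, 1.0 / (b + 0.5 * np.sum(resid ** 2)))
        diffs = np.diff(theta)
        tau2 = 1.0 / rng.gamma(a + (T - 1) / 2.0, 1.0 / (b + 0.5 * np.sum(diffs ** 2)))
        draws.append((theta.copy(), sig2, tau2))
    return draws

# Hypothetical usage on one subject's repeated measurements:
# draws = gibbs_local_level(np.array([3.1, 3.4, 3.9, 4.2, 4.8]))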
Abstract:
The purpose of this dissertation was to estimate HIV incidence among the individuals who had HIV tests performed at the Houston Department of Health and Human Services (HDHHS) public health laboratory, and to examine the prevalence of concurrent HIV and AIDS diagnoses among HIV cases reported between 2000 and 2007 in Houston/Harris County.

The first study estimated the cumulative HIV incidence among individuals testing at the Houston public health laboratory using the Serologic Testing Algorithm for Recent HIV Seroconversion (STARHS) during the two-year study period (June 1, 2005 to May 31, 2007). The HIV incidence was estimated using two independently developed statistical imputation methods, one developed by the Centers for Disease Control and Prevention (CDC) and the other by HDHHS. Among the 54,394 persons tested for HIV during the study period, 942 tested HIV positive (positivity rate = 1.7%). Of these HIV positives, 448 (48%) were newly reported to the Houston HIV/AIDS Reporting System (HARS), and 417 of these 448 blood specimens (93%) were available for STARHS testing. The STARHS results classified 139 (33%) of the 417 specimens as new HIV infections. Using both the CDC and HDHHS methods, the estimated cumulative HIV incidences over the two-year study period were similar: 862 per 100,000 persons (95% CI: 655-1,070) by the CDC method and 925 per 100,000 persons (95% CI: 908-943) by the HDHHS method. Consistent with national findings, this study found that African Americans and men who have sex with men (MSM) accounted for most of the new HIV infections among individuals testing at the Houston public health laboratory. Using the CDC statistical method, this study also found the highest cumulative HIV incidence (2,176 per 100,000 persons [95% CI: 1,536-2,798]) among those tested at HIV counseling and testing sites, compared to sexually transmitted disease clinics (1,242 per 100,000 persons [95% CI: 871-1,608]) and city health clinics (215 per 100,000 persons [95% CI: 80-353]). This finding suggested that the HIV counseling and testing sites in Houston were successful in reaching high-risk populations and testing them early for HIV. In addition, older age groups had higher cumulative HIV incidence but accounted for smaller proportions of new HIV infections. The incidence in the 30-39 age group (994 per 100,000 persons [95% CI: 625-1,363]) was 1.5 times the incidence in the 13-29 age group (645 per 100,000 persons [95% CI: 447-840]); the incidences in the 40-49 age group (1,371 per 100,000 persons [95% CI: 765-1,977]) and the 50-or-above age group (1,369 per 100,000 persons [95% CI: 318-2,415]) were 2.1 times the incidence in the youngest 13-29 age group. The increased HIV incidence in older age groups suggested that persons 40 or above were still at risk of contracting HIV infection. HIV prevention programs should encourage more people age 40 and above to test for HIV.

The second study investigated concurrent diagnoses of HIV and AIDS in Houston. A concurrent HIV/AIDS diagnosis is defined as an AIDS diagnosis within three months of the HIV diagnosis. This study found that about one-third of HIV cases in Houston/Harris County were diagnosed with HIV and AIDS concurrently (within three months). Using multivariable logistic regression analysis, this study found that being male, Hispanic, older, and diagnosed in the private sector of care were positively associated with concurrent HIV and AIDS diagnoses.
By contrast, men who had sex with men and also used injection drugs (MSM/IDU) were less likely to have concurrent HIV and AIDS diagnoses (OR=0.64; 95% CI: 0.44-0.93). A sensitivity analysis comparing different durations of elapsed time in the definition of concurrent HIV and AIDS diagnosis (1-month, 3-month, and 12-month cut-offs) affected the magnitude of the odds ratios, but not their direction.

The results of these two studies, one describing the characteristics of individuals newly infected with HIV and the other describing persons diagnosed with HIV and AIDS concurrently, can be used as a reference for HIV prevention program planning in Houston/Harris County.
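For orientation only, the sketch below shows a generic window-period (STARHS-style) annualized incidence calculation using the counts reported above; it is not the CDC or HDHHS statistical imputation method used in the first study, and the assumed mean recency window is illustrative.

# Generic STARHS-style annualized incidence sketch (NOT the CDC or HDHHS
# imputation methods, which additionally impute recency status for positives
# lacking STARHS results).
n_tested = 54_394        # persons tested during the two-year period (from the abstract)
n_recent = 139           # specimens classified as recent by STARHS (from the abstract)
window_days = 170        # assumed mean recency window; assay-dependent

# Crude annualized incidence per 100,000 tested persons.
incidence = (n_recent / n_tested) * (365.0 / window_days) * 100_000
print(f"{incidence:.0f} new infections per 100,000 persons per year")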
Abstract:
Logistic regression is one of the most important tools in the analysis of epidemiological and clinical data. Such data often contain missing values for one or more variables. Common practice is to eliminate all individuals for whom any information is missing. This deletion approach does not make efficient use of the available information and often introduces bias.

Two methods were developed to estimate logistic regression coefficients for mixed dichotomous and continuous covariates, including partially observed binary covariates. The data were assumed to be missing at random (MAR). One method (PD) used the predictive distribution as a weight to average the logistic regressions performed over all possible values of the missing observations, and the second method (RS) used a variant of a resampling technique. Seven additional methods were compared with these two approaches in a simulation study: (1) analysis based on only the complete cases, (2) substituting the mean of the observed values for the missing value, (3) an imputation technique based on the proportions of observed data, (4) regressing the partially observed covariates on the remaining continuous covariates, (5) regressing the partially observed covariates on the remaining continuous covariates conditional on the response variable, (6) regressing the partially observed covariates on the remaining continuous covariates and the response variable, and (7) the EM algorithm. Both proposed methods showed smaller standard errors (s.e.) for the coefficient involving the partially observed covariate, as well as for the other coefficients. However, both methods, especially PD, are computationally demanding; thus, for the analysis of large data sets with partially observed covariates, further refinement of these approaches is needed.
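The sketch below captures the flavor of the PD idea for a single partially observed binary covariate as a weighted-likelihood approximation: each incomplete record is expanded into both possible covariate values, weighted by an estimated predictive probability, and a weighted logistic regression is fit. The predictive model here (a logistic regression of the binary covariate on the fully observed covariates and the outcome) is an illustrative stand-in, not the authors' exact predictive distribution or their averaging over completed regressions.

# Sketch: weighted-expansion approximation for a partially observed binary covariate.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pd_style_fit(X_cont, x_bin, y):
    """X_cont: fully observed continuous covariates (2-D array);
    x_bin: binary covariate with np.nan for missing; y: binary outcome."""
    obs = ~np.isnan(x_bin)

    # Predictive model for the binary covariate, fit on complete cases and
    # conditioning on the outcome as well (in the spirit of methods (5)-(6)).
    pred = LogisticRegression().fit(
        np.column_stack([X_cont[obs], y[obs]]), x_bin[obs].astype(int))
    p1 = pred.predict_proba(np.column_stack([X_cont[~obs], y[~obs]]))[:, 1]

    # Expand each incomplete record into x_bin = 0 and x_bin = 1 with weights (1-p1, p1).
    X_full = np.column_stack([X_cont[obs], x_bin[obs]])
    X0 = np.column_stack([X_cont[~obs], np.zeros((~obs).sum())])
    X1 = np.column_stack([X_cont[~obs], np.ones((~obs).sum())])
    X_aug = np.vstack([X_full, X0, X1])
    y_aug = np.concatenate([y[obs], y[~obs], y[~obs]])
    w_aug = np.concatenate([np.ones(obs.sum()), 1.0 - p1, p1])

    # Large C makes the fit effectively unpenalized maximum likelihood.
    final = LogisticRegression(C=1e6).fit(X_aug, y_aug, sample_weight=w_aug)
    return final.coef_, final.intercept_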
Abstract:
My dissertation focuses mainly on Bayesian adaptive designs for phase I and phase II clinical trials. It includes three specific topics: (1) proposing a novel two-dimensional dose-finding algorithm for biological agents, (2) developing Bayesian adaptive screening designs to provide more efficient and ethical clinical trials, and (3) incorporating missing late-onset responses to make an early stopping decision.

Treating patients with novel biological agents is becoming a leading trend in oncology. Unlike cytotoxic agents, for which toxicity and efficacy monotonically increase with dose, biological agents may exhibit non-monotonic patterns in their dose-response relationships. Using a trial with two biological agents as an example, we propose a phase I/II trial design to identify the biologically optimal dose combination (BODC), defined as the dose combination of the two agents with the highest efficacy and tolerable toxicity. A change-point model is used to reflect the fact that the dose-toxicity surface of the combined agents may plateau at higher dose levels, and a flexible logistic model is proposed to accommodate a possible non-monotonic pattern in the dose-efficacy relationship. During the trial, we continuously update the posterior estimates of toxicity and efficacy and assign patients to the most appropriate dose combination. We propose a novel dose-finding algorithm to encourage sufficient exploration of untried dose combinations in the two-dimensional space. Extensive simulation studies show that the proposed design has desirable operating characteristics in identifying the BODC under various patterns of dose-toxicity and dose-efficacy relationships.

Trials of combination therapies for the treatment of cancer are playing an increasingly important role in the battle against this disease. To more efficiently handle the large number of combination therapies that must be tested, we propose a novel Bayesian phase II adaptive screening design to simultaneously select among possible treatment combinations involving multiple agents. Our design is based on formulating the selection procedure as a Bayesian hypothesis testing problem in which the superiority of each treatment combination is equated to a single hypothesis. During trial conduct, we use the current values of the posterior probabilities of all hypotheses to adaptively allocate patients to treatment combinations. Simulation studies show that the proposed design substantially outperforms the conventional multi-arm balanced factorial trial design: it yields a significantly higher probability of selecting the best treatment, allocates substantially more patients to efficacious treatments, and provides higher power to identify the best treatment at the end of the trial. The proposed design is most appropriate for trials that combine multiple agents and screen for the efficacious combination to be further investigated.

Phase II studies are usually single-arm trials conducted to test the efficacy of experimental agents and to decide whether the agents are promising enough to advance to phase III trials. Interim monitoring is employed to stop the trial early for futility, to avoid assigning an unacceptable number of patients to inferior treatments.
We propose a Bayesian single-arm phase II design with continuous monitoring for estimating the response rate of the experimental drug. To address the issue of late-onset responses, we use a piecewise exponential model to estimate the hazard function of the time-to-response data and handle the missing responses using a multiple imputation approach. We evaluate the operating characteristics of the proposed method through extensive simulation studies and show that it reduces the total trial duration and yields desirable operating characteristics across different physician-specified lower bounds of the response rate and different true response rates.
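A minimal sketch of the continuous-monitoring idea behind such a design is shown below, using a conjugate beta-binomial posterior and a hypothetical futility threshold. The piecewise exponential time-to-response model and the multiple imputation of still-pending late-onset responses described above are not reproduced, so pending patients are simply excluded here; the prior, threshold, and counts are illustrative.

# Sketch: Bayesian futility monitoring for a single-arm phase II trial.
from scipy.stats import beta

def futility_stop(n_resp, n_eval, p0=0.20, cutoff=0.05, a=0.5, b=0.5):
    """Stop for futility if the posterior probability that the response rate
    exceeds p0 drops below `cutoff` (Beta(a, b) prior, binomial likelihood)."""
    post = beta(a + n_resp, b + n_eval - n_resp)
    prob_promising = 1.0 - post.cdf(p0)
    return prob_promising < cutoff, prob_promising

# Hypothetical interim look: 3 responses among 24 evaluable patients.
stop, prob = futility_stop(3, 24)
print(f"P(response rate > 0.20 | data) = {prob:.3f}; stop early: {stop}")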
Abstract:
Maximizing data quality may be especially difficult in trauma-related clinical research. Strategies are needed to improve data quality and to assess the impact of data quality on clinical predictive models. This study had two objectives. The first was to compare missing data between two multi-center trauma transfusion studies: a retrospective study (RS) using medical chart data with minimal data quality review, and the PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study with standardized quality assurance. The second was to assess the impact of missing data on clinical prediction algorithms by evaluating blood transfusion prediction models using PROMMTT data. RS (2005-06) and PROMMTT (2009-10) investigated trauma patients receiving ≥1 unit of red blood cells (RBC) at ten Level I trauma centers. Missing data were compared for 33 variables collected in both studies using mixed effects logistic regression (including random intercepts for study site). Massive transfusion (MT) patients received ≥10 RBC units within 24 h of admission. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation based on the multivariate normal distribution. A sensitivity analysis for missing data was conducted to estimate the upper and lower bounds of correct classification using assumptions about the missing data under best and worst case scenarios. Most variables (17/33 = 52%) had <1% missing data in both RS and PROMMTT. Of the remaining variables, 50% demonstrated less missingness in PROMMTT, 25% had less missingness in RS, and 25% were similar between studies. Missing percentages for MT prediction variables in PROMMTT ranged from 2.2% (heart rate) to 45% (respiratory rate). For variables missing >1%, study site was associated with missingness (all p≤0.021). Survival time predicted missingness for 50% of RS and 60% of PROMMTT variables. Complete case proportions for the MT models ranged from 41% to 88%. Complete case analysis and multiple imputation demonstrated similar correct classification results. Sensitivity analysis upper-lower bound ranges for the three MT models were 59-63%, 36-46%, and 46-58%. Prospective collection of ten-fold more variables with data quality assurance reduced overall missing data. Study site and patient survival were associated with missingness, suggesting that data were not missing completely at random and that complete case analysis may lead to biased results. Evaluating clinical prediction model accuracy may be misleading in the presence of missing data, especially with many predictor variables. The proposed sensitivity analysis estimating correct classification under upper (best case) and lower (worst case) bounds may be more informative than multiple imputation, which provided results similar to complete case analysis.
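The bounding idea behind the sensitivity analysis can be sketched as follows: cases with any missing predictor cannot be scored by the model, so the upper bound assumes every such case would have been classified correctly and the lower bound assumes none would have been. Column names and the prediction rule below are hypothetical.

# Sketch: best/worst-case bounds on correct classification under missing predictors.
import pandas as pd

def classification_bounds(df, predictors, outcome, predict_fn):
    """Correct-classification bounds when records with missing predictors
    are assumed all-correct (upper) or all-incorrect (lower)."""
    complete = df[predictors].notna().all(axis=1)
    n = len(df)
    correct = (predict_fn(df.loc[complete, predictors]) ==
               df.loc[complete, outcome]).sum()
    n_missing = (~complete).sum()
    lower = correct / n                   # worst case: all incomplete cases wrong
    upper = (correct + n_missing) / n     # best case: all incomplete cases right
    return lower, upper

# Hypothetical usage with a fitted massive-transfusion prediction model `model`:
# lo, hi = classification_bounds(prommtt_df, mt_predictors, "massive_transfusion",
#                                lambda X: model.predict(X))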
Abstract:
Objective: The purpose of this study is to compare the stages of breast cancer at presentation between insured and uninsured patients diagnosed at The Rose, an active non-profit breast healthcare organization, to determine whether uninsured patients present with more advanced stage breast cancer than their insured counterparts.

Study Design: Retrospective cross-sectional study.

Methods: The study included 1,265 patients who received breast healthcare services and were diagnosed with breast cancer at The Rose between FY 2007 and FY 2012. Of these, 738 patients were presumably uninsured, since their breast healthcare services were sponsored through various funding sources and they were navigated into treatment through The Rose patient navigation program. We compared breast cancer stages for women who had insurance with those who did not. The effects of age and race/ethnicity, along with insurance status, on the stage of breast cancer at diagnosis were also analyzed. We calculated odds ratios using contingency tables and estimated odds ratios (ORs) and 95% confidence intervals (CIs) using ordinal logistic regression, applying a multiple imputation method for missing tumor stage data.

Results: The ordinal logistic regression analysis, with ordered tumor stage as the dependent variable and uninsured status as the independent variable, yielded an odds ratio of 1.73 (OR=1.73; p<0.05; 95% CI: 1.36-2.12).

Conclusions: Insurance status is a strong predictor of the stage of breast cancer diagnosed among women seen at The Rose. Uninsured women seen at The Rose are almost twice as likely to present at an advanced stage of breast cancer compared to their insured counterparts.
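A minimal sketch of the analysis described above, with hypothetical, numerically coded column names (stage on an ordered 0-4 scale, uninsured, age, race): missing tumor stage is multiply imputed with statsmodels' chained-equations MICEData, an ordinal (proportional-odds) logistic model is fit to each imputed dataset, and the insurance log-odds ratio is pooled with Rubin's rules. The Rose's actual variables, coding, and imputation model are not reproduced.

# Sketch: ordinal logistic regression with multiple imputation of missing stage.
import numpy as np
from statsmodels.imputation.mice import MICEData
from statsmodels.miscmodels.ordinal_model import OrderedModel

def pooled_ordinal_or(df, m=10):
    """Impute missing stage m times, fit an ordinal logit each time,
    and pool the log-odds ratio for `uninsured` via Rubin's rules."""
    ests, variances = [], []
    imp = MICEData(df)   # PMM keeps imputed stages on the observed 0-4 scale
    for _ in range(m):
        imp.update_all()
        data = imp.data
        res = OrderedModel(data["stage"], data[["uninsured", "age", "race"]],
                           distr="logit").fit(method="bfgs", disp=False)
        ests.append(res.params["uninsured"])
        variances.append(res.bse["uninsured"] ** 2)
    qbar = np.mean(ests)
    se = np.sqrt(np.mean(variances) + (1.0 + 1.0 / m) * np.var(ests, ddof=1))
    return np.exp(qbar), (np.exp(qbar - 1.96 * se), np.exp(qbar + 1.96 * se))

# Hypothetical usage:
# or_est, ci = pooled_ordinal_or(rose_df)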