12 results for Multivariate Normal Distribution
in DigitalCommons@The Texas Medical Center
Abstract:
Maximizing data quality may be especially difficult in trauma-related clinical research. Strategies are needed to improve data quality and assess the impact of data quality on clinical predictive models. This study had two objectives. The first was to compare missing data between two multi-center trauma transfusion studies: a retrospective study (RS) using medical chart data with minimal data quality review and the PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study with standardized quality assurance. The second objective was to assess the impact of missing data on clinical prediction algorithms by evaluating blood transfusion prediction models using PROMMTT data. RS (2005-06) and PROMMTT (2009-10) investigated trauma patients receiving ≥ 1 unit of red blood cells (RBC) from ten Level I trauma centers. Missing data were compared for 33 variables collected in both studies using mixed effects logistic regression (including random intercepts for study site). Massive transfusion (MT) patients received ≥ 10 RBC units within 24h of admission. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation based on the multivariate normal distribution. A sensitivity analysis for missing data was conducted to estimate the upper and lower bounds of correct classification using assumptions about missing data under best and worst case scenarios. Most variables (17/33=52%) had <1% missing data in RS and PROMMTT. Of the remaining variables, 50% demonstrated less missingness in PROMMTT, 25% had less missingness in RS, and 25% were similar between studies. Missing percentages for MT prediction variables in PROMMTT ranged from 2.2% (heart rate) to 45% (respiratory rate). For variables missing >1%, study site was associated with missingness (all p≤0.021). Survival time predicted missingness for 50% of RS and 60% of PROMMTT variables. 
Complete case proportions for the MT models ranged from 41% to 88%. Complete case analysis and multiple imputation demonstrated similar correct classification results. Sensitivity analysis upper-lower bound ranges for the three MT models were 59-63%, 36-46%, and 46-58%. Prospective collection of ten-fold more variables with data quality assurance reduced overall missing data. Study site and patient survival were associated with missingness, suggesting that data were not missing completely at random and that complete case analysis may lead to biased results. Evaluating clinical prediction model accuracy may be misleading in the presence of missing data, especially with many predictor variables. The proposed sensitivity analysis, which estimates correct classification under upper (best case scenario) and lower (worst case scenario) bounds, may be more informative than multiple imputation, which provided results similar to complete case analysis.
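The best-case and worst-case bounds described in this abstract can be illustrated with a minimal Python sketch. It assumes one plausible reading of the sensitivity analysis, which the abstract does not spell out: patients whose prediction cannot be computed because of missing predictors are all counted as correctly classified for the upper bound and all as misclassified for the lower bound. The function name and the record layout are hypothetical.

```python
def classification_bounds(records):
    """Return (complete_case, lower, upper) correct-classification proportions.

    records: list of (predicted, actual) pairs; predicted is None when the
    model could not be evaluated because a predictor was missing.
    Assumes at least one record has a non-missing prediction.
    """
    n = len(records)
    complete = [(p, a) for p, a in records if p is not None]
    n_correct = sum(1 for p, a in complete if p == a)
    n_missing = n - len(complete)
    complete_case = n_correct / len(complete)   # accuracy among complete cases only
    lower = n_correct / n                       # worst case: every missing case is wrong
    upper = (n_correct + n_missing) / n         # best case: every missing case is right
    return complete_case, lower, upper
```

The width of the interval between the two bounds grows with the share of incomplete cases, which is why models with many predictors tend to produce wider ranges.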
Abstract:
Environmental data sets of pollutant concentrations in air, water, and soil frequently include unquantified sample values reported only as being below the analytical method detection limit. These values, referred to as censored values, should be considered in the estimation of distribution parameters, as each represents some value of pollutant concentration between zero and the detection limit. Most of the currently accepted methods for estimating the population parameters of environmental data sets containing censored values rely upon the assumption of an underlying normal (or transformed normal) distribution. This assumption can result in unacceptable levels of error in parameter estimation due to the unbounded left tail of the normal distribution. With the beta distribution, which is bounded over the same range as a distribution of concentrations, $[0 \le x \le 1]$, parameter estimation errors resulting from improper distribution bounds are avoided. This work developed a method that uses the beta distribution to estimate population parameters from censored environmental data sets and evaluated its performance in comparison to currently accepted methods that rely upon an underlying normal (or transformed normal) distribution. Data sets were generated assuming typical values encountered in environmental pollutant evaluation for mean, standard deviation, and number of variates. For each set of model values, data sets were generated assuming that the data were distributed either normally, lognormally, or according to a beta distribution. For varying levels of censoring, two established methods of parameter estimation, regression on normal order statistics and regression on lognormal order statistics, were used to estimate the known mean and standard deviation of each data set. The method developed for this study, employing a beta distribution assumption, was also used to estimate parameters, and the relative accuracy of all three methods was compared.
For data sets of all three distribution types, and for censoring levels up to 50%, the performance of the new method equaled, if not exceeded, the performance of the two established methods. Because of its robustness in parameter estimation regardless of distribution type or censoring level, the method employing the beta distribution should be considered for full development in estimating parameters for censored environmental data sets.
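As an illustration of one of the established comparison methods named in this abstract, the following Python sketch shows a simple form of regression on normal order statistics. It is not the beta-based method developed in the study, whose details the abstract does not give. The sketch assumes a single detection limit (so censored values occupy the lowest ranks), at least two detected values, and Blom plotting positions; the function name and the plotting-position choice are assumptions.

```python
from statistics import NormalDist, mean, stdev

def ros_normal(detected, n_censored):
    """Estimate mean and SD of a left-censored sample (single detection limit)."""
    n = len(detected) + n_censored
    # Blom plotting positions (an assumed choice) mapped to standard normal quantiles.
    z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    z_cens, z_obs = z[:n_censored], z[n_censored:]
    obs = sorted(detected)
    # Least-squares line, value = intercept + slope * z, fitted to detected values only.
    z_bar, y_bar = mean(z_obs), mean(obs)
    sxx = sum((q - z_bar) ** 2 for q in z_obs)
    sxy = sum((q - z_bar) * (y - y_bar) for q, y in zip(z_obs, obs))
    slope = sxy / sxx
    intercept = y_bar - slope * z_bar
    # Fill in the censored ranks from the fitted line, then summarize the completed sample.
    completed = [intercept + slope * q for q in z_cens] + obs
    return mean(completed), stdev(completed)
```

Because the fitted line extends indefinitely to the left, the filled-in values can fall below zero, which is the unbounded-tail problem the abstract raises for normal-based methods.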
Abstract:
The purpose of this study is to investigate the effects of predictor variable correlations and patterns of missingness with dichotomous and/or continuous data in small samples when missing data are multiply imputed. Missing data of predictor variables are multiply imputed under three different multivariate models: the multivariate normal model for continuous data, the multinomial model for dichotomous data and the general location model for mixed dichotomous and continuous data. Subsequent to the multiple imputation process, Type I error rates of the regression coefficients obtained with logistic regression analysis are estimated under various conditions of correlation structure, sample size, type of data and patterns of missing data. The distributional properties of average mean, variance and correlations among the predictor variables are assessed after the multiple imputation process. For continuous predictor data under the multivariate normal model, Type I error rates are generally within the nominal values with samples of size n = 100. Smaller samples of size n = 50 resulted in more conservative estimates (i.e., lower than the nominal value). Correlation and variance estimates of the original data are retained after multiple imputation with less than 50% missing continuous predictor data. For dichotomous predictor data under the multinomial model, Type I error rates are generally conservative, which in part is due to the sparseness of the data. The correlation structure for the predictor variables is not well retained on multiply-imputed data from small samples with more than 50% missing data with this model. For mixed continuous and dichotomous predictor data, the results are similar to those found under the multivariate normal model for continuous data and under the multinomial model for dichotomous data.
With all data types, including a fully observed variable alongside variables subject to missingness, in both the multiple imputation process and the subsequent statistical analysis, produced liberal (larger than nominal) Type I error rates under a specific pattern of missing data. It is suggested that future studies focus on the effects of multiple imputation in multivariate settings with more realistic data characteristics and a variety of multivariate analyses, assessing both Type I error and power.
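For readers unfamiliar with how a regression coefficient is summarized across multiply imputed data sets, the sketch below shows the standard combining rules (Rubin's rules). The abstract does not state which pooling procedure was used, so this is an assumption; the function name is hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets (m >= 2).

    estimates: the coefficient estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

The between-imputation term is what inflates the standard error to reflect uncertainty about the missing values; a test of the pooled coefficient that ignored it would tend to reject too often.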
Abstract:
Interaction effects are of scientific interest in many areas of research. A common approach for investigating the interaction effect of two continuous covariates on a response variable is through a cross-product term in multiple linear regression. In epidemiological studies, the two-way analysis of variance (ANOVA) type of method has also been utilized to examine the interaction effect by replacing the continuous covariates with their discretized levels. However, the implications of the model assumptions of either approach have not been examined, and the statistical validation has focused only on the general method, not specifically on the interaction effect. In this dissertation, we investigated the validity of both approaches based on the mathematical assumptions for non-skewed data. We showed that linear regression may not be an appropriate model when the interaction effect exists because it implies a highly skewed distribution for the response variable. We also showed that the normality and constant variance assumptions required by ANOVA are not satisfied in the model where the continuous covariates are replaced with their discretized levels. Therefore, naïve application of the ANOVA method may lead to an incorrect conclusion. Given the problems identified above, we proposed a novel method, modified from the traditional ANOVA approach, to rigorously evaluate the interaction effect. The analytical expression of the interaction effect was derived based on the conditional distribution of the response variable given the discretized continuous covariates. A testing procedure that combines the p-values from each level of the discretized covariates was developed to test the overall significance of the interaction effect. According to the simulation study, the proposed method is more powerful than the least squares regression and the ANOVA method in detecting the interaction effect when the data come from a trivariate normal distribution.
The proposed method was applied to a dataset from the National Institute of Neurological Disorders and Stroke (NINDS) tissue plasminogen activator (t-PA) stroke trial, and the baseline age-by-weight interaction effect was found to be significant in predicting the change from baseline in NIHSS at Month-3 among patients who received t-PA therapy.
Abstract:
Health departments, research institutions, policy-makers, and healthcare providers are often interested in knowing the health status of their clients/constituents. Without the resources, financially or administratively, to go out into the community and conduct health assessments directly, these entities frequently rely on data from population-based surveys to supply the information they need. Unfortunately, these surveys are ill-equipped for the job due to sample size and privacy concerns. Small area estimation (SAE) techniques have excellent potential in such circumstances, but have been underutilized in public health due to a lack of awareness of, and confidence in applying, these methods. The goal of this research is to make model-based SAE accessible to a broad readership using clear, example-based learning. Specifically, we applied the principles of multilevel, unit-level SAE to describe the geographic distribution of HPV vaccine coverage among females aged 11-26 in Texas. Multilevel (3 level: individual, county, public health region) random-intercept logit models of HPV vaccination (receipt of ≥ 1 dose Gardasil®) were fit to data from the 2008 Behavioral Risk Factor Surveillance System (outcome and level 1 covariates) and a number of secondary sources (group-level covariates). Sampling weights were scaled (level 1) or constructed (levels 2 & 3), and incorporated at every level. Using the regression coefficients (and standard errors) from the final models, I simulated 10,000 datasets for each regression coefficient from the normal distribution and applied them to the logit model to estimate HPV vaccine coverage in each county and respective demographic subgroup. For simplicity, I only provide coverage estimates (and 95% confidence intervals) for counties. County-level coverage among females aged 11-17 varied from 6.8% to 29.0%. For females aged 18-26, coverage varied from 1.9% to 23.8%.
Aggregated to the state level, these values translate to indirect state estimates of 15.5% and 11.4%, respectively, both of which fall within the confidence intervals for the direct estimates of HPV vaccine coverage in Texas (Females 11-17: 17.7%, 95% CI: 13.6, 21.9; Females 18-26: 12.0%, 95% CI: 6.2, 17.7). Small area estimation has great potential for informing policy, program development and evaluation, and the provision of health services. Harnessing the flexibility of multilevel, unit-level SAE to estimate HPV vaccine coverage among females aged 11-26 in Texas counties, I have provided (1) practical guidance on how to conceptualize and conduct model-based SAE, (2) a robust framework that can be applied to other health outcomes or geographic levels of aggregation, and (3) HPV vaccine coverage data that may inform the development of health education programs, the provision of health services, the planning of additional research studies, and the creation of local health policies.
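The simulation step described in this abstract (drawing each regression coefficient from a normal distribution and pushing the draws through the logit model) can be sketched as follows. This is a simplified illustration under stated assumptions: coefficients are drawn independently using only their standard errors, random intercepts and sampling weights are omitted, and the interval is taken from percentiles of the simulated probabilities. The function name and all arguments are hypothetical.

```python
import math
import random

def simulate_coverage(coefs, std_errs, covariates, n_sims=10_000, seed=0):
    """Return (mean, lower, upper) for the predicted probability.

    coefs, std_errs: fitted logit coefficients and their standard errors
    covariates: covariate values for one area or subgroup (1.0 for the intercept)
    """
    rng = random.Random(seed)
    probs = []
    for _ in range(n_sims):
        # One independent normal draw per coefficient (assumed: no covariance terms).
        eta = sum(rng.gauss(b, se) * x for b, se, x in zip(coefs, std_errs, covariates))
        probs.append(1.0 / (1.0 + math.exp(-eta)))  # inverse logit
    probs.sort()
    lower = probs[int(0.025 * n_sims)]
    upper = probs[int(0.975 * n_sims) - 1]
    return sum(probs) / n_sims, lower, upper
```

A fuller implementation would draw the coefficients jointly from a multivariate normal distribution with the estimated covariance matrix; independent draws are used here only to keep the sketch free of external libraries.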
Abstract:
Strategies are compared for the development of a linear regression model with stochastic (multivariate normal) regressor variables and the subsequent assessment of its predictive ability. Bias and mean squared error of four estimators of predictive performance are evaluated in simulated samples of 32 population correlation matrices. Models including all of the available predictors are compared with those obtained using selected subsets. The subset selection procedures investigated include two stopping rules, $C_p$ and $S_p$, each combined with an 'all possible subsets' or 'forward selection' of variables. The estimators of performance utilized include parametric ($\mathrm{MSEP}_m$) and non-parametric (PRESS) assessments in the entire sample, and two data splitting estimates restricted to a random or balanced (Snee's DUPLEX) 'validation' half sample. The simulations were performed as a designed experiment, with population correlation matrices representing a broad range of data structures. The techniques examined for subset selection do not generally result in improved predictions relative to the full model. Approaches using 'forward selection' result in slightly smaller prediction errors and less biased estimators of predictive accuracy than 'all possible subsets' approaches, but no differences are detected between the performances of $C_p$ and $S_p$. In every case, prediction errors of models obtained by subset selection in either of the half splits exceed those obtained using all predictors and the entire sample. Only the random split estimator is conditionally (on $\beta$) unbiased; however, $\mathrm{MSEP}_m$ is unbiased on average and PRESS is nearly so in unselected (fixed form) models. When subset selection techniques are used, $\mathrm{MSEP}_m$ and PRESS always underestimate prediction errors, by as much as 27 percent (on average) in small samples.
Despite their bias, the mean squared errors (MSE) of these estimators are at least 30 percent less than that of the unbiased random split estimator. The DUPLEX split estimator suffers from large MSE as well as bias, and seems of little value within the context of stochastic regressor variables. To maximize predictive accuracy while retaining a reliable estimate of that accuracy, it is recommended that the entire sample be used for model development, and a leave-one-out statistic (e.g. PRESS) be used for assessment.
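The leave-one-out statistic recommended in this abstract, PRESS, can be sketched in a few lines of Python. For brevity the example assumes a single predictor and refits the line explicitly for each held-out observation, whereas the study concerns models with several multivariate normal predictors; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def press(xs, ys):
    """Sum of squared leave-one-out prediction errors (xs and ys are lists)."""
    total = 0.0
    for i in range(len(xs)):
        # Refit without observation i, then predict the held-out point.
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total

def sse(xs, ys):
    """In-sample residual sum of squares, for comparison with PRESS."""
    b0, b1 = fit_line(xs, ys)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
```

For a least squares fit, PRESS is never smaller than the in-sample residual sum of squares, which is why it gives a less optimistic view of predictive accuracy while still using the entire sample for model development.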
Abstract:
OPN is a secreted, phosphate-containing protein which is expressed by osteoblasts and a variety of other cells in vivo. Data relating OPN to cellular transformation have accumulated from in vitro studies. We hypothesize that OPN expression is associated with neoplastic disease in humans, as suggested by cell culture models. The overall objective of the current study was to determine the tissue distribution of OPN in human malignancy and to determine whether or not a correlation exists between OPN serum levels and malignancy. At the inception of this project, no study had demonstrated the relevance of OPN expression to naturally occurring neoplastic disease in humans. To date, few studies have reported OPN distribution in human neoplasia, and those that have are limited by either the number of specimens analyzed or the technique used in analysis. In this dissertation study, OPN was purified from human milk, and $\alpha$-OPN antiserum was developed and characterized. Following antibody development, the distribution and prevalence of OPN in human oral squamous cell carcinoma and human prostate carcinoma were evaluated using immunohistochemical localization. OPN immunolocalization was found in a high percentage of oral epithelial dysplasia and oral squamous cell carcinoma in humans. One oral squamous cell carcinoma cell line, UMSCC-1, was found to express OPN mRNA using Northern blotting. OPN localized to a high percentage of primary prostate adenocarcinomas. OPN localized to 52% of androgen dependent cases and 100% of androgen independent cases. Androgen dependent cell lines such as LNCaP and NbE showed minimal OPN mRNA expression, while the androgen independent lines C4-2 and PC3 produced ample OPN mRNA. An OPN sandwich assay was developed and used to determine the serum level of OPN in normal males, patients with BPH (benign prostate hypertrophy), and patients with prostate carcinoma.
No statistically significant difference was found in OPN serum levels among the three groups. However, a trend of increasing OPN in the serum was noted in patients with BPH and prostate cancer.
Abstract:
Nuclear morphometry (NM) uses image analysis to measure features of the cell nucleus, which are classified as bulk properties, shape or form, and DNA distribution. Studies have used these measurements as diagnostic and prognostic indicators of disease with inconclusive results. The distributional properties of these variables have not been systematically investigated, although much of the medical data exhibit nonnormal distributions. Measurements are done on several hundred cells per patient, so summary measurements reflecting the underlying distribution are needed. Distributional characteristics of 34 NM variables from prostate cancer cells were investigated using graphical and analytical techniques. Cells per sample ranged from 52 to 458. A small sample of patients with benign prostatic hyperplasia (BPH), representing non-cancer cells, was used for general comparison with the cancer cells. Data transformations such as log, square root and 1/x did not yield normality as measured by the Shapiro-Wilk test for normality. A modulus transformation, used for distributions having abnormal kurtosis values, also did not produce normality. Kernel density histograms of the 34 variables exhibited non-normality, and 18 variables also exhibited bimodality. A bimodality coefficient was calculated, and three variables (DNA concentration, shape, and elongation) showed the strongest evidence of bimodality and were studied further. Two analytical approaches were used to obtain a summary measure for each variable for each patient: cluster analysis to determine significant clusters, and a mixture model analysis using a two-component model having Gaussian distributions with equal variances. The mixture component parameters were used to bootstrap the log likelihood ratio to determine the significant number of components, 1 or 2. These summary measures were used as predictors of disease severity in several proportional odds logistic regression models.
The disease severity scale had 5 levels and was constructed of 3 components: extracapsular penetration (ECP), lymph node involvement (LN+) and seminal vesicle involvement (SV+), which represent surrogate measures of prognosis. The summary measures were not strong predictors of disease severity. There was some indication from the mixture model results that there were changes in mean levels and proportions of the components in the lower severity levels.
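The abstract mentions a bimodality coefficient without giving its formula. One widely used definition, documented for SAS, combines sample skewness and excess kurtosis; the sketch below assumes that definition, which may differ from the one used in the dissertation. Values above 5/9 (about 0.555, the value for a uniform distribution) are conventionally read as suggesting bimodality. The function name is hypothetical.

```python
import math

def bimodality_coefficient(xs):
    """Bimodality coefficient from bias-corrected skewness and excess kurtosis (n >= 4)."""
    n = len(xs)
    x_bar = sum(xs) / n
    m2 = sum((x - x_bar) ** 2 for x in xs) / n
    m3 = sum((x - x_bar) ** 3 for x in xs) / n
    m4 = sum((x - x_bar) ** 4 for x in xs) / n
    g1 = m3 / m2 ** 1.5          # moment skewness
    g2 = m4 / m2 ** 2 - 3        # moment excess kurtosis
    # Small-sample corrections for skewness and excess kurtosis.
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return (skew ** 2 + 1) / (kurt + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
```

Because the coefficient also rises with skewness, a high value is only a screening signal; the abstract's follow-up with cluster analysis and a two-component Gaussian mixture is the kind of confirmation such a screen calls for.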
Abstract:
Previous studies of normal children have linked body fat, but not body fat distribution (BFD), to higher blood pressures, lipids, and insulin resistance (Berenson et al., 1988). BFD is a well-established risk factor for cardiovascular disease in adults (Björntorp, 1988). This study investigates the relation of BFD and serum lipids at baseline in children from Project HeartBeat!, a study of the growth and development of cardiovascular risk factors in 678 children in three cohorts measured initially at ages 8, 11, and 14 years. Initially, two of four indices of BFD were significantly related to the lipids: ratio of upper to lower body skinfolds (ln US:LS) and conicity (C Index). A factor analysis reduced the information in the serum lipids to two vectors: (1) total cholesterol + LDL-cholesterol and (2) HDL-cholesterol − triglycerides, which together accounted for 85% of the lipid variation. Using each serum lipid and vector as separate dependent variables, linear and quadratic regression models were constructed to examine the predictive ability of the two BFD variables, controlling for total body fat, gender, ethnicity (Black, non-Black) and maturation. Linear models provided an acceptable fit. Percent body fat (%BF) was a significant predictor in every lipid model, independent of age, maturation, or ethnicity (p ≤ 0.05). No BFD variable entered the equation for total or LDL-cholesterol, although there was a significant maturity by BFD interaction for LDL (ln US:LS was a significant predictor in more mature individuals). Both %BF and BFD (by way of Conicity) were significant predictors of HDL-cholesterol and triglycerides (p ≤ 0.01). All models were statistically significant at a high level (p ≤ 0.01), but adjusted $R^2$ values for all models were low (0.05–0.15). Body fat distribution is a significant predictor of lipids in normal children, but secondarily to %BF, and for LDL-cholesterol in particular, the relation is dependent on maturity status.
Abstract:
Current statistical methods for estimation of parametric effect sizes from a series of experiments are generally restricted to univariate comparisons of standardized mean differences between two treatments. Multivariate methods are presented for the case in which effect size is a vector of standardized multivariate mean differences and the number of treatment groups is two or more. The proposed methods employ a vector of independent sample means for each response variable that leads to a covariance structure which depends only on correlations among the $p$ responses on each subject. Using weighted least squares theory and the assumption that the observations are from normally distributed populations, multivariate hypotheses analogous to common hypotheses used for testing effect sizes were formulated and tested for treatment effects which are correlated through a common control group, through multiple response variables observed on each subject, or both conditions. The asymptotic multivariate distribution for correlated effect sizes is obtained by extending univariate methods for estimating effect sizes which are correlated through common control groups. The joint distributions of vectors of effect sizes (from $p$ responses on each subject) from one treatment and one control group and from several treatment groups sharing a common control group are derived. Methods are given for estimation of linear combinations of effect sizes when certain homogeneity conditions are met, and for estimation of vectors of effect sizes and confidence intervals from $p$ responses on each subject. Computational illustrations are provided using data from studies of effects of electric field exposure on small laboratory animals.
Abstract:
Body fat distribution is a cardiovascular health risk factor in adults. Body fat distribution can be measured through various methods including anthropometry. It is not clear which anthropometric index is suitable for epidemiologic studies of fat distribution and cardiovascular disease. The purpose of the present study was to select a measure of body fat distribution from among a series of indices (those traditionally used in the literature and others constructed from the analysis) that is most highly correlated with lipid-related variables and is independent of overall fatness. Subjects were Mexican-American men and women (N = 1004) from a study of gallbladder disease in Starr County, Texas. Multivariate associations were sought between lipid profile measures (lipids, lipoproteins, and apolipoproteins) and two sets of anthropometric variables (4 circumferences and 6 skinfolds). This was done to assess the association between lipid-related measures and the two sets of anthropometric variables and guide the construction of indices. Two indices emerged from the analysis that seemed to be highly correlated with lipid profile measures independent of obesity. These indices are: 2*arm circumference-thigh skinfold in pre- and post-menopausal women and arm/thigh circumference ratio in men. Next, using the sum of all skinfolds to represent obesity and the selected body fat distribution indices, the following hypotheses were tested: (1) state of obesity and centrally/upper distributed body fat are equally predictive of lipids, lipoproteins and apolipoproteins, and (2) the correlation among the lipid-related measures is not altered by obesity and body fat distribution. With respect to the first hypothesis, the present study found that most lipids, lipoproteins and apolipoproteins were significantly associated with both overall fatness and anatomical location of body fat in both sex and menopausal groups.
However, within men and post-menopausal women, certain lipid profile measures (triglyceride and HDLT among post-menopausal women and apos C-II, C-III, and E among men) had substantially higher correlation with body fat distribution as compared with overall fatness. With respect to the second hypothesis, both obesity and body fat distribution were found to alter the association among plasma lipid variables in men and women. There was a suggestion from the data that the patterns of correlations among men and post-menopausal women are more comparable. Among men, correlations involving apo A-I, HDLT, and HDL$_2$ seemed greatly influenced by obesity, and A-II by fat distribution; among post-menopausal women, correlations involving apos A-I and A-II were highly affected by the location of body fat. Thus, these data point out that not only can obesity and fat distribution affect levels of single measures, they also can markedly influence the pattern of relationship among measures. The fact that such changes are seen for both obesity and fat distribution is significant, since the indices employed were chosen because they were independent of one another.
Abstract:
Background: The physical characteristic of protons is that they deliver most of their radiation dose to the target volume and deliver no dose to the normal tissue distal to the tumor. Previously, numerous studies have shown unique advantages of proton therapy over intensity-modulated radiation therapy (IMRT) in conforming dose to the tumor and sparing dose to the surrounding normal tissues and the critical structures in many clinical sites. However, proton therapy is known to be more sensitive to treatment uncertainties such as inter- and intra-fractional variations in patient anatomy. To date, no study has clearly demonstrated the effectiveness of proton therapy compared with the conventional IMRT under the consideration of both respiratory motion and tumor shrinkage in non-small cell lung cancer (NSCLC) patients. Purpose: This thesis investigated two questions for establishing a clinically relevant comparison of the two different modalities (IMRT and proton therapy). The first question was whether or not there are any differences in tumor shrinkage between patients randomized to IMRT versus passively scattered proton therapy (PSPT). Tumor shrinkage is considered a standard measure of radiation therapy response that has been widely used to gauge a short-term progression of radiation therapy. The second question was whether or not there are any differences between the planned dose and 5D dose under the influence of inter- and intra-fractional variations in the patient anatomy for both modalities. Methods: A total of 45 patients (25 IMRT patients and 20 PSPT patients) were used to quantify the tumor shrinkage in terms of the change of the primary gross tumor volume (GTVp). All patients were randomized to receive either IMRT or PSPT for NSCLC. Treatment planning goals were identical for both groups. All patients received 5 to 8 weekly repeated 4-dimensional computed tomography (4DCT) scans during the course of radiation treatments. 
The original GTVp contours were propagated to T50 of weekly 4DCT images using deformable image registration and their absolute volumes were measured. Statistical analysis was performed to compare the distribution of tumor shrinkage between the two population groups. In order to investigate the difference between the planned dose and the 5D dose with consideration of both breathing motion and anatomical change, we re-calculated new dose distributions at every phase of the breathing cycle for all available weekly 4DCT data sets, which resulted in 50 to 80 individual dose calculations for each of the 7 patients presented in this thesis. The newly calculated dose distributions were then deformed and accumulated to T50 of the planning 4DCT for comparison with the planned dose distribution. Results: At the end of the treatment, the IMRT and PSPT groups showed mean tumor volume reductions of 23.6% (± 19.2%) and 20.9% (± 17.0%), respectively. Moreover, the mean difference in tumor shrinkage between the two groups was 3%, with a corresponding 95% confidence interval of [-8%, 14%]. The rate of tumor shrinkage was highly correlated with the initial tumor volume size. For the planning dose and 5D dose comparison study, all 7 patients showed a mean difference of 1% in terms of target coverage for both IMRT and PSPT treatment plans. Conclusions: The results of the tumor shrinkage investigation showed no statistically significant difference in tumor shrinkage between the IMRT and PSPT patients, and the tumor shrinkage between the two modalities was similar based on the 95% confidence interval. From the pilot study comparing the planned dose with the 5D dose, we found the difference to be only 1%. Overall, the two modalities, which were planned under the same protocol (i.e., a randomized trial), showed similar results in terms of treatment response as measured by tumor shrinkage and in terms of 5D dose under the influence of anatomical change.
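The reported confidence interval for the difference in mean tumor shrinkage can be approximated from the summary figures in this abstract. The sketch below assumes that the parenthetical values (19.2% and 17.0%) are standard deviations, that the group sizes are the 25 IMRT and 20 PSPT patients given in the Methods, and that a large-sample normal interval is adequate; the thesis may have used a different procedure. The function name is hypothetical.

```python
import math

def diff_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Difference in means with an approximate 95% interval (unequal variances)."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # standard error of the difference
    return diff, diff - z * se, diff + z * se

# Summary values taken from the abstract (SD interpretation is an assumption).
diff, low, high = diff_ci(23.6, 19.2, 25, 20.9, 17.0, 20)
print(f"difference = {diff:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
# prints: difference = 2.7, 95% CI = [-7.9, 13.3]
```

Under these assumptions the result is close to the reported 3% and [-8%, 14%]; the small gap would be consistent with a t-based interval or with a calculation from patient-level data. The interval spans zero, matching the conclusion that no statistically significant difference was found.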