992 results for Multiple Imputation
Abstract:
Maximizing data quality may be especially difficult in trauma-related clinical research. Strategies are needed to improve data quality and assess the impact of data quality on clinical predictive models. This study had two objectives. The first was to compare missing data between two multi-center trauma transfusion studies: a retrospective study (RS) using medical chart data with minimal data quality review and the PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study with standardized quality assurance. The second objective was to assess the impact of missing data on clinical prediction algorithms by evaluating blood transfusion prediction models using PROMMTT data. RS (2005-06) and PROMMTT (2009-10) investigated trauma patients receiving ≥ 1 unit of red blood cells (RBC) from ten Level I trauma centers. Missing data were compared for 33 variables collected in both studies using mixed effects logistic regression (including random intercepts for study site). Massive transfusion (MT) patients received ≥ 10 RBC units within 24h of admission. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation based on the multivariate normal distribution. A sensitivity analysis for missing data was conducted to estimate the upper and lower bounds of correct classification using assumptions about missing data under best and worst case scenarios. Most variables (17/33=52%) had <1% missing data in RS and PROMMTT. Of the remaining variables, 50% demonstrated less missingness in PROMMTT, 25% had less missingness in RS, and 25% were similar between studies. Missing percentages for MT prediction variables in PROMMTT ranged from 2.2% (heart rate) to 45% (respiratory rate). For variables missing >1%, study site was associated with missingness (all p≤0.021). Survival time predicted missingness for 50% of RS and 60% of PROMMTT variables. 
Complete case proportions for the three MT models ranged from 41% to 88%. Complete case analysis and multiple imputation produced similar correct classification results. Sensitivity analysis upper-lower bound ranges for the three MT models were 59-63%, 36-46%, and 46-58%. Prospective collection of ten-fold more variables with data quality assurance reduced overall missing data. Study site and patient survival were associated with missingness, suggesting that data were not missing completely at random and that complete case analysis may lead to biased results. Evaluating clinical prediction model accuracy may be misleading in the presence of missing data, especially with many predictor variables. The proposed sensitivity analysis, estimating correct classification under upper (best case scenario) and lower (worst case scenario) bounds, may be more informative than multiple imputation, which provided results similar to complete case analysis.
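The contrast drawn here between complete case analysis and multivariate-normal multiple imputation can be illustrated with a toy sketch. The data, missingness mechanism, and regression model below are invented for illustration and are not from the study; a conditional-normal draw stands in for the multivariate normal imputation model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bivariate data standing in for two correlated clinical variables.
n = 500
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.8 * x1 + rng.normal(0.0, 0.6, n)

# x2 goes missing with probability depending on x1 (MAR, not MCAR).
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x1))
x2_obs = x2.copy()
x2_obs[missing] = np.nan

# Complete case estimate of E[x2]: biased under this MAR mechanism.
cc_mean = np.nanmean(x2_obs)

# Normal-model MI: regress x2 on x1 in complete cases, draw imputations
# with residual noise, and pool the completed-data means.
cc = ~missing
beta, intercept = np.polyfit(x1[cc], x2_obs[cc], 1)
resid_sd = np.std(x2_obs[cc] - (beta * x1[cc] + intercept))

m = 20
pooled = []
for _ in range(m):
    completed = x2_obs.copy()
    completed[missing] = (beta * x1[missing] + intercept
                          + rng.normal(0.0, resid_sd, missing.sum()))
    pooled.append(completed.mean())
mi_mean = float(np.mean(pooled))

true_mean = x2.mean()
print(cc_mean, mi_mean, true_mean)
```

Because missingness depends on an observed covariate, the complete case mean is biased while the imputation-based mean tracks the full-data mean, which is the bias pattern the abstract warns about.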
Abstract:
Objective: The purpose of this study is to compare the stages of breast cancer at presentation between insured and uninsured patients diagnosed at The Rose, an active non-profit breast healthcare organization, to determine whether uninsured patients present with more advanced-stage breast cancer than their insured counterparts. Study Design: Retrospective cross-sectional study. Methods: The study included 1,265 patients who received breast healthcare services and were diagnosed with breast cancer at The Rose between FY 2007 and FY 2012. Of these, 738 patients were presumably uninsured, since their breast healthcare services were sponsored through various funding sources and they were navigated into treatment through The Rose patient navigation program. We compared breast cancer stages for women who had insurance with those for women who did not. The effects of age and race/ethnicity, along with insurance status, on the stage of breast cancer at diagnosis were also analyzed. We calculated odds ratios using contingency tables, and estimated odds ratios (ORs) and 95% confidence intervals (CIs) using ordinal logistic regression, applying a multiple imputation method for missing tumor stage data. Results: The ordinal logistic regression analysis with ordered tumor stage as the dependent variable and uninsured status as the independent variable gave an odds ratio of 1.73 (OR=1.73; p-value<0.05; 95% CI: 1.36 - 2.12). Conclusions: Insurance status is a strong predictor of the stage of breast cancer diagnosed among women seen at The Rose. Uninsured women seen at The Rose are almost twice as likely to present at an advanced stage of breast cancer as their insured counterparts.
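When an analysis like this one is run on multiply imputed datasets, the per-dataset estimates are typically combined with Rubin's rules. A minimal sketch of that pooling step follows; the log-odds-ratio estimates and variances are invented for illustration and are not the study's values.

```python
import math

# Hypothetical log-odds-ratio estimates and squared standard errors from
# m = 5 imputed datasets (numbers invented for illustration).
estimates = [0.52, 0.58, 0.49, 0.61, 0.55]
variances = [0.012, 0.011, 0.013, 0.012, 0.012]

m = len(estimates)
q_bar = sum(estimates) / m                    # pooled log-OR
w_bar = sum(variances) / m                    # within-imputation variance
b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
t = w_bar + (1 + 1 / m) * b                   # total variance (Rubin's rules)

se = math.sqrt(t)
or_pooled = math.exp(q_bar)
ci = (math.exp(q_bar - 1.96 * se), math.exp(q_bar + 1.96 * se))
print(or_pooled, ci)
```

The between-imputation term inflates the variance to reflect uncertainty about the missing tumor stages, which a single imputation would understate.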
Abstract:
Generally, in genotype-by-environment (G × E) experiments it is common to observe the behavior of the genotypes with respect to several attributes across the environments considered. The analysis of this type of experiment has been widely addressed for the case of a single attribute. This thesis presents some alternative analyses that consider genotypes, environments, and attributes simultaneously. The first is based on the maximum likelihood mixture clustering method (Mixclus) and three-mode principal component analysis (3MPCA), which allow the analysis of three-way tables; these two methods have been used extensively in psychology and chemistry, but little in agriculture. The second is a methodology that combines the additive main effects and multiplicative interaction (AMMI) model, an efficient model for the analysis of single-attribute (G × E) experiments, with generalized Procrustes analysis, which makes it possible to compare configurations of points and to provide a numerical measure of how much they differ. Finally, an alternative for data imputation in (G × E) experiments is presented, since missing data are a very frequent occurrence in these experiments. It is concluded that the proposed methodologies constitute useful tools for the analysis of multi-attribute (G × E) experiments.
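The Procrustes step mentioned above aligns point configurations by rotation before measuring how much they differ. A minimal two-configuration (orthogonal Procrustes) sketch with invented coordinates:

```python
import numpy as np

# Two point configurations (e.g., genotype scores under two attributes),
# identical up to a rotation; coordinates invented for illustration.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
theta = np.deg2rad(30)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
B = A @ rot

# Orthogonal Procrustes: the rotation minimizing ||A - B R||_F is
# R = U V', where B'A = U S V' is a singular value decomposition.
u, _, vt = np.linalg.svd(B.T @ A)
R = u @ vt
fitted = B @ R

residual = np.linalg.norm(A - fitted)  # residual after alignment
print(residual)
```

Here the residual is essentially zero because the configurations differ only by rotation; with real multi-attribute data the residual is the numerical measure of disagreement.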
Abstract:
Biplot analyses based on additive main effects and multiplicative interaction (AMMI) models require complete data matrices, but multi-environment trials frequently contain missing data. This thesis proposes new single and multiple imputation methodologies that can be used to analyze unbalanced data in genotype-by-environment (G×E) interaction experiments. The first is a new extension of the eigenvector cross-validation method (Bro et al., 2008). The second is a new nonparametric algorithm obtained through modifications of the single imputation method developed by Yan (2013). A study is also included that considers imputation systems recently reported in the literature and compares them with the classical procedure recommended for imputation in (G×E) trials, namely the combination of the Expectation-Maximization algorithm with AMMI models (EM-AMMI). Finally, generalizations are provided of the single imputation described by Arciniegas-Alarcón et al. (2010), which mixes regression with lower-rank approximation of a matrix. All of the methodologies are based on the singular value decomposition (SVD) and are therefore free of distributional or structural assumptions. To assess the performance of the new imputation schemes, simulations were carried out based on real data sets from different species, with values deleted at random at different percentages, and the quality of the imputations was evaluated with several statistics. It was concluded that the SVD is a useful and flexible tool for building efficient techniques that overcome the problem of information loss in experimental matrices.
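Since all of the proposed methods rest on the SVD, a minimal sketch of the generic iterative-SVD imputation idea (not the thesis's specific algorithms) may help: fill missing cells with column means, then alternate a low-rank SVD fit with re-imputation until the filled values stabilize. The genotype-by-environment table is simulated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy genotype-by-environment table with rank-1 structure plus noise.
g = rng.normal(0, 1, (15, 1))
e = rng.normal(0, 1, (1, 8))
Y = g @ e + rng.normal(0, 0.05, (15, 8))

# Delete 20% of cells at random.
mask = rng.random(Y.shape) < 0.2
Y_obs = Y.copy()
Y_obs[mask] = np.nan

def svd_impute(X, rank=1, n_iter=200, tol=1e-8):
    """Iterative SVD imputation: mean-fill, then alternate a rank-k
    approximation with re-imputation of the missing cells."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.nonzero(miss)[1])
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        delta = np.abs(X[miss] - low_rank[miss]).max()
        X[miss] = low_rank[miss]
        if delta < tol:
            break
    return X

Y_imp = svd_impute(Y_obs, rank=1)
rmse = np.sqrt(np.mean((Y_imp[mask] - Y[mask]) ** 2))
print(rmse)
```

No distributional assumption is used anywhere: the imputation relies only on the low-rank structure recovered by the SVD, which is the property the thesis exploits.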
Abstract:
IMPORTANCE Obesity is a risk factor for deep vein thrombosis of the leg and pulmonary embolism. To date, however, whether obesity is associated with adult cerebral venous thrombosis (CVT) has not been assessed. OBJECTIVE To assess whether obesity is a risk factor for CVT. DESIGN, SETTING, AND PARTICIPANTS A case-control study was performed in consecutive adult patients with CVT admitted from July 1, 2006 (Amsterdam), and October 1, 2009 (Berne), through December 31, 2014, to the Academic Medical Center in Amsterdam, the Netherlands, or Inselspital University Hospital in Berne, Switzerland. The control group was composed of individuals from the control population of the Multiple Environmental and Genetic Assessment of Risk Factors for Venous Thrombosis study, which was a large Dutch case-control study performed from March 1, 1999, to September 30, 2004, and in which risk factors for deep vein thrombosis and pulmonary embolism were assessed. Data analysis was performed from January 2 to July 12, 2015. MAIN OUTCOMES AND MEASURES Obesity was determined by body mass index (BMI). A BMI of 30 or greater was considered to indicate obesity, and a BMI of 25 to 29.99 was considered to indicate overweight. A multiple imputation procedure was used for missing data. We adjusted for sex, age, history of cancer, ethnicity, smoking status, and oral contraceptive use. Individuals with normal weight (BMI <25) were the reference category. RESULTS The study included 186 cases and 6134 controls. Cases were younger (median age, 40 vs 48 years), more often female (133 [71.5%] vs 3220 [52.5%]), more often used oral contraceptives (97 [72.9%] vs 758 [23.5%] of women), and more frequently had a history of cancer (17 [9.1%] vs 235 [3.8%]) compared with controls. Obesity (BMI ≥30) was associated with an increased risk of CVT (adjusted odds ratio [OR], 2.63; 95% CI, 1.53-4.54).
Stratification by sex revealed a strong association between CVT and obesity in women (adjusted OR, 3.50; 95% CI, 2.00-6.14) but not in men (adjusted OR, 1.16; 95% CI, 0.25-5.30). Further stratification revealed that, in women who used oral contraceptives, overweight and obesity were associated with an increased risk of CVT in a dose-dependent manner (BMI 25.0-29.9: adjusted OR, 11.87; 95% CI, 5.94-23.74; BMI ≥30: adjusted OR, 29.26; 95% CI, 13.47-63.60). No association was found in women who did not use oral contraceptives. CONCLUSIONS AND RELEVANCE Obesity is a strong risk factor for CVT in women who use oral contraceptives.
Abstract:
Over the last two decades social vulnerability has emerged as a major area of study, with increasing attention to the study of vulnerable populations. Generally, the elderly are among the most vulnerable members of any society, and widespread population aging has led to greater focus on elderly vulnerability. However, the absence of a valid and practical measure constrains the ability of policy-makers to address this issue in a comprehensive way. This study developed a composite indicator, The Elderly Social Vulnerability Index (ESVI), and used it to undertake a comparative analysis of the availability of support for elderly Jamaicans based on their access to human, material and social resources. The results of the ESVI indicated that while the elderly are more vulnerable overall, certain segments of the population appear to be at greater risk. Females had consistently lower scores than males, and the oldest-old had the highest scores of all groups of older persons. Vulnerability scores also varied according to place of residence, with more rural parishes having higher scores than their urban counterparts. These findings support the political economy framework which locates disadvantage in old age within political and ideological structures. The findings also point to the pervasiveness and persistence of gender inequality as argued by feminist theories of aging. Based on the results of the study it is clear that there is a need for policies that target specific population segments, in addition to universal policies that could make the experience of old age less challenging for the majority of older persons. Overall, the ESVI has displayed usefulness as a tool for theoretical analysis and demonstrated its potential as a policy instrument to assist decision-makers in determining where to target their efforts as they seek to address the issue of social vulnerability in old age. 
Data for this study came from the 2001 population and housing census of Jamaica, with multiple imputation for missing data. The index was derived from the linear aggregation of three equally weighted domains, composed of eleven unweighted indicators that were normalized using z-scores. Indicators were selected based on theoretical relevance and data availability.
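The construction described (z-score normalization, unweighted indicators averaged within three equally weighted domains, linear aggregation) can be sketched as follows; the domain names follow the abstract, but the indicator values are invented, not census data.

```python
import math

# Hypothetical indicator values for five individuals across three domains
# (human, material, social resources); numbers invented for illustration.
domains = {
    "human":    [[0.2, 0.5], [0.1, 0.4], [0.9, 0.8], [0.3, 0.2], [0.5, 0.6]],
    "material": [[1.0], [0.0], [0.5], [0.2], [0.8]],
    "social":   [[3.0, 2.0], [1.0, 1.0], [4.0, 3.0], [2.0, 2.0], [3.0, 1.0]],
}

def zscores(col):
    """Normalize one indicator column to mean 0, standard deviation 1."""
    mean = sum(col) / len(col)
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / sd for v in col]

# Average z-scored indicators within each domain, then aggregate the
# three equally weighted domain scores linearly.
n = 5
index = [0.0] * n
for name, rows in domains.items():
    n_ind = len(rows[0])
    cols = [zscores([rows[i][j] for i in range(n)]) for j in range(n_ind)]
    domain_score = [sum(cols[j][i] for j in range(n_ind)) / n_ind
                    for i in range(n)]
    index = [index[i] + domain_score[i] / len(domains) for i in range(n)]

print([round(v, 3) for v in index])
```

Because every indicator is z-scored, the composite scores are centered on zero by construction, so individuals can be ranked directly against the sample mean.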
Abstract:
This paper provides a method for constructing a new historical global nitrogen fertilizer application map (0.5° × 0.5° resolution) for the period 1961-2010 based on country-specific information from Food and Agriculture Organization statistics (FAOSTAT) and various global datasets. This new map incorporates the fraction of NH₄⁺ (and NO₃⁻) in N fertilizer inputs by utilizing fertilizer species information in FAOSTAT, in which species can be categorized as NH₄⁺- and/or NO₃⁻-forming N fertilizers. During data processing, we applied a statistical data imputation method for the missing data (19% of national N fertilizer consumption) in FAOSTAT. The multiple imputation method enabled us to fill gaps in the time-series data with plausible values based on covariate information (year, population, GDP, and crop area). After the imputation, we downscaled the national consumption data to a gridded cropland map. We also applied the multiple imputation method to the available chemical fertilizer species consumption, allowing for the estimation of the NH₄⁺/NO₃⁻ ratio in national fertilizer consumption. The synthetic N fertilizer inputs in 2000 showed general consistency with the existing N fertilizer map (Potter et al., 2010, doi:10.1175/2009EI288.1) in terms of the ranges of N fertilizer inputs. Globally, the estimated N fertilizer inputs based on the sum of filled data increased from 15 Tg-N to 110 Tg-N during 1961-2010. On the other hand, the global NO₃⁻ input started to decline after the late 1980s, and the fraction of NO₃⁻ in global N fertilizer decreased consistently from 35% to 13% over the 50-year period. NH₄⁺-based fertilizers are dominant in most countries; however, the NH₄⁺/NO₃⁻ ratio in N fertilizer inputs shows clear differences temporally and geographically. This new map can be utilized as input data for global model studies and can bring new insights to the assessment of historical terrestrial N cycling changes.
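A simplified sketch of covariate-based gap filling for one national consumption series follows. It uses single regression imputation on the covariates the paper lists, rather than the paper's multiple imputation (which would also propagate uncertainty across repeated draws), and all series are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical national N-fertilizer series driven by covariates
# (year index, population, GDP proxy); numbers invented for illustration.
years = np.arange(50, dtype=float)
pop = 10 + 0.3 * years + rng.normal(0, 0.2, 50)
gdp = 5 + 0.5 * years + rng.normal(0, 0.5, 50)
fert = 2.0 + 0.15 * pop + 0.10 * gdp + rng.normal(0, 0.1, 50)

# Knock out ~20% of the series, mimicking gaps in reported consumption.
miss = rng.random(50) < 0.2
fert_obs = fert.copy()
fert_obs[miss] = np.nan

# Fit a linear model on the observed years and predict the gaps.
X = np.column_stack([np.ones(50), years, pop, gdp])
obs = ~np.isnan(fert_obs)
coef, *_ = np.linalg.lstsq(X[obs], fert_obs[obs], rcond=None)
filled = fert_obs.copy()
filled[miss] = X[miss] @ coef

rmse = np.sqrt(np.mean((filled[miss] - fert[miss]) ** 2))
print(rmse)
```

In the paper's setting, repeating such fills across multiple stochastic draws would yield the plausible-value ensembles that multiple imputation provides.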
Abstract:
Continuous variables are among the major data types collected by survey organizations. They can be incomplete, so that data collectors need to fill in the missing values; or they can contain sensitive information that needs protection from re-identification. One approach to protecting continuous microdata is to sum the values within cells defined by different features. In this thesis, I present novel methods of multiple imputation (MI) that can be applied to impute missing values and to synthesize confidential values for continuous and magnitude data.
The first method is for limiting the disclosure risk of the continuous microdata whose marginal sums are fixed. The motivation for developing such a method comes from the magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution so that the marginals of the synthetic data are guaranteed to sum to the original totals. At the same time, I present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. The illustration on a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable.
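A minimal sketch of synthesis that preserves a fixed marginal total: establishment shares are drawn from a Dirichlet centered on the originals, and the fixed cell total is then allocated with a multinomial draw. This is a simplification standing in for the thesis's mixture-of-Poissons model, and the establishment values are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical magnitude microdata: shipments (non-negative integers) for
# 6 establishments in one industry cell; the cell total must be preserved.
original = np.array([120, 45, 300, 10, 80, 25])
total = original.sum()

# Draw perturbed shares, then allocate the fixed total multinomially, so
# every synthetic dataset sums exactly to the published marginal.
alpha = original + 1.0    # Dirichlet concentration (light smoothing)
m = 5                     # number of synthetic datasets
synthetic = []
for _ in range(m):
    shares = rng.dirichlet(alpha)
    synthetic.append(rng.multinomial(total, shares))

for s in synthetic:
    print(s, s.sum())
```

The multinomial allocation guarantees the published marginal sum is reproduced exactly in every synthetic replicate, which is the constraint motivating the method.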
The second method is for releasing synthetic continuous microdata by a nonstandard MI method. Traditionally, MI fits a model on the confidential values and then generates multiple synthetic datasets from this model. Its disclosure risk tends to be high, especially when the original data contain extreme values. I present a nonstandard MI approach conditioned on protective intervals. Its basic idea is to estimate the model parameters from these intervals rather than from the confidential values. The encouraging results of simple simulation studies suggest the potential of this new approach for limiting the posterior disclosure risk.
The third method is for imputing missing values in continuous and categorical variables. It extends a hierarchically coupled mixture model with local dependence. However, the new method separates the variables into non-focused (e.g., almost fully observed) and focused (e.g., frequently missing) ones. The sub-model structure of the focused variables is more complex than that of the non-focused ones. At the same time, their cluster indicators are linked together by tensor factorization, and the focused continuous variables depend locally on non-focused values. The model properties suggest that moving strongly associated non-focused variables to the side of the focused ones can help to improve estimation accuracy, which is examined in several simulation studies. This method is applied to data from the American Community Survey.
Abstract:
Snapper (Pagrus auratus) is widely distributed throughout subtropical and temperate southern oceans and forms a significant recreational and commercial fishery in Queensland, Australia. Using data from government reports, media sources, popular publications and a government fisheries survey carried out in 1910, we compiled information on individual snapper fishing trips that took place prior to the commencement of fishery-wide organized data collection, from 1871 to 1939. In addition to extracting all available quantitative data, we translated qualitative information into bounded estimates and used multiple imputation to handle missing values, forming 287 records for which catch rate (snapper fisher⁻¹ h⁻¹) could be derived. Uncertainty was handled through a parametric maximum likelihood framework (a transformed trivariate Gaussian), which facilitated statistical comparisons between data sources. No statistically significant differences in catch rates were found among media sources and the government fisheries survey. Catch rates remained stable throughout the time series, averaging 3.75 snapper fisher⁻¹ h⁻¹ (95% confidence interval, 3.42–4.09) as the fishery expanded into new grounds. In comparison, a contemporary (1993–2002) south-east Queensland charter fishery produced an average catch rate of 0.4 snapper fisher⁻¹ h⁻¹ (95% confidence interval, 0.31–0.58). These data illustrate the productivity of a fishery during its earliest years of development and represent the earliest catch rate data globally for this species. By adopting a formalized approach to address issues common to many historical records – missing data, a lack of quantitative information and reporting bias – our analysis demonstrates the potential for historical narratives to contribute to contemporary fisheries management.
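A toy sketch of imputing from bounded qualitative records: each bounded record is drawn uniformly within its interval across many imputations, and the means are pooled. This is a simplification of the paper's parametric trivariate-Gaussian likelihood, and all numbers are invented, not the historical records.

```python
import random
import statistics

random.seed(4)

# Mixed records of catch rate (snapper per fisher per hour): exact values
# plus qualitative reports translated into (low, high) bounds.
exact = [3.2, 4.1, 3.8, 2.9, 4.5]
bounded = [(2.0, 5.0), (3.0, 6.0), (1.5, 4.0)]

# Multiple imputation: fill each bounded record with a uniform draw from
# its interval, then pool the completed-data means.
m = 200
means = []
for _ in range(m):
    draws = exact + [random.uniform(lo, hi) for lo, hi in bounded]
    means.append(statistics.mean(draws))

pooled_mean = statistics.mean(means)
between_var = statistics.variance(means)  # imputation uncertainty
print(round(pooled_mean, 2), round(between_var, 4))
```

The between-imputation variance quantifies how much the qualitative bounds, rather than sampling noise, limit the precision of the pooled catch rate.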
Abstract:
To assist cattle producers in the transition from microsatellite (MS) to single nucleotide polymorphism (SNP) genotyping for parental verification, we previously devised an effective and inexpensive method to impute MS alleles from SNP haplotypes. While the reported method was verified with only a limited data set (N = 479) from Brown Swiss, Guernsey, Holstein, and Jersey cattle, some of the MS-SNP haplotype associations were concordant across these phylogenetically diverse breeds. This implied that some haplotypes predate modern breed formation and remain in strong linkage disequilibrium. To expand the utility of MS allele imputation across breeds, MS and SNP data from more than 8000 animals representing 39 breeds (Bos taurus and B. indicus) were used to predict 9410 SNP haplotypes, incorporating an average of 73 SNPs per haplotype, for which alleles from 12 MS markers could be accurately imputed. Approximately 25% of the MS-SNP haplotypes were present in multiple breeds (N = 2 to 36 breeds). These shared haplotypes allowed for MS imputation in breeds that were not represented in the reference population, with only a small increase in Mendelian inheritance inconsistencies. Our reference haplotypes can be used for any cattle breed, and the reported methods can be applied to any species to aid the transition from MS to SNP genetic markers. While ~91% of the animals with imputed alleles for 12 MS markers had ≤1 Mendelian inheritance conflicts with their parents' reported MS genotypes, this figure was 96% for our reference animals, indicating potential errors in the reported MS genotypes. The suggested workflow corrects for genotyping errors and rare haplotypes by MS genotyping animals whose imputed MS alleles fail parentage verification and then incorporating those animals into the reference dataset.
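The core imputation step, a reference lookup from SNP haplotype to MS allele followed by a Mendelian consistency check, can be sketched as follows; the haplotype strings and allele sizes are invented for illustration, not from the reference panel.

```python
# Reference table mapping SNP haplotypes to MS allele sizes (base pairs);
# entries invented for illustration.
reference = {
    "ACGTAC": 182,
    "ACGTTC": 190,
    "TGCAAC": 178,
}

def impute_ms(haplotype_pair):
    """Impute an animal's two MS alleles from its two SNP haplotypes."""
    return sorted(reference[h] for h in haplotype_pair)

calf = impute_ms(("ACGTAC", "TGCAAC"))
dam = impute_ms(("ACGTAC", "ACGTTC"))

# Mendelian check: the calf must share at least one allele with its dam;
# a failure here flags the animal for direct MS genotyping.
consistent = bool(set(calf) & set(dam))
print(calf, dam, consistent)
```

In the reported workflow, animals failing this check would be MS-genotyped directly and folded back into the reference dataset, autocorrecting rare haplotypes and genotyping errors.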
Abstract:
Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands on, and threats to, forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly, and as a consequence spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest based k Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory data collected by different sampling methods to generate forest inventory information across large spatial extents. Forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake States. Targeting small-area applications of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method called variable plot sampling in the Ford Forest of Michigan Tech to derive a standing volume map in a cost-effective way.
The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.
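A stripped-down sketch of the imputation idea: each target pixel receives the attribute of its nearest reference plot in predictor space. The study's RF-kNN measures nearness with Random Forest proximities, whereas plain Euclidean distance with k = 1 is used here for brevity, and all values are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)

# Reference field plots: remote-sensing predictors (e.g., two spectral
# bands) and a measured attribute (standing volume); values invented.
n_ref = 40
X_ref = rng.uniform(0, 1, (n_ref, 2))
volume_ref = (100 + 200 * X_ref[:, 0] - 80 * X_ref[:, 1]
              + rng.normal(0, 5, n_ref))

# Target pixels with predictors only; impute volume from the nearest
# reference plot in predictor space (k = 1 nearest-neighbor imputation).
X_target = rng.uniform(0, 1, (10, 2))
d = np.linalg.norm(X_target[:, None, :] - X_ref[None, :, :], axis=2)
nearest = d.argmin(axis=1)
volume_imputed = volume_ref[nearest]

print(volume_imputed.round(1))
```

A useful property of nearest-neighbor imputation, visible here, is that every imputed value is an observed plot value, so predictions never leave the range of the field data.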
Abstract:
In the assignment game of Shapley and Shubik [Shapley, L.S., Shubik, M., 1972. The assignment game I: The core. International Journal of Game Theory 1, 111-130], agents are allowed to form at most one partnership. That paper proves that, in the context of firms and workers, given two stable payoffs for the firms there is a stable payoff which gives each firm the larger of the two amounts, and also one which gives each of them the smaller amount. An analogous result applies to the workers. Sotomayor [Sotomayor, M., 1992. The multiple partners game. In: Majumdar, M. (Ed.), Dynamics and Equilibrium: Essays in Honor of D. Gale. Macmillan, pp. 322-336] extends this analysis to the case where both types of agents may form more than one partnership and an agent's payoff is multi-dimensional. This note instead concentrates on the total payoff of the agents. It then proves the rather unexpected result that, again, the maximum of any pair of stable payoffs for the firms is stable, but the minimum need not be, even if we restrict the multiplicity of partnerships to one of the sides.
Abstract:
The amount of biological data has grown exponentially in recent decades. Modern biotechnologies, such as microarrays and next-generation sequencing, are capable of producing massive amounts of biomedical data in a single experiment. As the amount of data rapidly grows, there is an urgent need for reliable computational methods for analyzing and visualizing it. This thesis addresses this need by studying how to efficiently and reliably analyze and visualize high-dimensional data, especially data obtained from gene expression microarray experiments. First, we study ways to improve the quality of microarray data by replacing (imputing) the missing data entries with estimated values. Missing value imputation is commonly used to make incomplete data complete, making it easier to analyze with statistical and computational methods. Our novel approach was to use curated external biological information as a guide for the missing value imputation. Secondly, we studied the effect of missing value imputation on downstream data analysis methods such as clustering. We compared multiple recent imputation algorithms on 8 publicly available microarray data sets. It was observed that missing value imputation is indeed a rational way to improve the quality of biological data. The research revealed differences between the clustering results obtained with different imputation methods. On most data sets, the simple and fast k-NN imputation was good enough, but there was also a need for more advanced imputation methods, such as Bayesian Principal Component Analysis (BPCA). Finally, we studied the visualization of biological network data. Biological interaction networks are examples of the outcome of multiple biological experiments, such as those using gene microarray techniques.
Such networks are typically very large and highly connected, so there is a need for fast algorithms that produce visually pleasing layouts. A computationally efficient way to produce layouts of large biological interaction networks was developed. The algorithm uses multilevel optimization within a standard force-directed graph layout algorithm.
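The k-NN imputation mentioned above can be sketched on simulated expression data: for a missing entry, average the values of the k genes most similar over the remaining arrays (unweighted here; KNNimpute-style methods weight neighbors by distance). The matrix is simulated with groups of co-expressed genes.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy expression matrix (50 genes x 8 arrays) with 10 groups of 5
# co-expressed genes each.
base = rng.normal(0, 1, (10, 8))
X = np.repeat(base, 5, axis=0) + rng.normal(0, 0.1, (50, 8))

# Treat one entry as missing and impute it from the k nearest genes,
# measured by Euclidean distance over the remaining arrays.
gene, array, k = 7, 3, 4
truth = X[gene, array]
others = np.delete(np.arange(50), gene)
cols = np.delete(np.arange(8), array)
dist = np.linalg.norm(X[others][:, cols] - X[gene, cols], axis=1)
neighbors = others[np.argsort(dist)[:k]]
estimate = X[neighbors, array].mean()

print(round(estimate, 3), round(truth, 3))
```

Since the nearest genes belong to the same co-expression group, the averaged estimate lands close to the hidden value, which is why plain k-NN suffices on well-structured data sets.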
Abstract:
The cerebellum is an important site for cortical demyelination in multiple sclerosis, but the functional significance of this finding is not fully understood. This study evaluated the clinical and cognitive impact of cerebellar grey-matter pathology in multiple sclerosis patients. Forty-two relapsing-remitting multiple sclerosis patients and 30 controls underwent clinical assessment, including the Multiple Sclerosis Functional Composite, Expanded Disability Status Scale (EDSS) and cerebellar functional system (FS) score, and cognitive evaluation, including the Paced Auditory Serial Addition Test (PASAT) and the Symbol-Digit Modalities Test (SDMT). Magnetic resonance imaging was performed with a 3T scanner, and the variables of interest were: brain white-matter and cortical lesion load, cerebellar intracortical and leukocortical lesion volumes, and brain cortical and cerebellar white-matter and grey-matter volumes. In multivariate analysis, a high burden of cerebellar intracortical lesions was the only predictor of the EDSS (p<0.001), cerebellar FS (p = 0.002), arm function (p = 0.049), and leg function (p<0.001). Patients with a high burden of cerebellar leukocortical lesions had lower PASAT scores (p = 0.013), while patients with greater volumes of cerebellar intracortical lesions had worse SDMT scores (p = 0.015). Cerebellar grey-matter pathology is widely present and contributes to clinical dysfunction in relapsing-remitting multiple sclerosis patients, independently of brain grey-matter damage.