949 resultados para monotone missing data


Relevância:

90.00% 90.00%

Publicador:

Resumo:

Attrition in longitudinal studies can lead to biased results. The study is motivated by the unexpected observation that alcohol consumption decreased despite increased availability, which may be due to sample attrition of heavy drinkers. Several imputation methods have been proposed, but rarely compared in longitudinal studies of alcohol consumption. The imputation of consumption level measurements is computationally particularly challenging due to alcohol consumption being a semi-continuous variable (dichotomous drinking status and continuous volume among drinkers), and the non-normality of data in the continuous part. Data come from a longitudinal study in Denmark with four waves (2003-2006) and 1771 individuals at baseline. Five techniques for missing data are compared: Last value carried forward (LVCF) was used as a single, and Hotdeck, Heckman modelling, multivariate imputation by chained equations (MICE), and a Bayesian approach as multiple imputation methods. Predictive mean matching was used to account for non-normality, where instead of imputing regression estimates, "real" observed values from similar cases are imputed. Methods were also compared by means of a simulated dataset. The simulation showed that the Bayesian approach yielded the most unbiased estimates for imputation. The finding of no increase in consumption levels despite a higher availability remained unaltered. Copyright (C) 2011 John Wiley & Sons, Ltd.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

For the last 2 decades, supertree reconstruction has been an active field of research and has seen the development of a large number of major algorithms. Because of the growing popularity of the supertree methods, it has become necessary to evaluate the performance of these algorithms to determine which are the best options (especially with regard to the supermatrix approach that is widely used). In this study, seven of the most commonly used supertree methods are investigated by using a large empirical data set (in terms of number of taxa and molecular markers) from the worldwide flowering plant family Sapindaceae. Supertree methods were evaluated using several criteria: similarity of the supertrees with the input trees, similarity between the supertrees and the total evidence tree, level of resolution of the supertree and computational time required by the algorithm. Additional analyses were also conducted on a reduced data set to test if the performance levels were affected by the heuristic searches rather than the algorithms themselves. Based on our results, two main groups of supertree methods were identified: on one hand, the matrix representation with parsimony (MRP), MinFlip, and MinCut methods performed well according to our criteria, whereas the average consensus, split fit, and most similar supertree methods showed a poorer performance or at least did not behave the same way as the total evidence tree. Results for the super distance matrix, that is, the most recent approach tested here, were promising with at least one derived method performing as well as MRP, MinFlip, and MinCut. The output of each method was only slightly improved when applied to the reduced data set, suggesting a correct behavior of the heuristic searches and a relatively low sensitivity of the algorithms to data set sizes and missing data. Results also showed that the MRP analyses could reach a high level of quality even when using a simple heuristic search strategy, with the exception of MRP with Purvis coding scheme and reversible parsimony. The future of supertrees lies in the implementation of a standardized heuristic search for all methods and the increase in computing power to handle large data sets. The latter would prove to be particularly useful for promising approaches such as the maximum quartet fit method that yet requires substantial computing power.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

BACKGROUND: Artemisinin-resistant Plasmodium falciparum has emerged in the Greater Mekong sub-region and poses a major global public health threat. Slow parasite clearance is a key clinical manifestation of reduced susceptibility to artemisinin. This study was designed to establish the baseline values for clearance in patients from Sub-Saharan African countries with uncomplicated malaria treated with artemisinin-based combination therapies (ACTs). METHODS: A literature review in PubMed was conducted in March 2013 to identify all prospective clinical trials (uncontrolled trials, controlled trials and randomized controlled trials), including ACTs conducted in Sub-Saharan Africa, between 1960 and 2012. Individual patient data from these studies were shared with the WorldWide Antimalarial Resistance Network (WWARN) and pooled using an a priori statistical analytical plan. Factors affecting early parasitological response were investigated using logistic regression with study sites fitted as a random effect. The risk of bias in included studies was evaluated based on study design, methodology and missing data. RESULTS: In total, 29,493 patients from 84 clinical trials were included in the analysis, treated with artemether-lumefantrine (n = 13,664), artesunate-amodiaquine (n = 11,337) and dihydroartemisinin-piperaquine (n = 4,492). The overall parasite clearance rate was rapid. The parasite positivity rate (PPR) decreased from 59.7 % (95 % CI: 54.5-64.9) on day 1 to 6.7 % (95 % CI: 4.8-8.7) on day 2 and 0.9 % (95 % CI: 0.5-1.2) on day 3. The 95th percentile of observed day 3 PPR was 5.3 %. Independent risk factors predictive of day 3 positivity were: high baseline parasitaemia (adjusted odds ratio (AOR) = 1.16 (95 % CI: 1.08-1.25); per 2-fold increase in parasite density, P <0.001); fever (>37.5 °C) (AOR = 1.50 (95 % CI: 1.06-2.13), P = 0.022); severe anaemia (AOR = 2.04 (95 % CI: 1.21-3.44), P = 0.008); areas of low/moderate transmission setting (AOR = 2.71 (95 % CI: 1.38-5.36), P = 0.004); and treatment with the loose formulation of artesunate-amodiaquine (AOR = 2.27 (95 % CI: 1.14-4.51), P = 0.020, compared to dihydroartemisinin-piperaquine). CONCLUSIONS: The three ACTs assessed in this analysis continue to achieve rapid early parasitological clearance across the sites assessed in Sub-Saharan Africa. A threshold of 5 % day 3 parasite positivity from a minimum sample size of 50 patients provides a more sensitive benchmark in Sub-Saharan Africa compared to the current recommended threshold of 10 % to trigger further investigation of artemisinin susceptibility.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing features. In this paper we review the problem of learning from incomplete data from two statistical perspectives---the likelihood-based and the Bayesian. The goal is two-fold: to place current neural network approaches to missing data within a statistical framework, and to describe a set of algorithms, derived from the likelihood-based framework, that handle clustering, classification, and function approximation from incomplete data in a principled and efficient manner. These algorithms are based on mixture modeling and make two distinct appeals to the Expectation-Maximization (EM) principle (Dempster, Laird, and Rubin 1977)---both for the estimation of mixture components and for coping with the missing data.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Multi-factor models constitute a useful tool to explain cross-sectional covariance in equities returns. We propose in this paper the use of irregularly spaced returns in the multi-factor model estimation and provide an empirical example with the 389 most liquid equities in the Brazilian Market. The market index shows itself significant to explain equity returns while the US$/Brazilian Real exchange rate and the Brazilian standard interest rate does not. This example shows the usefulness of the estimation method in further using the model to fill in missing values and to provide interval forecasts.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Multi-factor models constitute a use fui tool to explain cross-sectional covariance in equities retums. We propose in this paper the use of irregularly spaced returns in the multi-factor model estimation and provide an empirical example with the 389 most liquid equities in the Brazilian Market. The market index shows itself significant to explain equity returns while the US$/Brazilian Real exchange rate and the Brazilian standard interest rate does not. This example shows the usefulness of the estimation method in further using the model to fill in missing values and to provide intervaI forecasts.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper proposes a filter based on a general regression neural network and a moving average filter, for preprocessing half-hourly load data for short-term multinodal load forecasting, discussed in another paper. Tests made with half-hourly load data from nine New Zealand electrical substations demonstrate that this filter is able to handle noise, missing data and abnormal data. © 2011 IEEE.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

[EN] The information provided by the International Commission for the Conservation of Atlantic Tunas (ICCAT) on captures of skipjack tuna (Katsuwonus pelamis) in the central-east Atlantic has a number of limitations, such as gaps in the statistics for certain fleets and the level of spatiotemporal detail at which catches are reported. As a result, the quality of these data and their effectiveness for providing management advice is limited. In order to reconstruct missing spatiotemporal data of catches, the present study uses Data INterpolating Empirical Orthogonal Functions (DINEOF), a technique for missing data reconstruction, applied here for the first time to fisheries data. DINEOF is based on an Empirical Orthogonal Functions decomposition performed with a Lanczos method. DINEOF was tested with different amounts of missing data, intentionally removing values from 3.4% to 95.2% of data loss, and then compared with the same data set with no missing data. These validation analyses show that DINEOF is a reliable methodological approach of data reconstruction for the purposes of fishery management advice, even when the amount of missing data is very high.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

OBJECTIVE: To describe the electronic medical databases used in antiretroviral therapy (ART) programmes in lower-income countries and assess the measures such programmes employ to maintain and improve data quality and reduce the loss of patients to follow-up. METHODS: In 15 countries of Africa, South America and Asia, a survey was conducted from December 2006 to February 2007 on the use of electronic medical record systems in ART programmes. Patients enrolled in the sites at the time of the survey but not seen during the previous 12 months were considered lost to follow-up. The quality of the data was assessed by computing the percentage of missing key variables (age, sex, clinical stage of HIV infection, CD4+ lymphocyte count and year of ART initiation). Associations between site characteristics (such as number of staff members dedicated to data management), measures to reduce loss to follow-up (such as the presence of staff dedicated to tracing patients) and data quality and loss to follow-up were analysed using multivariate logit models. FINDINGS: Twenty-one sites that together provided ART to 50 060 patients were included (median number of patients per site: 1000; interquartile range, IQR: 72-19 320). Eighteen sites (86%) used an electronic database for medical record-keeping; 15 (83%) such sites relied on software intended for personal or small business use. The median percentage of missing data for key variables per site was 10.9% (IQR: 2.0-18.9%) and declined with training in data management (odds ratio, OR: 0.58; 95% confidence interval, CI: 0.37-0.90) and weekly hours spent by a clerk on the database per 100 patients on ART (OR: 0.95; 95% CI: 0.90-0.99). About 10 weekly hours per 100 patients on ART were required to reduce missing data for key variables to below 10%. The median percentage of patients lost to follow-up 1 year after starting ART was 8.5% (IQR: 4.2-19.7%). Strategies to reduce loss to follow-up included outreach teams, community-based organizations and checking death registry data. Implementation of all three strategies substantially reduced losses to follow-up (OR: 0.17; 95% CI: 0.15-0.20). CONCLUSION: The quality of the data collected and the retention of patients in ART treatment programmes are unsatisfactory for many sites involved in the scale-up of ART in resource-limited settings, mainly because of insufficient staff trained to manage data and trace patients lost to follow-up.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Principal Component Analysis (PCA) is a popular method for dimension reduction that can be used in many fields including data compression, image processing, exploratory data analysis, etc. However, traditional PCA method has several drawbacks, since the traditional PCA method is not efficient for dealing with high dimensional data and cannot be effectively applied to compute accurate enough principal components when handling relatively large portion of missing data. In this report, we propose to use EM-PCA method for dimension reduction of power system measurement with missing data, and provide a comparative study of traditional PCA and EM-PCA methods. Our extensive experimental results show that EM-PCA method is more effective and more accurate for dimension reduction of power system measurement data than traditional PCA method when dealing with large portion of missing data set.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Most statistical analysis, theory and practice, is concerned with static models; models with a proposed set of parameters whose values are fixed across observational units. Static models implicitly assume that the quantified relationships remain the same across the design space of the data. While this is reasonable under many circumstances this can be a dangerous assumption when dealing with sequentially ordered data. The mere passage of time always brings fresh considerations and the interrelationships among parameters, or subsets of parameters, may need to be continually revised. ^ When data are gathered sequentially dynamic interim monitoring may be useful as new subject-specific parameters are introduced with each new observational unit. Sequential imputation via dynamic hierarchical models is an efficient strategy for handling missing data and analyzing longitudinal studies. Dynamic conditional independence models offers a flexible framework that exploits the Bayesian updating scheme for capturing the evolution of both the population and individual effects over time. While static models often describe aggregate information well they often do not reflect conflicts in the information at the individual level. Dynamic models prove advantageous over static models in capturing both individual and aggregate trends. Computations for such models can be carried out via the Gibbs sampler. An application using a small sample repeated measures normally distributed growth curve data is presented. ^

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Next-generation DNA sequencing platforms can effectively detect the entire spectrum of genomic variation and is emerging to be a major tool for systematic exploration of the universe of variants and interactions in the entire genome. However, the data produced by next-generation sequencing technologies will suffer from three basic problems: sequence errors, assembly errors, and missing data. Current statistical methods for genetic analysis are well suited for detecting the association of common variants, but are less suitable to rare variants. This raises great challenge for sequence-based genetic studies of complex diseases.^ This research dissertation utilized genome continuum model as a general principle, and stochastic calculus and functional data analysis as tools for developing novel and powerful statistical methods for next generation of association studies of both qualitative and quantitative traits in the context of sequencing data, which finally lead to shifting the paradigm of association analysis from the current locus-by-locus analysis to collectively analyzing genome regions.^ In this project, the functional principal component (FPC) methods coupled with high-dimensional data reduction techniques will be used to develop novel and powerful methods for testing the associations of the entire spectrum of genetic variation within a segment of genome or a gene regardless of whether the variants are common or rare.^ The classical quantitative genetics suffer from high type I error rates and low power for rare variants. To overcome these limitations for resequencing data, this project used functional linear models with scalar response to develop statistics for identifying quantitative trait loci (QTLs) for both common and rare variants. To illustrate their applications, the functional linear models were applied to five quantitative traits in Framingham heart studies. ^ This project proposed a novel concept of gene-gene co-association in which a gene or a genomic region is taken as a unit of association analysis and used stochastic calculus to develop a unified framework for testing the association of multiple genes or genomic regions for both common and rare alleles. The proposed methods were applied to gene-gene co-association analysis of psoriasis in two independent GWAS datasets which led to discovery of networks significantly associated with psoriasis.^

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The paper investigates a Bayesian hierarchical model for the analysis of categorical longitudinal data from a large social survey of immigrants to Australia. Data for each subject are observed on three separate occasions, or waves, of the survey. One of the features of the data set is that observations for some variables are missing for at least one wave. A model for the employment status of immigrants is developed by introducing, at the first stage of a hierarchical model, a multinomial model for the response and then subsequent terms are introduced to explain wave and subject effects. To estimate the model, we use the Gibbs sampler, which allows missing data for both the response and the explanatory variables to be imputed at each iteration of the algorithm, given some appropriate prior distributions. After accounting for significant covariate effects in the model, results show that the relative probability of remaining unemployed diminished with time following arrival in Australia.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Objective: An estimation of cut-off points for the diagnosis of diabetes mellitus (DM) based on individual risk factors. Methods: A subset of the 1991 Oman National Diabetes Survey is used, including all patients with a 2h post glucose load >= 200 mg/dl (278 subjects) and a control group of 286 subjects. All subjects previously diagnosed as diabetic and all subjects with missing data values were excluded. The data set was analyzed by use of the SPSS Clementine data mining system. Decision Tree Learners (C5 and CART) and a method for mining association rules (the GRI algorithm) are used. The fasting plasma glucose (FPG), age, sex, family history of diabetes and body mass index (BMI) are input risk factors (independent variables), while diabetes onset (the 2h post glucose load >= 200 mg/dl) is the output (dependent variable). All three techniques used were tested by use of crossvalidation (89.8%). Results: Rules produced for diabetes diagnosis are: A- GRI algorithm (1) FPG>=108.9 mg/dl, (2) FPG>=107.1 and age>39.5 years. B- CART decision trees: FPG >=110.7 mg/dl. C- The C5 decision tree learner: (1) FPG>=95.5 and 54, (2) FPG>=106 and 25.2 kg/m2. (3) FPG>=106 and =133 mg/dl. The three techniques produced rules which cover a significant number of cases (82%), with confidence between 74 and 100%. Conclusion: Our approach supports the suggestion that the present cut-off value of fasting plasma glucose (126 mg/dl) for the diagnosis of diabetes mellitus needs revision, and the individual risk factors such as age and BMI should be considered in defining the new cut-off value.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Visualising data for exploratory analysis is a major challenge in many applications. Visualisation allows scientists to gain insight into the structure and distribution of the data, for example finding common patterns and relationships between samples as well as variables. Typically, visualisation methods like principal component analysis and multi-dimensional scaling are employed. These methods are favoured because of their simplicity, but they cannot cope with missing data and it is difficult to incorporate prior knowledge about properties of the variable space into the analysis; this is particularly important in the high-dimensional, sparse datasets typical in geochemistry. In this paper we show how to utilise a block-structured correlation matrix using a modification of a well known non-linear probabilistic visualisation model, the Generative Topographic Mapping (GTM), which can cope with missing data. The block structure supports direct modelling of strongly correlated variables. We show that including prior structural information it is possible to improve both the data visualisation and the model fit. These benefits are demonstrated on artificial data as well as a real geochemical dataset used for oil exploration, where the proposed modifications improved the missing data imputation results by 3 to 13%.