949 results for missing data imputation
Abstract:
We formulate performance assessment as a problem of causal analysis and outline an approach based on the missing data principle for its solution. It is particularly relevant in the context of so-called league tables for educational, health-care and other public-service institutions. The proposed solution avoids comparisons of institutions that have substantially different clientele (intake).
Abstract:
For the last two decades, supertree reconstruction has been an active field of research and has seen the development of a large number of major algorithms. Because of the growing popularity of supertree methods, it has become necessary to evaluate the performance of these algorithms to determine which are the best options (especially with regard to the widely used supermatrix approach). In this study, seven of the most commonly used supertree methods are investigated using a large empirical data set (in terms of number of taxa and molecular markers) from the worldwide flowering plant family Sapindaceae. Supertree methods were evaluated using several criteria: similarity of the supertrees with the input trees, similarity between the supertrees and the total evidence tree, level of resolution of the supertree and computational time required by the algorithm. Additional analyses were also conducted on a reduced data set to test whether the performance levels were affected by the heuristic searches rather than by the algorithms themselves. Based on our results, two main groups of supertree methods were identified: on the one hand, the matrix representation with parsimony (MRP), MinFlip, and MinCut methods performed well according to our criteria; on the other, the average consensus, split fit, and most similar supertree methods performed more poorly or at least did not behave the same way as the total evidence tree. Results for the super distance matrix, the most recent approach tested here, were promising, with at least one derived method performing as well as MRP, MinFlip, and MinCut. The output of each method was only slightly improved when applied to the reduced data set, suggesting a correct behavior of the heuristic searches and a relatively low sensitivity of the algorithms to data set size and missing data.
Results also showed that the MRP analyses could reach a high level of quality even when using a simple heuristic search strategy, with the exception of MRP with the Purvis coding scheme and reversible parsimony. The future of supertrees lies in the implementation of a standardized heuristic search for all methods and in increased computing power to handle large data sets. The latter would prove particularly useful for promising approaches such as the maximum quartet fit method, which still requires substantial computing power.
Abstract:
BACKGROUND: Artemisinin-resistant Plasmodium falciparum has emerged in the Greater Mekong sub-region and poses a major global public health threat. Slow parasite clearance is a key clinical manifestation of reduced susceptibility to artemisinin. This study was designed to establish the baseline values for clearance in patients from Sub-Saharan African countries with uncomplicated malaria treated with artemisinin-based combination therapies (ACTs). METHODS: A literature review in PubMed was conducted in March 2013 to identify all prospective clinical trials (uncontrolled trials, controlled trials and randomized controlled trials) involving ACTs conducted in Sub-Saharan Africa between 1960 and 2012. Individual patient data from these studies were shared with the WorldWide Antimalarial Resistance Network (WWARN) and pooled using an a priori statistical analytical plan. Factors affecting early parasitological response were investigated using logistic regression with study sites fitted as a random effect. The risk of bias in included studies was evaluated based on study design, methodology and missing data. RESULTS: In total, 29,493 patients from 84 clinical trials were included in the analysis, treated with artemether-lumefantrine (n = 13,664), artesunate-amodiaquine (n = 11,337) and dihydroartemisinin-piperaquine (n = 4,492). The overall parasite clearance rate was rapid. The parasite positivity rate (PPR) decreased from 59.7 % (95 % CI: 54.5-64.9) on day 1 to 6.7 % (95 % CI: 4.8-8.7) on day 2 and 0.9 % (95 % CI: 0.5-1.2) on day 3. The 95th percentile of observed day 3 PPR was 5.3 %.
Independent risk factors predictive of day 3 positivity were: high baseline parasitaemia (adjusted odds ratio (AOR) = 1.16 (95 % CI: 1.08-1.25); per 2-fold increase in parasite density, P <0.001); fever (>37.5 °C) (AOR = 1.50 (95 % CI: 1.06-2.13), P = 0.022); severe anaemia (AOR = 2.04 (95 % CI: 1.21-3.44), P = 0.008); areas of low/moderate transmission setting (AOR = 2.71 (95 % CI: 1.38-5.36), P = 0.004); and treatment with the loose formulation of artesunate-amodiaquine (AOR = 2.27 (95 % CI: 1.14-4.51), P = 0.020, compared to dihydroartemisinin-piperaquine). CONCLUSIONS: The three ACTs assessed in this analysis continue to achieve rapid early parasitological clearance across the sites assessed in Sub-Saharan Africa. A threshold of 5 % day 3 parasite positivity from a minimum sample size of 50 patients provides a more sensitive benchmark in Sub-Saharan Africa compared to the current recommended threshold of 10 % to trigger further investigation of artemisinin susceptibility.
Abstract:
Missing data are common in surveys and can lead to substantial errors in parameter estimation. This methodological master's thesis in sociology examines the influence of missing data on the estimation of the effect of a prevention program. The first two sections set out the biases that missing data can introduce and present the theoretical frameworks used to describe them. The third section covers methods for handling missing data; the classical methods are described, along with three more recent ones. The fourth section presents the Montreal Longitudinal and Experimental Study (Enquête longitudinale et expérimentale de Montréal, ELEM) and describes the data used. The fifth details the analyses carried out, comprising: the method for analysing the effect of an intervention from longitudinal data, an in-depth description of the missing data in the ELEM, and a diagnosis of the missingness patterns and mechanism. The sixth section reports the estimates of the program effect under different assumptions about the missing-data mechanism and using four methods: complete-case analysis, maximum likelihood, weighting, and multiple imputation. The results indicate (I) that the assumption of a MAR missing-data mechanism appears to influence the estimate of the program effect, and (II) that the estimates obtained by the different estimation methods lead to similar conclusions about the effect of the intervention.
Abstract:
Learning Disability (LD) is a classification covering several disorders in which a child has difficulty learning in a typical manner, usually caused by an unknown factor or factors. LD affects about 15% of children enrolled in schools. The prediction of learning disability is a complicated task, since identifying LD from diverse features or signs is itself a difficult problem. There is no cure for learning disabilities and they are life-long. The problems of children with specific learning disabilities have been a cause of concern to parents and teachers for some time. The aim of this paper is to develop a new algorithm for imputing missing values and to determine the significance of the missing value imputation method and the dimensionality reduction method in the performance of fuzzy and neuro-fuzzy classifiers, with specific emphasis on the prediction of learning disabilities in school-age children. In the basic assessment method for prediction of LD, checklists are generally used, and the cases thus collected depend heavily on the children's mood and may contain redundant as well as missing values. Therefore, in this study, we propose a new correlation-based algorithm for imputing the missing values and use Principal Component Analysis (PCA) to remove irrelevant attributes. The study found that the preprocessing methods applied improve the quality of the data and thereby increase the accuracy of the classifiers. The system is implemented in MathWorks MATLAB 7.10. The results obtained from this study illustrate that the developed missing value imputation method makes a strong contribution to the prediction system and is capable of improving the performance of a classifier.
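The abstract does not spell out the paper's own correlation-based algorithm. As an illustration of the general idea only (all names and choices here are hypothetical, not taken from the paper), a minimal sketch fills each missing entry from the most strongly correlated other column via a least-squares line fit, falling back to the column mean:

```python
import numpy as np

def correlation_impute(X):
    """Fill each missing entry (NaN) from the most correlated other column,
    via a least-squares line fit on rows where both columns are observed.
    Falls back to the column mean when no usable donor column exists."""
    X = np.asarray(X, float).copy()
    n, d = X.shape
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    for j in range(d):
        rows = np.where(miss[:, j])[0]
        if rows.size == 0:
            continue
        # Pick the donor column with the highest absolute correlation.
        best, best_r = None, 0.0
        for k in range(d):
            if k == j:
                continue
            both = ~miss[:, j] & ~miss[:, k]
            if both.sum() < 3:
                continue
            r = abs(np.corrcoef(X[both, j], X[both, k])[0, 1])
            if r > best_r:
                best, best_r = k, r
        for i in rows:
            if best is not None and not miss[i, best]:
                both = ~miss[:, j] & ~miss[:, best]
                slope, intercept = np.polyfit(X[both, best], X[both, j], 1)
                X[i, j] = slope * X[i, best] + intercept
            else:
                X[i, j] = col_mean[j]
    return X
```

On perfectly correlated columns the line fit recovers the missing value exactly; on noisy data it degrades gracefully toward a regression-based guess.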
Abstract:
Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing features. In this paper we review the problem of learning from incomplete data from two statistical perspectives---the likelihood-based and the Bayesian. The goal is two-fold: to place current neural network approaches to missing data within a statistical framework, and to describe a set of algorithms, derived from the likelihood-based framework, that handle clustering, classification, and function approximation from incomplete data in a principled and efficient manner. These algorithms are based on mixture modeling and make two distinct appeals to the Expectation-Maximization (EM) principle (Dempster, Laird, and Rubin 1977)---both for the estimation of mixture components and for coping with the missing data.
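The EM idea described above can be illustrated with its single-Gaussian special case: alternate between filling each missing block with its conditional expectation given the observed block (E-step) and re-estimating the mean and covariance from the completed data (M-step). This is a simplified sketch, not the paper's mixture algorithm, and it omits the conditional-covariance correction that full EM adds to the M-step:

```python
import numpy as np

def em_gaussian_impute(X, n_iter=50):
    """Simplified EM for one multivariate Gaussian with missing values (NaN):
    E-step fills missing coordinates with their conditional mean given the
    observed ones; M-step re-estimates the mean and covariance."""
    X = np.asarray(X, float)
    n, d = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)           # initialise with observed column means
    Xf = np.where(miss, mu, X)
    sigma = np.cov(Xf, rowvar=False)
    for _ in range(n_iter):
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # x_m | x_o ~ N(mu_m + S_mo S_oo^{-1} (x_o - mu_o), ...)
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            Xf[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, Xf[i, o] - mu[o])
        mu = Xf.mean(axis=0)
        sigma = np.cov(Xf, rowvar=False) + 1e-6 * np.eye(d)  # ridge for stability
    return Xf, mu, sigma
```

Under an MCAR mask this recovers the mean and covariance without the bias that simple mean-filling introduces into correlations.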
Abstract:
The R-package “compositions” is a tool for advanced compositional analysis. Its basic functionality has seen some conceptual improvement, now containing facilities to work with and represent ilr bases built from balances, and an elaborated subsystem for dealing with several kinds of irregular data: (rounded or structural) zeroes, incomplete observations and outliers. The general approach to these irregularities is based on subcompositions: for an irregular datum, one can distinguish a “regular” subcomposition (where all parts are actually observed and the datum behaves typically) and a “problematic” subcomposition (with those unobserved, zero or rounded parts, or else where the datum shows erratic or atypical behaviour). Systematic classification schemes are proposed for both outliers and missing values (including zeros), focusing on the nature of irregularities in the datum subcomposition(s). To compute statistics with values missing at random and structural zeros, a projection approach is implemented: a given datum contributes to the estimation of the desired parameters only on the subcomposition where it was observed. For data sets with values below the detection limit, two different approaches are provided: the well-known imputation technique, and also the projection approach. To compute statistics in the presence of outliers, robust statistics are adapted to the characteristics of compositional data, based on the minimum covariance determinant approach. The outlier classification is based on four different models of outlier occurrence and Monte-Carlo-based tests for their characterization. Furthermore, the package provides special plots helping to understand the nature of outliers in the dataset. Keywords: coda-dendrogram, lost values, MAR, missing data, MCD estimator, robustness, rounded zeros
Abstract:
Background: Dietary intervention studies suggest that flavan-3-ol intake can improve vascular function and reduce the risk of cardiovascular disease (CVD). However, results from prospective studies have failed to show a consistent beneficial effect. Objective: To investigate associations between flavan-3-ol intake and CVD risk in the Norfolk arm of the European Prospective Investigation into Cancer and Nutrition (EPIC-Norfolk). Design: Data were available from 24,885 participants (11,252 men; 13,633 women) recruited between 1993 and 1997 into the EPIC-Norfolk study. Flavan-3-ol intake was assessed using 7-day food diaries and the FLAVIOLA Flavanol Food Composition database. Missing data for plasma cholesterol and vitamin C were imputed using multiple imputation. Associations between flavan-3-ol intake and blood pressure at baseline were determined using linear regression models. Associations with CVD risk were estimated using Cox regression analyses. Results: Median intake of total flavan-3-ols was 1034 mg/d (range: 0 – 8531 mg/d) for men and 970 mg/d (0 – 6695 mg/d) for women; median intake of flavan-3-ol monomers was 233 mg/d (0 – 3248 mg/d) for men and 217 mg/d (0 – 2712 mg/d) for women. There were no consistent associations between flavan-3-ol monomer intake and baseline systolic and diastolic blood pressure (BP). After 286,147 person-years of follow-up, there were 8463 cardiovascular events and 1987 CVD-related deaths; no consistent association between flavan-3-ol intake and CVD risk (HR 0.93, 95% CI: 0.87; 1.00; Q1 vs Q5) or mortality was observed (HR 0.93, 95% CI: 0.84; 1.04). Conclusions: Flavan-3-ol intake in EPIC-Norfolk is not sufficient to achieve a statistically significant reduction in CVD risk.
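The abstract mentions multiple imputation for plasma cholesterol and vitamin C. Independently of that study's code, the standard way to combine the per-imputation analyses is Rubin's rules; a minimal sketch of the pooling step:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m point estimates and their within-imputation variances with
    Rubin's rules: total variance = within + (1 + 1/m) * between."""
    estimates = np.asarray(estimates, float)
    variances = np.asarray(variances, float)
    m = len(estimates)
    qbar = estimates.mean()               # pooled point estimate
    within = variances.mean()             # average sampling variance
    between = estimates.var(ddof=1)       # variance across imputations
    total = within + (1 + 1 / m) * between
    return qbar, total
```

The between-imputation term is what propagates the uncertainty about the missing values into the final standard error, which single imputation would understate.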
Abstract:
Multi-factor models constitute a useful tool to explain cross-sectional covariance in equity returns. We propose in this paper the use of irregularly spaced returns in multi-factor model estimation and provide an empirical example with the 389 most liquid equities in the Brazilian market. The market index proves significant in explaining equity returns, while the US$/Brazilian Real exchange rate and the Brazilian standard interest rate do not. This example shows the usefulness of the estimation method for subsequently using the model to fill in missing values and to provide interval forecasts.
Abstract:
This paper proposes a filter, based on a general regression neural network and a moving average filter, for preprocessing half-hourly load data for the short-term multinodal load forecasting discussed in another paper. Tests made with half-hourly load data from nine New Zealand electrical substations demonstrate that this filter is able to handle noise, missing data and abnormal data. © 2011 IEEE.
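The paper's filter combines a general regression neural network with a moving average; as a hypothetical sketch of only the moving-average half (not the authors' implementation), a centred window can tolerate missing samples by simply skipping NaNs:

```python
import numpy as np

def nan_moving_average(x, window=3):
    """Centred moving average that skips NaNs: each output point is the mean
    of the non-missing values in its window (NaN if the window is empty)."""
    x = np.asarray(x, float)
    n = len(x)
    half = window // 2
    out = np.empty(n)
    for i in range(n):
        w = x[max(0, i - half): i + half + 1]   # window, clipped at the edges
        w = w[~np.isnan(w)]                     # drop missing samples
        out[i] = w.mean() if w.size else np.nan
    return out
```

Because each window averages only its observed samples, isolated gaps are smoothed over rather than propagated through the whole filtered series.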
Abstract:
During the process of knowledge extraction from databases, problems such as the absence of a given attribute value can arise. This can harm the final results of the process, since it directly affects the quality of the data submitted to a machine learning algorithm. The literature offers several proposals to mitigate this damage, among them data imputation, which estimates a plausible value to replace the missing one. Within this line of solutions to the missing-value problem, several works were analysed and some observations made: synthetic datasets simulating the main missing-data mechanisms are rarely used, and there is a recent trend towards bio-inspired algorithms for treating the problem. Against this background, this dissertation presents a data imputation method based on particle swarm optimization, little explored in the area, and applies it to synthetically generated datasets covering the main missing-data mechanisms: MAR, MCAR and NMAR. The results obtained when comparing different configurations of the method with two well-known alternatives (KNNImpute and SVMImpute) are promising for its use in missing-value treatment, since it achieved the best values in most of the experiments performed.
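The three mechanisms named in the abstract (MCAR, MAR and NMAR/MNAR, in Rubin's taxonomy) can be simulated on a synthetic two-column dataset. A minimal sketch, where the specific probability functions are illustrative choices rather than the dissertation's:

```python
import numpy as np

def make_missing(X, mechanism="MCAR", rate=0.2, seed=0):
    """Return a copy of X with NaNs injected into column 1 under one of the
    three classic mechanisms:
      MCAR - missingness independent of the data,
      MAR  - deletion probability driven by the fully observed column 0,
      MNAR - deletion probability driven by the value being deleted itself."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float).copy()
    n = X.shape[0]
    if mechanism == "MCAR":
        p = np.full(n, rate)
    elif mechanism == "MAR":
        ranks = X[:, 0].argsort().argsort() / (n - 1)   # rank of column 0
        p = np.clip(2 * rate * ranks, 0, 1)
    elif mechanism == "MNAR":                           # a.k.a. NMAR
        ranks = X[:, 1].argsort().argsort() / (n - 1)   # rank of column 1 itself
        p = np.clip(2 * rate * ranks, 0, 1)
    else:
        raise ValueError(mechanism)
    X[rng.random(n) < p, 1] = np.nan
    return X
```

Benchmarks built this way let an imputation method be scored against the known ground-truth values under each mechanism separately, which is exactly where mechanism-sensitive methods diverge.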
Abstract:
[EN] The information provided by the International Commission for the Conservation of Atlantic Tunas (ICCAT) on captures of skipjack tuna (Katsuwonus pelamis) in the central-east Atlantic has a number of limitations, such as gaps in the statistics for certain fleets and the level of spatiotemporal detail at which catches are reported. As a result, the quality of these data and their effectiveness for providing management advice are limited. In order to reconstruct missing spatiotemporal catch data, the present study uses Data INterpolating Empirical Orthogonal Functions (DINEOF), a technique for missing data reconstruction applied here for the first time to fisheries data. DINEOF is based on an Empirical Orthogonal Functions decomposition performed with a Lanczos method. DINEOF was tested with different amounts of missing data, intentionally removing between 3.4% and 95.2% of the values, and the reconstructions were then compared with the same data set with no missing data. These validation analyses show that DINEOF is a reliable methodological approach to data reconstruction for the purposes of fishery management advice, even when the amount of missing data is very high.
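DINEOF's core loop can be sketched, leaving out the Lanczos solver, the cross-validated choice of the number of EOFs, and the convergence test of the real method, as an alternation between a truncated SVD of the gap-filled matrix and re-filling the gaps from the reconstruction:

```python
import numpy as np

def dineof_like(X, rank=2, n_iter=100):
    """Iterative EOF (truncated SVD) reconstruction in the spirit of DINEOF:
    fill the gaps with a first guess, then alternate a rank-k SVD of the
    filled matrix with re-filling the gaps from that reconstruction,
    keeping the observed entries fixed throughout."""
    X = np.asarray(X, float)
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmean(X), X)     # first guess: overall mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
        recon = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        Xf = np.where(miss, recon, X)         # observed values stay fixed
    return Xf
```

On data that truly are low-rank, the iteration converges to the unique low-rank completion consistent with the observed entries, which is why the validation strategy in the abstract (deleting known values and measuring the reconstruction error) is a natural fit.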