910 results for Missing Data
Abstract:
There are many situations where input feature vectors are incomplete and methods to tackle the problem have been studied for a long time. A commonly used procedure is to replace each missing value with an imputation. This paper presents a method to perform categorical missing data imputation from numerical and categorical variables. The imputations are based on Simpson’s fuzzy min-max neural networks where the input variables for learning and classification are just numerical. The proposed method extends the input to categorical variables by introducing new fuzzy sets, a new operation and a new architecture. The procedure is tested and compared with others using opinion poll data.
Abstract:
Exploratory analysis of data in all sciences seeks to find common patterns to gain insights into the structure and distribution of the data. Typically visualisation methods like principal components analysis are used but these methods are not easily able to deal with missing data nor can they capture non-linear structure in the data. One approach to discovering complex, non-linear structure in the data is through the use of linked plots, or brushing, while ignoring the missing data. In this technical report we discuss a complementary approach based on a non-linear probabilistic model. The generative topographic mapping enables the visualisation of the effects of very many variables on a single plot, which is able to incorporate far more structure than a two dimensional principal components plot could, and deal at the same time with missing data. We show that using the generative topographic mapping provides us with an optimal method to explore the data while being able to replace missing values in a dataset, particularly where a large proportion of the data is missing.
Abstract:
Exploratory analysis of data seeks to find common patterns to gain insights into the structure and distribution of the data. In geochemistry it is a valuable means to gain insights into the complicated processes making up a petroleum system. Typically, linear visualisation methods like principal components analysis, linked plots, or brushing are used. These methods cannot be employed directly when dealing with missing data, and they struggle to capture global non-linear structures in the data, although they can do so locally. This thesis discusses a complementary approach based on a non-linear probabilistic model. The generative topographic mapping (GTM) enables the visualisation of the effects of very many variables on a single plot, which is able to incorporate more structure than a two-dimensional principal components plot. The model can deal with uncertainty and missing data, and allows for the exploration of the non-linear structure in the data. In this thesis a novel approach to initialise the GTM with arbitrary projections is developed. This makes it possible to combine GTM with algorithms like Isomap and to fit complex non-linear structures like the Swiss roll. Another novel extension is the incorporation of prior knowledge about the structure of the covariance matrix. This extension greatly enhances the modelling capabilities of the algorithm, resulting in a better fit to the data and better imputation capabilities for missing data. Additionally, an extensive benchmark study of the missing data imputation capabilities of GTM is performed. Furthermore, a novel approach based on missing data is introduced to benchmark the fit of probabilistic visualisation algorithms on unlabelled data. Finally, the work is complemented by evaluating the algorithms on real-life datasets from geochemical projects.
Abstract:
Exploratory analysis of petroleum geochemical data seeks to find common patterns to help distinguish between different source rocks, oils and gases, and to explain their source, maturity and any intra-reservoir alteration. However, at the outset, one is typically faced with (a) a large matrix of samples, each with a range of molecular and isotopic properties, (b) a spatially and temporally unrepresentative sampling pattern, (c) noisy data and (d) often, a large number of missing values. This inhibits analysis using conventional statistical methods. Typically, visualisation methods like principal components analysis are used, but these methods are not easily able to deal with missing data nor can they capture non-linear structure in the data. One approach to discovering complex, non-linear structure in the data is through the use of linked plots, or brushing, while ignoring the missing data. In this paper we introduce a complementary approach based on a non-linear probabilistic model. Generative topographic mapping enables the visualisation of the effects of very many variables on a single plot, while also dealing with missing data. We show how using generative topographic mapping also provides an optimal method with which to replace missing values in two geochemical datasets, particularly where a large proportion of the data is missing.
Abstract:
One of the main challenges of classifying clinical data is determining how to handle missing features. Most research favours imputing missing values or discarding records that include missing data, both of which can degrade accuracy when missing values exceed a certain level. In this research we propose a methodology to handle data sets with a large percentage of missing values and with high variability in which particular data are missing. Feature selection is carried out by picking variables sequentially in order of maximum correlation with the dependent variable and minimum correlation with variables already selected. Classification models are generated individually for each test case based on its particular feature set and the matching data values available in the training population. The method was applied to real patients' anonymous mental-health data where the task was to predict the suicide risk judgement clinicians would give for each patient's data, with eleven possible outcome classes: zero to ten, representing no risk to maximum risk. The results compare favourably with alternative methods and have the advantage of ensuring explanations of risk are based only on the data given, not imputed data. This is important for clinical decision support systems using human expertise for modelling and explaining predictions.
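The sequential selection rule described in this abstract can be sketched as a greedy loop. This is only an illustration under stated assumptions: the abstract does not say how the two correlation criteria are combined, so the score used here (absolute correlation with the target minus mean absolute correlation with the features already chosen) is an assumption, as are the names `pearson` and `select_features` and the use of `None` to mark a missing value. Correlations are computed over the pairs where both values are present.

```python
import math

def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy / math.sqrt(sxx * syy)

def select_features(columns, target, k):
    """Greedily pick up to k feature names.

    Assumed score: relevance to the target minus mean redundancy
    with the features already selected.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in columns.items()}
    selected = []
    while len(selected) < min(k, len(columns)):
        best_name, best_score = None, -math.inf
        for name, col in columns.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(abs(pearson(col, columns[s])) for s in selected) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected

# Illustrative data only (not from the study): "b" is nearly a copy of "a".
columns = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 5, 7],
    "c": [1, -1, 1, -1, 1, -1],
}
target = [3, 0, 5, 2, 7, 4]
print(select_features(columns, target, 2))
```

Because the score penalises redundancy, a feature that largely duplicates one already chosen is ranked below a weaker but more independent feature.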
Abstract:
The multivariate normal distribution is commonly encountered in many fields, and missing values are a frequent issue in practice. The purpose of this research was to estimate the parameters of a three-dimensional normal distribution with permutation-symmetric covariance, given complete data and all possible patterns of incomplete data. In this study, maximum likelihood estimators (MLEs) with missing data were derived, and the properties of the MLEs as well as their sampling distributions were obtained. A Monte Carlo simulation study was used to evaluate the performance of the considered estimators for both cases, when ρ was known and when it was unknown. All results indicated that the estimators derived in this article performed better than estimators obtained by omitting observations with missing data. Furthermore, when ρ was unknown, using the estimate of ρ led to the same conclusion.
Abstract:
Standardised time series of fishery catch rates require collations of fishing power data on vessel characteristics. Linear mixed models were used to quantify fishing power trends and study the effect of missing data encountered when relying on commercial logbooks. For this, Australian eastern king prawn (Melicertus plebejus) harvests were analysed with historical (from vessel surveys) and current (from commercial logbooks) vessel data. Between 1989 and 2010, fishing power increased up to 76%. To date, both forward-filling and, alternatively, omitting records with missing vessel information from commercial logbooks produce broadly similar fishing power increases and standardised catch rates, due to the strong influence of years with complete vessel data (16 out of 23 years of data). However, if gaps in vessel information had not originated randomly and skippers from the most efficient vessels were the most diligent at filling in logbooks, considerable errors would be introduced. Also, the buffering effect of complete years would be short lived as years with missing data accumulate. Given ongoing changes in fleet profile with high-catching vessels fishing proportionately more of the fleet’s effort, compliance with logbook completion, or alternatively ongoing vessel gear surveys, is required for generating accurate estimates of fishing power and standardised catch rates.
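The two treatments of missing vessel information compared in this abstract, forward-filling and omitting records, can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the `engine_kw` field and the sample records are all hypothetical, and the records are assumed to belong to one vessel and to be sorted by year.

```python
def forward_fill(records, key):
    """Return copies of records with a missing `key` replaced by the last observed value."""
    filled, last = [], None
    for rec in records:
        value = rec.get(key)
        if value is None:
            value = last  # stays None if nothing has been observed yet
        else:
            last = value
        filled.append({**rec, key: value})
    return filled

def omit_missing(records, key):
    """Return only the records in which `key` was actually reported."""
    return [rec for rec in records if rec.get(key) is not None]

# Hypothetical logbook entries for a single vessel, sorted by year.
logbook = [
    {"year": 2008, "engine_kw": 300},
    {"year": 2009, "engine_kw": None},
    {"year": 2010, "engine_kw": None},
]
print(forward_fill(logbook, "engine_kw"))
print(omit_missing(logbook, "engine_kw"))
```

Forward-filling keeps every year but assumes the vessel did not change after its last report, while omission keeps only verified values at the cost of fewer records. Neither corrects for gaps that are not random, which is the risk the abstract raises.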
Abstract:
This paper proposes a probabilistic principal component analysis (PCA) approach to islanding detection based on wide-area PMU data. The increasing probability of uncontrolled islanding operation is, according to many power system operators, one of the biggest concerns with a large penetration of distributed renewable generation. The traditional islanding detection methods, such as RoCoF and vector shift, are however extremely sensitive and may result in many unwanted trips. The proposed probabilistic PCA aims to improve islanding detection accuracy and reduce the risk of unwanted tripping based on PMU measurements, while addressing the practical issue of missing data. The reliability and accuracy of the proposed probabilistic PCA approach are demonstrated using real data recorded in the UK power system by the OpenPMU project. The results show that the proposed method can detect islanding accurately, without being falsely triggered by generation trips, even in the presence of missing values.
Abstract:
OBJECTIVES: The purpose of the study was to provide empirical evidence about the reporting of methodology to address missing outcome data and the acknowledgement of their impact in Cochrane systematic reviews in the mental health field. METHODS: Systematic reviews published in the Cochrane Database of Systematic Reviews after January 1, 2009 by three Cochrane Review Groups relating to mental health were included. RESULTS: One hundred ninety systematic reviews were considered. Missing outcome data were present in at least one included study in 175 systematic reviews. Of these 175 systematic reviews, 147 (84%) accounted for missing outcome data by considering a relevant primary or secondary outcome (e.g., dropout). Missing outcome data implications were reported in only 61 (35%) systematic reviews, primarily in the discussion section by commenting on the amount of missing outcome data. One hundred forty eligible meta-analyses with missing data were scrutinized. Seventy-nine (56%) of them had studies with a total dropout rate between 10 and 30%. One hundred nine (78%) meta-analyses reported having performed intention-to-treat analysis by including trials with imputed outcome data. Sensitivity analysis for incomplete outcome data was implemented in less than 20% of the meta-analyses. CONCLUSIONS: Reporting of the techniques for handling missing outcome data, and of their implications for the findings of the systematic reviews, is suboptimal.
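The percentages quoted in this abstract follow from the reported counts, with denominators of 175 reviews and 140 meta-analyses. A quick check, using a hypothetical helper `pct` that is not part of the study:

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)

# Counts taken from the abstract above.
reviews_with_missing = 175
meta_analyses = 140

print(pct(147, reviews_with_missing))  # reviews accounting for missing outcome data
print(pct(61, reviews_with_missing))   # reviews reporting implications
print(pct(79, meta_analyses))          # meta-analyses with 10-30% total dropout
print(pct(109, meta_analyses))         # meta-analyses reporting intention-to-treat analysis
```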
Abstract:
Missing outcome data are common in clinical trials and despite a well-designed study protocol, some of the randomized participants may leave the trial early without providing any or all of the data, or may be excluded after randomization. Premature discontinuation causes loss of information, potentially resulting in attrition bias leading to problems during interpretation of trial findings. The causes of information loss in a trial, known as mechanisms of missingness, may influence the credibility of the trial results. Analysis of trials with missing outcome data should ideally be handled with intention to treat (ITT) rather than per protocol (PP) analysis. However, true ITT analysis requires appropriate assumptions and imputation of missing data. Using a worked example from a published dental study, we highlight the key issues associated with missing outcome data in clinical trials, describe the most recognized approaches to handling missing outcome data, and explain the principles of ITT and PP analysis.
Abstract:
The purpose of this study is to investigate the effects of predictor variable correlations and patterns of missingness with dichotomous and/or continuous data in small samples when missing data is multiply imputed. Missing data of predictor variables is multiply imputed under three different multivariate models: the multivariate normal model for continuous data, the multinomial model for dichotomous data and the general location model for mixed dichotomous and continuous data. Subsequent to the multiple imputation process, Type I error rates of the regression coefficients obtained with logistic regression analysis are estimated under various conditions of correlation structure, sample size, type of data and patterns of missing data. The distributional properties of average mean, variance and correlations among the predictor variables are assessed after the multiple imputation process.
For continuous predictor data under the multivariate normal model, Type I error rates are generally within the nominal values with samples of size n = 100. Smaller samples of size n = 50 resulted in more conservative estimates (i.e., lower than the nominal value). Correlation and variance estimates of the original data are retained after multiple imputation with less than 50% missing continuous predictor data. For dichotomous predictor data under the multinomial model, Type I error rates are generally conservative, which in part is due to the sparseness of the data. The correlation structure for the predictor variables is not well retained on multiply-imputed data from small samples with more than 50% missing data with this model. For mixed continuous and dichotomous predictor data, the results are similar to those found under the multivariate normal model for continuous data and under the multinomial model for dichotomous data.
With all data types, a fully-observed variable included with variables subject to missingness in the multiple imputation process and subsequent statistical analysis provided liberal (larger than nominal values) Type I error rates under a specific pattern of missing data. It is suggested that future studies focus on the effects of multiple imputation in multivariate settings with more realistic data characteristics and a variety of multivariate analyses, assessing both Type I error and power.
Abstract:
This paper argues for a renewed focus on statistical reasoning in the elementary school years, with opportunities for children to engage in data modeling. Data modeling involves investigations of meaningful phenomena, deciding what is worthy of attention, and then progressing to organizing, structuring, visualizing, and representing data. Reported here are some findings from a two-part activity (Baxter Brown’s Picnic and Planning a Picnic) implemented at the end of the second year of a current three-year longitudinal study (grade levels 1-3). Planning a Picnic was also implemented in a grade 7 class to provide an opportunity for the different age groups to share their products. Addressed here are the grade 2 children’s predictions for missing data in Baxter Brown’s Picnic, the questions posed and representations created by both grade levels in Planning a Picnic, and the metarepresentational competence displayed in the grade levels’ sharing of their products for Planning a Picnic.
Abstract:
Road surface skid resistance has been shown to have a strong relationship to road crash risk. However, applying the current method of using investigatory levels to identify crash-prone roads is problematic, as they may fail to identify risky roads outside of the norm. The proposed method analyses a complex and formerly impenetrable volume of data from roads and crashes using data mining. This method rapidly identifies roads with an elevated crash rate, potentially due to skid resistance deficit, for investigation. A hypothetical skid resistance/crash risk curve is developed for each road segment, driven by the model deployed in a novel regression tree extrapolation method. The method potentially solves the problem of missing skid resistance values, which occurs during network-wide crash analysis, and allows risk assessment of the major proportion of roads without skid resistance values.
Abstract:
NeEstimator v2 is a completely revised and updated implementation of software that produces estimates of contemporary effective population size, using several different methods and a single input file. NeEstimator v2 includes three single-sample estimators (updated versions of the linkage disequilibrium and heterozygote-excess methods, and a new method based on molecular coancestry), as well as the two-sample (moment-based temporal) method. New features include the following: (i) an improved method for accounting for missing data; (ii) options for screening out rare alleles; (iii) confidence intervals for all methods; (iv) the ability to analyse data sets with large numbers of genetic markers (10000 or more); (v) options for batch processing large numbers of different data sets, which will facilitate cross-method comparisons using simulated data; and (vi) correction for temporal estimates when individuals sampled are not removed from the population (Plan I sampling). The user is given considerable control over input data and composition, and format of output files. The freely available software has a new JAVA interface and runs under MacOS, Linux and Windows.