47 results for missing data imputation

in Deakin Research Online - Australia


Relevance: 100.00%

Abstract:

Missing data imputation is a key issue in learning from incomplete data. Various techniques have been developed, with great success, for dealing with missing values in data sets with homogeneous attributes (their independent attributes are all either continuous or discrete). This paper studies a new setting of missing data imputation: imputing missing data in data sets with heterogeneous attributes (their independent attributes are of different types), referred to as imputing mixed-attribute data sets. Although many real applications fall into this setting, no estimator has been designed for imputing mixed-attribute data sets. This paper first proposes two consistent estimators for discrete and continuous missing target values, respectively. A mixture-kernel-based iterative estimator is then advocated for imputing mixed-attribute data sets. The proposed method is evaluated in extensive experiments against several typical algorithms, and the results demonstrate that it outperforms the existing imputation methods in terms of classification accuracy and root mean square error (RMSE) at different missing ratios.
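
The abstract does not give the estimators' closed forms; the sketch below only illustrates the general idea of kernel-based imputation over mixed attributes, combining a Gaussian kernel on continuous attributes with a matching kernel on discrete ones. The kernel choice, the bandwidth h, the matching parameter lam, and all function names are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np

def mixed_kernel_weights(x, X, cont_idx, disc_idx, h=1.0, lam=0.9):
    """Similarity of query row x to each complete row in X: a Gaussian
    kernel on continuous columns times a simple matching kernel on
    discrete columns (an illustrative mixture, not the paper's form)."""
    # Gaussian kernel over the continuous attributes
    d2 = ((X[:, cont_idx] - x[cont_idx]) ** 2).sum(axis=1)
    k_cont = np.exp(-d2 / (2.0 * h ** 2))
    # Matching kernel over the (numerically coded) discrete attributes:
    # weight lam where values agree, 1 - lam where they differ
    match = X[:, disc_idx] == x[disc_idx]
    k_disc = np.where(match, lam, 1.0 - lam).prod(axis=1)
    return k_cont * k_disc

def impute_value(x, X, y, cont_idx, disc_idx, target_is_discrete):
    """Impute the missing target of row x from the complete rows (X, y)."""
    w = mixed_kernel_weights(x, X, cont_idx, disc_idx)
    if target_is_discrete:
        # Kernel-weighted vote over the observed discrete target values
        classes = np.unique(y)
        scores = [w[y == c].sum() for c in classes]
        return classes[int(np.argmax(scores))]
    # Nadaraya-Watson style weighted mean for a continuous target
    return float(w @ y / (w.sum() + 1e-12))
```

An iterative variant, as the abstract describes, would repeatedly re-impute using previously filled-in values until the estimates stabilise.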

Relevance: 100.00%

Abstract:

The workshop will first provide an overview of the problems associated with missing data in the context of clinical trials and how to minimise them. Missing data will be explored by modelling their impact on a number of datasets. This approach will be invaluable in highlighting how alternative methods for controlling for missing data impact differentially on the interpretation of study findings. Popular strategies involve options based on an assessment of the percentage of missing data. More innovative approaches to the management of missing data (e.g. based upon reliability analyses) will be explored and evaluated, and the role of the most popular data management methods will be examined in several study designs beyond the classic randomised controlled trial. Participants will have the opportunity to appraise and debate existing methods of missing data handling.

Relevance: 100.00%

Abstract:

Segmentation of individual actions from a stream of human motion is an open problem in computer vision. This paper approaches the problem of segmenting higher-level activities into their component sub-actions using hidden Markov models modified to handle missing data in the observation vector. By controlling the use of missing data, action labels can be inferred from the observation vector during inference, thus performing segmentation and classification simultaneously. The approach is able to segment both prominent and subtle actions, even when subtle actions are grouped together. The advantage of this method over sliding windows and Viterbi state-sequence interrogation is that segmentation is performed as a trainable task, and the temporal relationship between actions is encoded in the model and used as evidence for action labelling.
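
The paper's exact emission model is not given in the abstract, but the core trick, marginalising missing observation dimensions out of the emission likelihood, can be sketched for an HMM with diagonal-Gaussian emissions: a missing dimension integrates to one and simply drops out of the product. All names and the scaling scheme below are illustrative assumptions.

```python
import numpy as np

def forward_missing(obs, mask, pi, A, means, stds):
    """Scaled forward pass of an HMM with diagonal-Gaussian emissions
    where some observation dimensions are missing (mask == False).
    A missing dimension integrates out of the Gaussian, so its
    likelihood factor is skipped. Illustrative sketch only."""
    n_states, T = len(pi), len(obs)
    alpha = np.zeros((T, n_states))

    def emission(t, s):
        m = mask[t]  # boolean vector: True where the dimension is observed
        if not m.any():
            return 1.0  # a fully missing observation carries no evidence
        z = (obs[t, m] - means[s, m]) / stds[s, m]
        return np.exp(-0.5 * z @ z) / np.prod(stds[s, m] * np.sqrt(2 * np.pi))

    for s in range(n_states):
        alpha[0, s] = pi[s] * emission(0, s)
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        for s in range(n_states):
            alpha[t, s] = (alpha[t - 1] @ A[:, s]) * emission(t, s)
        alpha[t] /= alpha[t].sum()  # rescale to avoid underflow
    return alpha
```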

Relevance: 100.00%

Abstract:

In this work we analyze the key issue of the relationship that should hold between the operators in a family {An} of aggregation operators in order for them to properly define a consistent whole. We extend some of the ideas about stability of a family of aggregation operators into a more general framework, formally defining the notions of i-L and j-R strict stability for families of aggregation operators. The notion of strict stability of order k is introduced as well. Finally, we present an application of the strict stability conditions to missing data problems in an information aggregation process. For this analysis, we focus on the weighted mean family and the quasi-arithmetic weighted mean families.
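
The abstract does not spell the construction out, but one standard way stability conditions interact with missing data can be sketched for the weighted mean family: under a self-identity property of the form A_{n+1}(x_1, ..., x_n, A_n(x_1, ..., x_n)) = A_n(x_1, ..., x_n), replacing a missing input with the aggregate of the observed inputs leaves the overall aggregate unchanged. The weights and function names below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def weighted_mean(x, w):
    """Ordinary weighted mean (quasi-arithmetic mean with identity generator)."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return float((w * x).sum() / w.sum())

def aggregate_with_missing(x, w):
    """Aggregate a list in which some entries are None (missing).
    Missing entries are imputed with the aggregate of the observed ones;
    by the self-identity (stability) behaviour of the weighted mean
    family, the result equals the mean of the observed values alone."""
    observed = [(xi, wi) for xi, wi in zip(x, w) if xi is not None]
    xs, ws = zip(*observed)
    fill = weighted_mean(xs, ws)          # impute with the partial aggregate
    full = [xi if xi is not None else fill for xi in x]
    return weighted_mean(full, w)

# The imputed aggregate coincides with the aggregate of the observed values:
print(aggregate_with_missing([2.0, None, 4.0], [0.25, 0.5, 0.25]))  # 3.0
```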

Relevance: 100.00%

Abstract:

Background: Missing data is a common phenomenon in survey-based research; patterns of missing data may elucidate why participants decline to answer certain questions. Objective: To describe patterns of missing data in the Pediatric Quality of Life and Evaluation of Symptoms Technology (PediQUEST) study, and to highlight challenges in asking sensitive research questions. Design: Cross-sectional, survey-based study embedded within a randomized controlled trial. Setting: Three large children's hospitals: Dana-Farber/Boston Children's Cancer and Blood Disorders Center (DF/BCCDC); Children's Hospital of Philadelphia (CHOP); and Seattle Children's Hospital (SCH). Measurements: At the time of their child's enrollment, parents completed the Survey about Caring for Children with Cancer (SCCC), including demographics, perceptions of prognosis, treatment goals, quality of life, and psychological distress. Results: Eighty-six of 104 parents completed surveys (83% response). The proportion of missing data varied by question type. While 14 parents (16%) left demographic fields blank, over half (n=48; 56%) declined to answer at least one question about their child's prognosis, especially life expectancy. The presence of missing data was unrelated to the child's diagnosis, time from progression, time to death, or parent distress (p>0.3 for each). Written explanations in survey margins suggested that addressing a child's life expectancy is particularly challenging for parents. Conclusions and Relevance: Parents of children with cancer commonly refrain from answering questions about their child's prognosis; however, they may be more likely to address general cure likelihood than explicit life expectancy. Understanding the acceptability of sensitive questions in survey-based research will foster higher-quality palliative care research.

Relevance: 100.00%

Abstract:

One common drawback of algorithms for learning linear causal models is that they cannot deal with incomplete data sets. This is unfortunate, since many real problems involve missing data or even hidden variables. In this paper, based on multiple imputation, we propose a three-step process for learning linear causal models from incomplete data sets. Experimental results indicate that this algorithm is better than the single-imputation method (EM algorithm) and the simple list-deletion method, and at lower missing rates it can even find models better than the results from the greedy learning algorithm MLGS working on the complete data set. In addition, the method is amenable to parallel or distributed processing, which is an important characteristic for data mining in large data sets.
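
The abstract describes a three-step process (multiple imputation, per-data-set learning, combination) without implementation detail. A minimal sketch using scikit-learn's chained-equation imputer might look like the following; `fit_model` and `combine` are hypothetical placeholders for the structure-learning and pooling steps, which the abstract does not specify.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiple_impute(X, m=5):
    """Step 1: draw m completed data sets from an incomplete matrix X
    (np.nan marks missing cells) via chained-equation imputation."""
    return [
        IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
        for i in range(m)
    ]

def learn_from_incomplete(X, fit_model, combine, m=5):
    """Steps 2-3: fit a linear causal model on each completed data set,
    then pool the m learned models into one. Both callables are
    placeholders, not the paper's actual algorithms."""
    models = [fit_model(Xc) for Xc in multiple_impute(X, m)]
    return combine(models)
```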

Relevance: 100.00%

Abstract:

BACKGROUND: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study.

METHODS: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression to identify key biomarkers associated with depression in the National Health and Nutrition Examination Survey (2009-2010). Depression was measured using the Patient Health Questionnaire-9, and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators.

RESULTS: After the creation of 20 imputed data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and between total bilirubin and current smokers (p < 0.001).

CONCLUSION: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.
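
As a rough illustration of the three-stage pipeline (multiple imputation, boosted-model variable screening, conventional logistic regression), a sketch follows. It substitutes scikit-learn's gradient boosting for the boosted regression used in the study and omits the survey weighting and confounder adjustment; all names are assumptions, and m = 20 and top_k = 21 simply mirror the counts reported above.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def screen_biomarkers(X, y, names, m=20, top_k=21):
    """Stages 1-2: multiply impute the biomarker matrix, then rank
    variables by average boosted-tree importance across imputations."""
    importances = np.zeros(X.shape[1])
    for i in range(m):
        Xc = IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
        gbm = GradientBoostingClassifier(random_state=i).fit(Xc, y)
        importances += gbm.feature_importances_
    ranked = np.argsort(importances)[::-1][:top_k]
    return [names[j] for j in ranked]

def final_model(Xc, y, selected_cols):
    """Stage 3: conventional logistic regression on the screened
    variables (confounders and survey weights omitted in this sketch)."""
    return LogisticRegression(max_iter=1000).fit(Xc[:, selected_cols], y)
```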

Relevance: 90.00%

Abstract:

This paper describes the integration of missing observation data with hidden Markov models to create a framework that is able to segment and classify individual actions from a stream of human motion using an incomplete 3D human pose estimation. Based on this framework, a model is trained to automatically segment and classify an activity sequence into its constituent subactions during inference. This is achieved by introducing action labels into the observation vector and setting these labels as missing data during inference, thus forcing the system to infer the probability of each action label. Additionally, missing data provide recognition-level support for occlusions and imperfect silhouette segmentation, permitting the use of a fast (real-time) pose estimator that delegates the burden of handling undetected limbs to the action recognition system. Findings show that the use of missing data to segment activities is an accurate and elegant approach. Furthermore, action recognition can be accurate even when almost half of the pose feature data is missing due to occlusions, since not all of the pose data is important all of the time.

Relevance: 90.00%

Abstract:

Researchers strive to optimize data quality in order to ensure that study findings are valid and reliable. In this paper, we describe a data quality control program designed to maximize quality of survey data collected using computer-assisted personal interviews. The quality control program comprised three phases: (1) software development, (2) an interviewer quality control protocol, and (3) a data cleaning and processing protocol. To illustrate the value of the program, we assess its use in the Translating Research in Elder Care Study. We utilize data collected annually for two years from computer-assisted personal interviews with 3004 healthcare aides. Data quality was assessed using both survey and process data. Missing data and data errors were minimal. Mean and median values and standard deviations were within acceptable limits. Process data indicated that in only 3.4% and 4.0% of cases was the interviewer unable to conduct interviews in accordance with the details of the program. Interviewers’ perceptions of interview quality also significantly improved between Years 1 and 2. While this data quality control program was demanding in terms of time and resources, we found that the benefits clearly outweighed the effort required to achieve high-quality data.

Relevance: 90.00%

Abstract:

The exponential increase in data, computing power and the availability of readily accessible analytical software has allowed organisations around the world to leverage the benefits of integrating multiple heterogeneous data files for enterprise-level planning and decision making. Benefits of effective data integration to the health and medical research community include more trustworthy research, higher service quality, improved personnel efficiency, reduction of redundant tasks, facilitation of auditing, and more timely, relevant and specific information. The costs of poor-quality processes include an elevated risk of erroneous outcomes and an erosion of confidence in the data and in the organisations using these data. To date there is no documented set of standards for best-practice integration of heterogeneous data files for research purposes. The aim of this paper is therefore to describe a clear protocol for data file integration (Data Integration Protocol In Ten-steps; DIPIT) that is translatable to any field of research.

Relevance: 90.00%

Abstract:

The spectrum nature of and heterogeneity within autism spectrum disorders (ASD) pose a challenge for treatment. Personalisation of the syllabus for children with ASD can improve the efficacy of learning by adjusting the number of opportunities and deciding the course of the syllabus. We investigate a data-driven approach in an attempt to disentangle this heterogeneity for syllabus personalisation. With the help of technology and a structured syllabus, collecting data while a child with ASD masters skills is made possible. The performance data collected are, however, growing and contain missing elements that depend on the pace and the course each child takes while navigating through the syllabus. Bayesian nonparametric methods are known for automatically discovering the number of latent components and their parameters when the model involves higher complexity. We propose a nonparametric Bayesian matrix factorisation model that discovers learning patterns and the way participants associate with them. Our model is built upon the linear Poisson gamma model (LPGM) with an Indian buffet process prior and extended to incorporate data with missing elements. In this paper, for the first time, we present learning patterns deduced automatically by data mining and machine learning methods from intervention data recorded for over 500 children with ASD. We compare the results with non-negative matrix factorisation and K-means, which, being parametric, not only require the number of learning patterns to be specified in advance but also lack a principled approach to dealing with missing data. The F1 score observed over varying degrees of the similarity measure (Jaccard index) suggests that LPGM yields the best outcome. By examining these patterns alongside additional knowledge of the syllabus, it may be possible to observe progress and dynamically modify the syllabus for improved learning.
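
The paper's LPGM with an Indian buffet process prior is considerably richer than anything shown here, but the basic mechanism of factorising a child-by-skill performance matrix while excluding missing cells can be illustrated with masked multiplicative NMF updates. This is a parametric stand-in (the rank k is fixed in advance), shown only to make concrete how a missing-entry mask enters the updates; all names and defaults are illustrative.

```python
import numpy as np

def masked_nmf(V, mask, k=5, iters=200, eps=1e-9):
    """Factorise V ~= W @ H using only the observed entries (mask == 1).
    Masked multiplicative updates for the KL objective; missing cells
    (which may hold NaN in V) are zeroed out and excluded from every
    update, so they exert no influence on the learned patterns."""
    rng = np.random.default_rng(0)
    mask = mask.astype(float)          # 1.0 = observed, 0.0 = missing
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    Vm = np.where(mask > 0, V, 0.0)    # zero out missing cells
    for _ in range(iters):
        WH = W @ H + eps
        W *= ((mask * Vm / WH) @ H.T) / (mask @ H.T + eps)
        WH = W @ H + eps
        H *= (W.T @ (mask * Vm / WH)) / (W.T @ mask + eps)
    return W, H  # rows of H: learning patterns; W: each child's loadings
```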