557 resultados para Missing Data
Resumo:
This paper studies the missing covariate problem which is often encountered in survival analysis. Three covariate imputation methods are employed in the study, and the effectiveness of each method is evaluated within the hazard prediction framework. Data from a typical engineering asset is used in the case study. Covariate values in some time steps are deliberately discarded to generate an incomplete covariate set. It is found that although the mean imputation method is simpler than others for solving missing covariate problems, the results calculated by it can differ largely from the real values of the missing covariates. This study also shows that in general, results obtained from the regression method are more accurate than those of the mean imputation method but at the cost of a higher computational expensive. Gaussian Mixture Model (GMM) method is found to be the most effective method within these three in terms of both computation efficiency and predication accuracy.
Resumo:
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is also proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Experiments based on several real-world data collections demonstrate that WebPut outperforms existing approaches.
Resumo:
This thesis takes a new data mining approach for analyzing road/crash data by developing models for the whole road network and generating a crash risk profile. Roads with an elevated crash risk due to road surface friction deficit are identified. The regression tree model, predicting road segment crash rate, is applied in a novel deployment coined regression tree extrapolation that produces a skid resistance/crash rate curve. Using extrapolation allows the method to be applied across the network and cope with the high proportion of missing road surface friction values. This risk profiling method can be applied in other domains.
Resumo:
Recently, vision-based systems have been deployed in professional sports to track the ball and players to enhance analysis of matches. Due to their unobtrusive nature, vision-based approaches are preferred to wearable sensors (e.g. GPS or RFID sensors) as it does not require players or balls to be instrumented prior to matches. Unfortunately, in continuous team sports where players need to be tracked continuously over long-periods of time (e.g. 35 minutes in field-hockey or 45 minutes in soccer), current vision-based tracking approaches are not reliable enough to provide fully automatic solutions. As such, human intervention is required to fix-up missed or false detections. However, in instances where a human can not intervene due to the sheer amount of data being generated - this data can not be used due to the missing/noisy data. In this paper, we investigate two representations based on raw player detections (and not tracking) which are immune to missed and false detections. Specifically, we show that both team occupancy maps and centroids can be used to detect team activities, while the occupancy maps can be used to retrieve specific team activities. An evaluation on over 8 hours of field hockey data captured at a recent international tournament demonstrates the validity of the proposed approach.
Resumo:
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques.
Resumo:
Purpose Potential positive associations between youth physical activity and wellness scores could emphasize the value of youth physical activity engagement and promotion interventions, beyond the many established physiological and psychological benefits of increased physical activity. The purpose of this study was to explore the associations between adolescents' self-reported physical activity and wellness. Methods This investigation included 493 adolescents (165 males and 328 females) aged between 12 and 15 years. The participants were recruited from six secondary schools of varying socioeconomic status within a metropolitan area. Students were administered the Five-Factor Wellness Inventory and the International Physical Activity Questionnaire for Adolescents to assess both wellness and physical activity, respectively. Results Data indicated that significant associations between physical activity and wellness existed. Self-reported physical activity was shown to be positively associated with four dimensions including friendship, gender identity, spirituality, and exercise—the higher order factor physical self and total wellness, and negatively associated with self-care, self-worth, love, and cultural identity. Conclusion This study suggests that relationships exist between self-reported physical activity and various elements of wellness. Future research should use controlled trials of physical activity and wellness to establish causal links among youth populations. Understanding the nature of these relationships, including causality, has implications for the justification of youth physical activity promotion interventions and the development of youth physical activity engagement programs.
Resumo:
This paper proposes a method, based on polychotomous discrete choice methods, to impute a continuous measure of income when only a bracketed measure of income is available and for only a subset of the obsevations. The method is shown to perform well with CP5 data. © 1991.
Resumo:
When crystallization screening is conducted many outcomes are observed but typically the only trial recorded in the literature is the condition that yielded the crystal(s) used for subsequent diffraction studies. The initial hit that was optimized and the results of all the other trials are lost. These missing results contain information that would be useful for an improved general understanding of crystallization. This paper provides a report of a crystallization data exchange (XDX) workshop organized by several international large-scale crystallization screening laboratories to discuss how this information may be captured and utilized. A group that administers a significant fraction of the worlds crystallization screening results was convened, together with chemical and structural data informaticians and computational scientists who specialize in creating and analysing large disparate data sets. The development of a crystallization ontology for the crystallization community was proposed. This paper (by the attendees of the workshop) provides the thoughts and rationale leading to this conclusion. This is brought to the attention of the wider audience of crystallographers so that they are aware of these early efforts and can contribute to the process going forward. © 2012 International Union of Crystallography All rights reserved.
Resumo:
Due to their unobtrusive nature, vision-based approaches to tracking sports players have been preferred over wearable sensors as they do not require the players to be instrumented for each match. Unfortunately however, due to the heavy occlusion between players, variation in resolution and pose, in addition to fluctuating illumination conditions, tracking players continuously is still an unsolved vision problem. For tasks like clustering and retrieval, having noisy data (i.e. missing and false player detections) is problematic as it generates discontinuities in the input data stream. One method of circumventing this issue is to use an occupancy map, where the field is discretised into a series of zones and a count of player detections in each zone is obtained. A series of frames can then be concatenated to represent a set-play or example of team behaviour. A problem with this approach though is that the compressibility is low (i.e. the variability in the feature space is incredibly high). In this paper, we propose the use of a bilinear spatiotemporal basis model using a role representation to clean-up the noisy detections which operates in a low-dimensional space. To evaluate our approach, we used a fully instrumented field-hockey pitch with 8 fixed high-definition (HD) cameras and evaluated our approach on approximately 200,000 frames of data from a state-of-the-art real-time player detector and compare it to manually labeled data.
Resumo:
Water quality data are often collected at different sites over time to improve water quality management. Water quality data usually exhibit the following characteristics: non-normal distribution, presence of outliers, missing values, values below detection limits (censored), and serial dependence. It is essential to apply appropriate statistical methodology when analyzing water quality data to draw valid conclusions and hence provide useful advice in water management. In this chapter, we will provide and demonstrate various statistical tools for analyzing such water quality data, and will also introduce how to use a statistical software R to analyze water quality data by various statistical methods. A dataset collected from the Susquehanna River Basin will be used to demonstrate various statistical methods provided in this chapter. The dataset can be downloaded from website http://www.srbc.net/programs/CBP/nutrientprogram.htm.