986 results for "Missing values"


Relevance: 60.00%

Abstract:

The R-package "compositions" is a tool for advanced compositional analysis. Its basic functionality has seen some conceptual improvement, now containing facilities to work with and represent ilr bases built from balances, and an elaborated subsystem for dealing with several kinds of irregular data: (rounded or structural) zeros, incomplete observations and outliers. The general approach to these irregularities is based on subcompositions: for an irregular datum, one can distinguish a "regular" subcomposition (where all parts are actually observed and the datum behaves typically) and a "problematic" subcomposition (with those unobserved, zero or rounded parts, or else where the datum shows an erratic or atypical behaviour). Systematic classification schemes are proposed for both outliers and missing values (including zeros), focusing on the nature of the irregularities in the datum's subcomposition(s). To compute statistics with values missing at random and structural zeros, a projection approach is implemented: a given datum contributes to the estimation of the desired parameters only on the subcomposition where it was observed. For data sets with values below the detection limit, two different approaches are provided: the well-known imputation technique, and also the projection approach. To compute statistics in the presence of outliers, robust statistics are adapted to the characteristics of compositional data, based on the minimum covariance determinant approach. The outlier classification is based on four different models of outlier occurrence and Monte-Carlo-based tests for their characterization. Furthermore, the package provides special plots that help to understand the nature of the outliers in the dataset.
Keywords: coda-dendrogram, lost values, MAR, missing data, MCD estimator, robustness, rounded zeros
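
A minimal sketch (in Python rather than with the "compositions" package itself) of the projection idea described above, assuming a samples-by-parts array with NaN marking unobserved parts: the log-ratio variation between two parts is estimated only from the rows in which both parts were observed, so each datum contributes only on its observed subcomposition. The array, the missingness rate and all names are illustrative.

    import numpy as np

    def pairwise_logratio_variance(X, i, j):
        """Estimate var(log(x_i / x_j)) using only rows where both parts are observed.

        X is a (n_samples, n_parts) array of strictly positive compositional parts,
        with np.nan marking unobserved (missing) parts.
        """
        xi, xj = X[:, i], X[:, j]
        observed = ~np.isnan(xi) & ~np.isnan(xj)      # rows contributing to this subcomposition
        if observed.sum() < 2:
            return np.nan                              # not enough complete pairs
        return np.var(np.log(xi[observed] / xj[observed]), ddof=1)

    # Illustrative 4-part composition with scattered missing entries.
    rng = np.random.default_rng(0)
    X = rng.lognormal(size=(50, 4))
    X[rng.random(X.shape) < 0.2] = np.nan              # knock out roughly 20% of the parts
    T = np.array([[pairwise_logratio_variance(X, i, j) for j in range(4)] for i in range(4)])
    print(np.round(T, 3))                              # variation matrix, estimated pairwise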

Relevance: 60.00%

Abstract:

Generally, classifiers tend to overfit if there is noise in the training data or there are missing values. Ensemble learning methods are often used to improve a classifier's classification accuracy. Most ensemble learning approaches aim to improve the classification accuracy of decision trees. However, alternative classifiers to decision trees exist. The recently developed Random Prism ensemble learner for classification aims to improve an alternative classification rule induction approach, the Prism family of algorithms, which addresses some of the limitations of decision trees. However, Random Prism suffers, like any ensemble learner, from a high computational overhead due to replication of the data and the induction of multiple base classifiers. Hence, even modest-sized datasets may impose a computational challenge for ensemble learners such as Random Prism. Parallelism is often used to scale up algorithms to deal with large datasets. This paper investigates parallelisation for Random Prism, implements a prototype and evaluates it empirically using a Hadoop computing cluster.
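
Random Prism itself is not sketched here; the code below only illustrates the general pattern the paper parallelises, namely bagging-style replication of the training data and independent induction of base classifiers distributed across workers, with joblib processes standing in for a Hadoop cluster and decision trees standing in for Prism base learners. Dataset, ensemble size and names are illustrative.

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    def train_base_learner(X, y, seed):
        # Draw a bootstrap sample and induce one base classifier on it.
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(X), size=len(X))    # bagging: sample with replacement
        return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])

    X, y = load_breast_cancer(return_X_y=True)
    # Each base learner is independent of the others, so induction parallelises trivially.
    models = Parallel(n_jobs=-1)(delayed(train_base_learner)(X, y, s) for s in range(25))
    votes = np.stack([m.predict(X) for m in models])  # majority vote over the ensemble
    y_hat = (votes.mean(axis=0) > 0.5).astype(int)
    print("training accuracy of the ensemble:", (y_hat == y).mean())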

Relevance: 60.00%

Abstract:

The substitution of missing values, also called imputation, is an important data preparation task for many domains. Ideally, the substitution of missing values should not insert biases into the dataset. This aspect has usually been assessed by measures of the prediction capability of imputation methods. Such measures rely on simulating missing entries for some attributes whose values are actually known; these artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of imputed values on the ultimate modelling task (e.g. classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task, and thus alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used this procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) on three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The results achieved illustrate a variety of situations that can take place in data preparation practice.
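
A hedged sketch of the kind of procedure argued for above: artificially delete values from an attribute whose true values are known, impute them, and compare the downstream classifier's cross-validated accuracy on the original versus the imputed data. The dataset, the mean imputer and the missingness rate are illustrative; the paper's own procedure and estimators may differ.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(42)

    # Simulate missingness in one attribute whose true values are known.
    X_missing = X.copy()
    mask = rng.random(len(X)) < 0.3
    X_missing[mask, 2] = np.nan

    # Impute, then evaluate in the modelling task itself, not only against the true values.
    X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)
    clf = DecisionTreeClassifier(random_state=0)
    acc_original = cross_val_score(clf, X, y, cv=5).mean()
    acc_imputed = cross_val_score(clf, X_imputed, y, cv=5).mean()
    print(f"accuracy on original data: {acc_original:.3f}, after imputation: {acc_imputed:.3f}")
    # The gap between the two scores is one practical estimate of the inserted bias.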

Relevance: 60.00%

Abstract:

The paper analyses empirical performance data from five commercial PV plants in Germany. The purpose was, on the one hand, to investigate the weak-light performance of the different PV modules used and, on the other hand, to quantify and compare the shading losses of different PV-array configurations. The importance of this study lies in the fact that even though the behaviour under weak-light conditions or the shading losses might seem to be a relatively small percentage of the total yearly output, in projects where a performance guarantee is given these variations can make the difference between meeting the guaranteed conditions or not. When analysing the data, a high dispersion was found. To reduce the optical losses and spectral effects, a series of data filters was applied based on the angle of incidence and absolute air mass. To compensate for temperature effects and translate the values to STC (25°C), five different methods were assessed. In the end, Procedure 2 of IEC 60891 was selected due to its relative simplicity, its use of mostly standard parameters found in datasheets, its good accuracy even with missing values, and its potential to improve the results when the complete set of inputs is available. After analysing the data, the weak-light performance of the modules did not show a clear superiority of a certain technology or technology group over the others. Moreover, the uncertainties in the measurements restrict the conclusiveness of the results. In the partial-shading analysis, the landscape mounting of mc-Si PV modules in free-field showed a significantly better performance than the portrait one. The cross-table string using CIGS modules did not prove the expected benefits and actually performed worse than a regular one-string-per-table layout. Parallel substrings with CdTe showed proper functioning and relatively low losses. Of the two product generations of CdTe analysed, neither showed a significantly better performance under partial shading.
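
The correction equations of IEC 60891 Procedure 2 are not reproduced here. As a rough illustration of the preprocessing the abstract describes, the sketch below filters records by angle of incidence and air mass and applies a plain power temperature-coefficient translation to 25°C; all thresholds, column names and the coefficient gamma are assumptions, not values from the paper or the standard.

    import pandas as pd

    def filter_and_translate(df, aoi_max=60.0, am_max=3.0, gamma=-0.004):
        """Filter operating data and translate measured power to a 25 deg C cell temperature.

        df is assumed to have columns 'aoi' (angle of incidence, deg), 'air_mass',
        'p_meas' (W) and 't_cell' (deg C); the thresholds and the power temperature
        coefficient gamma (1/K) are illustrative, not values from the paper.
        """
        ok = (df["aoi"] <= aoi_max) & (df["air_mass"] <= am_max)   # limit optical/spectral losses
        out = df.loc[ok].copy()
        # Simple temperature translation: P_25 = P_meas / (1 + gamma * (T_cell - 25)).
        out["p_25"] = out["p_meas"] / (1.0 + gamma * (out["t_cell"] - 25.0))
        return out

    sample = pd.DataFrame({"aoi": [30.0, 70.0], "air_mass": [1.5, 2.0],
                           "p_meas": [180.0, 120.0], "t_cell": [45.0, 40.0]})
    print(filter_and_translate(sample))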

Relevance: 60.00%

Abstract:

Different data classification algorithms have been developed and applied in various areas to analyse and extract valuable information and patterns from large datasets with noise and missing values. However, none of them can consistently perform well over all datasets. To this end, ensemble methods have been suggested as promising alternatives. This paper proposes a novel hybrid algorithm that combines a multi-objective Genetic Algorithm (GA) and an ensemble classifier. While the ensemble classifier, which consists of a decision tree classifier, an Artificial Neural Network (ANN) classifier, and a Support Vector Machine (SVM) classifier, is used as the classification committee, the multi-objective Genetic Algorithm is employed as the feature selector to help the ensemble classifier improve the overall sample classification accuracy while also identifying the most important features in the dataset of interest. The proposed GA-Ensemble method is tested on three benchmark datasets and compared with each individual classifier as well as with methods based on mutual information theory, bagging and boosting. The results suggest that this GA-Ensemble method outperforms the other algorithms in the comparison and is a useful method for classification and feature selection problems.
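
A much-simplified, single-objective stand-in for the GA-Ensemble idea: binary chromosomes encode feature subsets, and a chromosome's fitness is the cross-validated accuracy of a voting committee of a decision tree, a neural network and an SVM on the selected features. Population size, number of generations, mutation rate and the benchmark dataset are all illustrative, and the paper's multi-objective machinery is not reproduced.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import VotingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(1)
    n_features = X.shape[1]

    committee = VotingClassifier([
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("ann", MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)),
        ("svm", SVC(random_state=0)),
    ])

    def fitness(mask):
        # Cross-validated accuracy of the committee on the selected feature subset.
        if mask.sum() == 0:
            return 0.0
        return cross_val_score(committee, X[:, mask.astype(bool)], y, cv=3).mean()

    # A tiny generational GA over binary feature masks (sizes are illustrative).
    pop = rng.integers(0, 2, size=(8, n_features))
    for generation in range(3):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[-4:]]                  # keep the 4 fittest masks
        children = []
        while len(children) < len(pop):
            a, b = parents[rng.integers(0, len(parents), size=2)]
            cut = int(rng.integers(1, n_features))
            child = np.concatenate([a[:cut], b[cut:]])          # one-point crossover
            flip = rng.random(n_features) < 0.05                # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        pop = np.array(children)

    best = max(pop, key=fitness)
    print("selected features:", np.flatnonzero(best), "cv accuracy:", round(fitness(best), 3))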

Relevance: 60.00%

Abstract:

Background
Medical and biological data commonly have small sample sizes, missing values and, most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and then combined with the minority class to form a balanced dataset.

Results
One important finding of this study is that different classifiers and metrics often provide different evaluation results. Nevertheless, the proposed hybrid system demonstrates consistent improvements over several alternative methods with three different metrics. The sampling results also demonstrate good generalization on different types of classification algorithms, indicating the advantage of information fusion applied in the hybrid system.

Conclusion
The experimental results demonstrate that, unlike many currently available methods, which often perform unevenly across different datasets, the proposed hybrid system has a better generalization property which alleviates the method-data dependency problem. From the biological perspective, the system provides an indication for further investigation of the highly ranked samples, which may result in the discovery of new conditions or disease subtypes.
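
The PSO-driven, multi-objective sample selection is not reproduced here; the sketch below only illustrates the underlying idea of ranking majority-class samples by some merit score (here the cross-validated confidence of a logistic regression stands in for the fused classifier and metric evaluations) and combining the top-ranked samples with the minority class into a balanced dataset. The synthetic data and the ranking rule are illustrative.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Imbalanced two-class problem (roughly 9:1), standing in for a medical/biological dataset.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    majority, minority = np.flatnonzero(y == 0), np.flatnonzero(y == 1)

    # Score each majority sample by how confidently a classifier assigns it to its own class;
    # here the low-confidence (borderline) samples are kept as the most informative ones.
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba")
    merit = proba[majority, 0]                         # P(class 0) for each majority sample
    keep = majority[np.argsort(merit)[: len(minority)]]

    balanced_idx = np.concatenate([keep, minority])
    X_bal, y_bal = X[balanced_idx], y[balanced_idx]
    print("class counts after balancing:", np.bincount(y_bal))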

Relevance: 60.00%

Abstract:

Objective To develop and evaluate the effectiveness of a community behavioural intervention to prevent weight gain and improve health related behaviours in women with young children.
Design Cluster randomised controlled trial.
Setting A community setting in urban Australia. 
Participants 250 adult women with a mean age of 40.39 years (SD 4.77, range 25-51) and a mean body mass index of 27.82 kg/m2 (SD 5.42, range 18-47) were recruited as clusters through 12 primary (elementary) schools.
Intervention Schools were randomly assigned to the intervention or the control. Mothers whose schools fell in the intervention group (n=127) attended four interactive group sessions that involved simple health messages, behaviour change strategies, and group discussion, and received monthly support using mobile telephone text messages for 12 months. The control group (n=123) attended one non-interactive information session based on population dietary and physical activity guidelines.
Main outcome measures The main outcome measures were weight change and difference in weight change between the intervention group and the control group at 12 months. Secondary outcomes were changes in serum concentrations of fasting lipids and glucose, and changes in dietary behaviours, physical activity, and self management behaviours.
Results All analyses were adjusted for baseline values and the possible clustering effect. Women in the control group gained weight over the 12 month study period (0.83 kg, 95% confidence interval (CI) 0.12 to 1.54), whereas those in the intervention group lost weight (−0.20 kg, −0.90 to 0.49). The difference in weight change between the intervention group and the control group at 12 months was −1.13 kg (−2.03 to −0.24 kg; P<0.05) on the basis of observed values and −1.11 kg (−2.17 to −0.04) after multiple imputation to account for possible bias created by missing values. Secondary analyses after multiple imputation showed a difference in the intervention group compared with the control group for total cholesterol concentration (−0.35 mmol/l, −0.70 to −0.001), self management behaviours (diet score 0.18, 0.13 to 0.33; physical activity score 0.24, 0.05 to 0.43), and confidence to control weight (0.40, 0.11 to 0.69). Regular self weighing was associated with weight loss in the intervention group only (−1.98 kg, −3.75 to −0.23).
Conclusions Weight gain in women with young children could be prevented using a low intensity self management intervention delivered in a community setting. Self management of health behaviours improved with the intervention. The response rate of 12%, although comparable with that in other community studies, might limit the ability to generalise to other populations.    
Trial registration Australian New Zealand Clinical Trials Registry number ACTRN12608000110381.

Relevance: 60.00%

Abstract:

As each user tends to rate only a small proportion of the available items, the resulting data sparsity issue brings significant challenges to the research of recommender systems. This issue becomes even more severe for neighborhood-based collaborative filtering methods, as even fewer ratings are available in the neighborhood of the query item. In this paper, we aim to address the data sparsity issue in the context of neighborhood-based collaborative filtering. Given the (user, item) query, a set of key ratings is identified, and an auto-adaptive imputation method is proposed to fill the missing values in the set of key ratings. The proposed method can be used with any similarity metric, such as the Pearson Correlation Coefficient and cosine-based similarity, and it is theoretically guaranteed to outperform the neighborhood-based collaborative filtering approaches. Experimental results show that the proposed method can significantly improve the accuracy of recommendations for neighborhood-based collaborative filtering algorithms.
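
A hedged sketch of imputation-assisted neighborhood collaborative filtering rather than the paper's auto-adaptive method: missing ratings are filled with item means before the Pearson similarities and the weighted prediction are computed, so that sparse neighborhoods can still contribute. The rating matrix and the choice of k are illustrative.

    import numpy as np

    def predict_rating(R, user, item, k=2):
        """Neighborhood CF prediction with mean-imputed missing ratings (illustrative).

        R is a (users x items) rating matrix with np.nan for unrated entries.
        Missing entries are filled with the corresponding item means before the
        Pearson similarities are computed, so sparse neighborhoods still contribute.
        """
        item_means = np.nanmean(R, axis=0)
        R_filled = np.where(np.isnan(R), item_means, R)          # simple imputation step
        target = R_filled[user]
        sims = np.array([np.corrcoef(target, R_filled[u])[0, 1] if u != user else -np.inf
                         for u in range(R.shape[0])])
        neighbours = np.argsort(sims)[-k:]                        # k most similar users
        weights = sims[neighbours]
        return float(np.dot(weights, R_filled[neighbours, item]) / np.abs(weights).sum())

    R = np.array([[5, 3, np.nan, 1],
                  [4, np.nan, np.nan, 1],
                  [1, 1, np.nan, 5],
                  [np.nan, 1, 5, 4]], dtype=float)
    print(round(predict_rating(R, user=0, item=2), 2))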

Relevance: 60.00%

Abstract:

Background: This study describes and compares health-related quality of life (HRQOL) of prostate cancer patients who received either radical prostatectomy (nerve-sparing, nsRP, or non-nerve-sparing, nnsRP) or radiotherapy (external RT, brachytherapy, or both combined) for treatment of localised prostate cancer. Methods: The prospective, multicenter cohort study included 529 patients. Questionnaires included the IIEF, QLQ-C30, and PORPUS-P. Data were collected before (baseline) and three, six, twelve, and twenty-four months after treatment. Differences between groups' baseline characteristics were assessed; changes over time were analysed with generalised estimating equations (GEE). Missing values were treated with multiple imputation. Further, scores at baseline and end of follow-up were compared to German reference data. Results: The typical time trend was a decrease of average HRQOL three months after treatment followed by (partial) recovery. RP patients experienced considerable impairment in sexual functioning. The covariate-adjusted GEE identified a significant, but not clinically relevant, treatment effect for diarrhoea (b = 7.0 for RT, p = 0.006) and PORPUS-P (b = 2.3 for nsRP, b = 2.2 for RT, p = 0.045) compared to the reference nnsRP. Most of the HRQOL scores were comparable to German norm values. Conclusions: Findings from previous research were reproduced in a specific setting of a patient cohort in the German health care system. According to the principle of evidence-based medicine, this strengthens the messages regarding treatment in prostate cancer and its impact on patients' health-related quality of life. After adjustment for baseline HRQOL and other covariates, RT patients reported increased symptoms of diarrhoea, and nnsRP patients decreased prostate-specific HRQOL. RP patients experienced considerable impairment in sexual functioning. These differences should be taken into account by physicians when choosing the best therapy for a patient.

Relevance: 60.00%

Abstract:

The bulk of existing work on the statistical forecasting of air quality is based on either neural networks or linear regressions, which are both subject to important drawbacks. In particular, while neural networks are complicated and prone to in-sample overfitting, linear regressions are highly dependent on the specification of the regression function. The present paper shows how combining linear regression forecasts can be used to circumvent all of these problems. The usefulness of the proposed combination approach is verified using both Monte Carlo simulation and an extensive application to air quality in Bogota, one of the largest and most polluted cities in Latin America.
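
A minimal sketch of forecast combination with linear regressions, using synthetic data in place of the Bogota air-quality series: several regressions with different specifications are fitted to the same series and their out-of-sample forecasts are averaged with equal weights. The paper's combination scheme may weight the individual forecasts differently; the variables and sample split here are illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 300
    temp, wind = rng.normal(size=n), rng.normal(size=n)
    pm = 50 + 3 * temp - 2 * wind + rng.normal(scale=5, size=n)   # synthetic pollutant series

    # Candidate specifications of the regression function.
    specs = [np.column_stack([temp]), np.column_stack([wind]), np.column_stack([temp, wind])]
    train, test = slice(0, 250), slice(250, n)

    forecasts = []
    for X in specs:
        model = LinearRegression().fit(X[train], pm[train])
        forecasts.append(model.predict(X[test]))
    combined = np.mean(forecasts, axis=0)                         # equal-weight combination

    for name, f in zip(["temp only", "wind only", "both", "combined"], forecasts + [combined]):
        print(f"{name:>9}: RMSE = {np.sqrt(np.mean((f - pm[test]) ** 2)):.2f}")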

Relevance: 60.00%

Abstract:

Electronic Medical Records (EMR) are increasingly used for risk prediction. EMR analysis is complicated by missing entries, for two reasons: the "primary reason for admission" is coded in the EMR, but the co-morbidities (other chronic diseases) are often left uncoded; and many zero values in the data are accurate, reflecting that a patient has not accessed medical facilities. A key challenge is to deal with the peculiarities of this data: unlike many other datasets, EMR data are sparse, reflecting the fact that patients have some, but not all, diseases. We propose a novel model to fill in these missing values, and use the new representation for prediction of key hospital events. To fill in missing values, we represent the feature-patient matrix as a product of two low-rank factors, preserving the sparsity property in the product. Intuitively, the product regularization allows sparse imputation of patient conditions, reflecting common co-morbidities across patients. We develop a scalable optimization algorithm based on a block coordinate descent method to find an optimal solution. We evaluate the proposed framework on two real-world EMR cohorts: Cancer (7,000 admissions) and Acute Myocardial Infarction (2,652 admissions). Our results show that the AUC for 3-month admission prediction improves significantly, from 0.741 to 0.786 for the Cancer data and from 0.678 to 0.724 for the AMI data. We also extend the proposed method to a supervised model for predicting multiple related risk outcomes (e.g. emergency presentations and admissions to hospital over 3-, 6- and 12-month periods) in an integrated framework. For this model, the AUC averaged over outcomes improves significantly, from 0.768 to 0.806 for the Cancer data and from 0.685 to 0.748 for the AMI data.
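
A simplified sketch of the fill-in step, assuming a small synthetic feature-patient matrix: the matrix is approximated by a low-rank product whose factors are updated alternately by ridge-regularised least squares over the observed entries, a basic block coordinate descent scheme standing in for the paper's sparsity-preserving formulation. All sizes and hyperparameters are illustrative.

    import numpy as np

    def low_rank_impute(M, rank=5, reg=0.1, iters=20, seed=0):
        """Fill missing entries of M (marked np.nan) with a rank-`rank` product U @ V.T.

        U and V are updated alternately by ridge-regularised least squares fitted
        only on the observed entries (a simple block coordinate descent scheme).
        """
        rng = np.random.default_rng(seed)
        n, d = M.shape
        observed = ~np.isnan(M)
        U, V = rng.normal(scale=0.1, size=(n, rank)), rng.normal(scale=0.1, size=(d, rank))
        eye = reg * np.eye(rank)
        for _ in range(iters):
            for i in range(n):                                  # update patient factors
                j = observed[i]
                U[i] = np.linalg.solve(V[j].T @ V[j] + eye, V[j].T @ M[i, j])
            for k in range(d):                                  # update feature factors
                i = observed[:, k]
                V[k] = np.linalg.solve(U[i].T @ U[i] + eye, U[i].T @ M[i, k])
        return U @ V.T

    rng = np.random.default_rng(1)
    true = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20))  # genuinely low-rank matrix
    M = true.copy()
    M[rng.random(M.shape) < 0.4] = np.nan                        # 40% missing entries
    completed = low_rank_impute(M, rank=3)
    print("RMSE on missing entries:", round(np.sqrt(np.mean((completed - true)[np.isnan(M)] ** 2)), 3))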

Relevance: 60.00%

Abstract:

In this paper, we tackle the incompleteness of user rating histories in the context of collaborative filtering for Top-N recommendations. Previous research ignores the fact that two rating patterns exist in the user × item rating matrix and influence each other. More importantly, their interaction characterizes the development of each other, which can consequently be exploited to improve the modelling of rating patterns, especially when the user × item rating matrix is highly incomplete due to the well-known data sparsity issue. This paper proposes a Rating Pattern Subspace that iteratively re-optimizes the missing values in each user's rating history by modelling both the global and the personal rating patterns simultaneously. The basic idea is to project the user × item rating matrix onto a low-rank subspace to capture the global rating patterns. Then, the projection of each individual user onto the subspace is further optimized according to his/her own rating history and the captured global rating patterns. Finally, the optimized user projections are used to improve the modelling of the global rating patterns. Based on this subspace, we propose a RapSVD-L algorithm for Top-N recommendations. In the experiments, the performance of the proposed method is compared with state-of-the-art Top-N recommendation methods on two real datasets under various data sparsity levels. The experimental results show that RapSVD-L outperforms the compared algorithms not only on all-item recommendations but also on long-tail item recommendations in terms of accuracy.
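
A hedged sketch of the subspace idea rather than the RapSVD-L algorithm itself: a truncated SVD of a mean-filled rating matrix supplies the global rating patterns, and each user's projection onto that subspace is then refit by least squares on that user's observed ratings only, yielding personalised estimates of the missing entries. The rating matrix, rank and names are illustrative.

    import numpy as np

    def subspace_estimates(R, rank=2):
        """Estimate missing ratings by projecting each user onto a global low-rank subspace.

        R is a (users x items) matrix with np.nan for missing ratings.
        """
        observed = ~np.isnan(R)
        item_means = np.nanmean(R, axis=0)
        R_filled = np.where(observed, R, item_means)
        # Global rating patterns: leading right singular vectors of the filled matrix.
        _, _, Vt = np.linalg.svd(R_filled, full_matrices=False)
        basis = Vt[:rank]                                        # (rank x items) subspace basis
        estimates = R_filled.copy()
        for u in range(R.shape[0]):
            obs = observed[u]
            # Personal projection: least-squares fit on this user's observed ratings only.
            coef, *_ = np.linalg.lstsq(basis[:, obs].T, R[u, obs], rcond=None)
            estimates[u] = coef @ basis
        return np.where(observed, R, estimates)

    R = np.array([[5, 4, np.nan, 1],
                  [4, np.nan, 2, 1],
                  [1, 2, 5, np.nan],
                  [2, 1, np.nan, 5]], dtype=float)
    print(np.round(subspace_estimates(R), 2))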

Relevance: 60.00%

Abstract:

This study aims to assess the efficiency of the Brazilian stock market by means of statistical tests, followed by modelling of the stock return series using ARMA, ARCH, GARCH and decomposition models and, finally, VAR. Intraday data were collected for this study; such high-frequency data are less susceptible to possible changes in market structure, both micro- and macroeconomic. Data sampled every five minutes were chosen because of the low liquidity of the assets in the financial market (which could lead to missing values for shorter time intervals). The series chosen were Petrobrás PN, Gerdau PN, Bradesco PN, Vale do Rio Doce PN and the Ibovespa index, which are highly representative of the Brazilian stock market over the period analysed. Based on the Dickey-Fuller test, evidence was found that the Brazilian stock market may be efficient, and models were then proposed for the return series of the stocks mentioned above.
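
A small illustration (with synthetic data standing in for the five-minute return series, using statsmodels rather than any tool named in the study) of the kind of unit-root testing and ARMA modelling described above; the AR coefficient, noise scale and series length are arbitrary.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.tsa.stattools import adfuller

    # Synthetic "five-minute" return series with mild AR(1) dependence.
    rng = np.random.default_rng(0)
    eps = rng.normal(scale=0.01, size=1000)
    r = np.empty(1000)
    r[0] = eps[0]
    for t in range(1, 1000):
        r[t] = 0.2 * r[t - 1] + eps[t]
    returns = pd.Series(r)

    # Augmented Dickey-Fuller test: a small p-value rejects a unit root in the returns.
    adf_stat, p_value, *_ = adfuller(returns)
    print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.4f}")

    # ARMA(1,1) model of the return series (ARIMA with d = 0).
    fit = ARIMA(returns, order=(1, 0, 1)).fit()
    print(fit.summary().tables[1])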

Relevance: 60.00%

Abstract:

Multi-factor models constitute a useful tool to explain the cross-sectional covariance of equity returns. In this paper we propose the use of irregularly spaced returns in the estimation of the multi-factor model and provide an empirical example with the 389 most liquid equities in the Brazilian market. The market index proves significant in explaining equity returns, while the US$/Brazilian Real exchange rate and the Brazilian standard interest rate do not. This example shows the usefulness of the estimation method and of further using the model to fill in missing values and to provide interval forecasts.
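
A minimal sketch of the fill-in use mentioned above, assuming regularly spaced synthetic returns and a single market factor (the exchange-rate and interest-rate factors are omitted): a factor regression is fitted on the days with observed equity returns, and its predictions fill in the days on which the equity did not trade. The paper's actual estimator handles irregularly spaced returns, which this sketch does not.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(3)
    market = rng.normal(scale=0.01, size=250)                     # market index returns
    equity = 0.9 * market + rng.normal(scale=0.005, size=250)     # one equity's returns

    # Mark some equity returns as missing (e.g. days without trades for an illiquid stock).
    missing = rng.random(250) < 0.2
    observed = ~missing

    model = LinearRegression().fit(market[observed].reshape(-1, 1), equity[observed])
    filled = equity.copy()
    filled[missing] = model.predict(market[missing].reshape(-1, 1))   # factor-model fill-in
    print("estimated beta:", round(model.coef_[0], 3), "| filled", missing.sum(), "missing returns")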

Relevance: 60.00%

Abstract:

This paper presents new methodology for making Bayesian inference about dynamic models for exponential family observations. The approach is simulation-based and makes use of Markov chain Monte Carlo techniques. A Metropolis-Hastings algorithm is combined with the Gibbs sampler in repeated use of an adjusted version of normal dynamic linear models. Different alternative schemes are derived and compared. The approach is fully Bayesian in obtaining posterior samples for state parameters and unknown hyperparameters. Illustrations with real data sets with sparse counts and missing values are presented. Extensions to accommodate general distributions for observations and disturbances, intervention, non-linear models and multivariate time series are outlined.
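
The Metropolis-Hastings-within-Gibbs scheme for dynamic models described above is not reproduced here; as a small illustration of its Metropolis-Hastings ingredient, the sketch below runs a random-walk sampler on a toy non-conjugate target (the log-rate of Poisson counts under a normal prior), the kind of update such samplers embed inside a Gibbs loop. The data, prior variance, proposal scale and chain length are all illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.poisson(lam=4.0, size=30)                  # sparse count data (toy example)

    def log_posterior(theta):
        # Unnormalised log posterior of theta = log(rate) under a N(0, 10) prior.
        return np.sum(y * theta - np.exp(theta)) - theta ** 2 / (2 * 10.0)

    # Random-walk Metropolis-Hastings: propose near the current state, accept with the MH ratio.
    theta, samples = 0.0, []
    for _ in range(5000):
        proposal = theta + rng.normal(scale=0.1)
        if np.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal                           # accept the proposed move
        samples.append(theta)

    posterior_rate = np.exp(samples[1000:])            # discard burn-in, back-transform
    print("approximate posterior mean rate:", round(np.mean(posterior_rate), 2))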