1000 results for imbalance data


Relevance:

60.00%

Publisher:

Abstract:

In this paper we propose, develop, and test a new single-feature evaluator called Significant Proportion of Target Instances (SPTI) to handle direct-marketing data with a class imbalance problem. The SPTI feature evaluator demonstrates its stability and outstanding performance in empirical experiments that use the real-world customer data of an e-recruitment firm. This research demonstrates that feature selection using SPTI successfully improves the classifier’s performance in terms of two practical performance metrics. Additionally, we show that it outperforms other well-known feature selection methods and state-of-the-art remedies to the class-imbalance problem. Practically, the findings, when used with the classification model, will help telemarketers to better understand their customers.

Relevance:

60.00%

Publisher:

Abstract:

Biplot graphics are widely employed in the study of genotype-environment interactions, but they are only a graphical tool without a statistical hypothesis test. The singular values and scores (singular vectors) used in biplots correspond to specific estimates of the model parameters, and the use of uncertainty measures may lead to different conclusions from those provided by a simple visual evaluation. The aim of this work is to estimate the genotype-environment interactions, using AMMI analysis, through a Bayesian approach. The credibility intervals can therefore be used for decision-making in different analysis situations, which makes it possible to verify the consistency of the selection and recommendation of cultivars. Two analyses were performed. The first analysis looked into 10 regular commercial hybrids and all 45 possible hybrids obtained from them. They were assessed in 15 locations. The second analysis evaluated 28 hybrids in 35 different environments, with imbalanced data. The ellipses were grouped according to the pattern of interaction in the biplot. The AMMI analysis with a Bayesian approach proved to be a complete analysis of stability and adaptability, which provides important information that may help breeders in their decisions. The regions of credibility, built in the biplots, allow an accurate selection and a precise genotype recommendation, with a level of credibility. Genotypes and environments can be grouped according to the existing interaction pattern, which makes it possible to formulate specific recommendations. Moreover, the environments can be evaluated in order to find out which ones contribute similarly to the interaction and which should be discarded. The method makes it possible to deal with imbalanced data in a natural way, showing efficiency for multi-environment trials.
The prediction takes into account instability and the interaction pattern of the observed data, in order to establish a direct comparison between genotypes of both 1st and 2nd seasons.

Relevance:

40.00%

Publisher:

Abstract:

By analyzing a comprehensive dataset on transport transactions in Japan, we describe a directional imbalance in freight rates by transport mode and examine its potential sources, such as economies of density and directionally imbalanced transport flow. There are certain numbers of observed links which show asymmetric transport costs. Instrumental variable analysis is used to show that economies of density account for deviation from symmetric freight rates between prefectures. Our results show that a 10% increase in outbound transport flow relative to inbound transport flow leads to a 2.1% decrease in outbound freight rate relative to inbound freight rate.

Relevance:

30.00%

Publisher:

Abstract:

Imbalance is not only a direct major cause of downtime in wind turbines, but also accelerates the degradation of neighbouring and downstream components (e.g. main bearing, generator). Along with detection, imbalance quantification is also essential, as some residual imbalance always exists even in a healthy turbine. Three commonly used sensor technologies (vibration, acoustic emission and electrical measurements) are investigated in this work to verify their sensitivity to different imbalance grades. This study is based on data obtained by experimental tests performed on a small-scale wind turbine drive train test-rig for different shaft speeds and imbalance levels. According to the analysis results, electrical measurements seem to be the most suitable for tracking the development of imbalance.

Relevance:

30.00%

Publisher:

Abstract:

The problem of learning from imbalanced data is of critical importance in a large number of application domains and can be a bottleneck in the performance of various conventional learning methods that assume the data distribution to be balanced. The class imbalance problem corresponds to the situation where one class massively outnumbers the other. The imbalance between majority and minority classes leads machine learning to be biased and to produce unreliable outcomes if the imbalanced data is used directly. There has been increasing interest in this research area and a number of algorithms have been developed. However, independent evaluation of the algorithms is limited. This paper aims at evaluating the performance of five representative data sampling methods, namely SMOTE, ADASYN, BorderlineSMOTE, SMOTETomek and RUSBoost, that deal with class imbalance problems. A comparative study is conducted and the performance of each method is critically analysed in terms of assessment metrics. © 2013 Springer-Verlag.
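As a rough illustration of what a sampling remedy of this kind does, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, which is the core idea behind SMOTE. It is a simplified, standard-library-only stand-in and not the reference SMOTE implementation or the setup evaluated in the paper; the function name `smote_like`, the toy data, and the choice `k=3` are assumptions made for this example.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style).
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Toy data (assumed): 4 minority points against 20 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 5)) for i in range(3, 23)]

new_points = smote_like(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

After the call, `balanced_minority` has as many points as `majority`, and every synthetic point lies on a segment between two original minority points. Variants such as BorderlineSMOTE and ADASYN differ mainly in which minority points they choose as the base for interpolation.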

Relevance:

30.00%

Publisher:

Abstract:

IMPORTANCE Systematic reviews and meta-analyses of individual participant data (IPD) aim to collect, check, and reanalyze individual-level data from all studies addressing a particular research question and are therefore considered a gold standard approach to evidence synthesis. They are likely to be used with increasing frequency as current initiatives to share clinical trial data gain momentum and may be particularly important in reviewing controversial therapeutic areas.

OBJECTIVE To develop PRISMA-IPD as a stand-alone extension to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Statement, tailored to the specific requirements of reporting systematic reviews and meta-analyses of IPD. Although developed primarily for reviews of randomized trials, many items will apply in other contexts, including reviews of diagnosis and prognosis.

DESIGN Development of PRISMA-IPD followed the EQUATOR Network framework guidance and used the existing standard PRISMA Statement as a starting point to draft additional relevant material. A web-based survey informed discussion at an international workshop that included researchers, clinicians, methodologists experienced in conducting systematic reviews and meta-analyses of IPD, and journal editors. The statement was drafted and iterative refinements were made by the project, advisory, and development groups. The PRISMA-IPD Development Group reached agreement on the PRISMA-IPD checklist and flow diagram by consensus.

FINDINGS Compared with standard PRISMA, the PRISMA-IPD checklist includes 3 new items that address (1) methods of checking the integrity of the IPD (such as pattern of randomization, data consistency, baseline imbalance, and missing data), (2) reporting any important issues that emerge, and (3) exploring variation (such as whether certain types of individual benefit more from the intervention than others). A further additional item was created by reorganization of standard PRISMA items relating to interpreting results. Wording was modified in 23 items to reflect the IPD approach.

CONCLUSIONS AND RELEVANCE PRISMA-IPD provides guidelines for reporting systematic reviews and meta-analyses of IPD.

Relevance:

30.00%

Publisher:

Abstract:

Background: We examined whether higher effort-reward imbalance (ERI) and lower job control are associated with exit from the labour market. 

Methods: There were 1263 participants aged 50-74 years from the English Longitudinal Study on Ageing with data on working status and work-related psychosocial factors at baseline (wave 2; 2004-2005), and working status at follow-up (wave 5; 2010-2011). Psychosocial factors at work were assessed using a short validated version of ERI and job control. An allostatic load index was formed using 13 biological parameters. Depressive symptoms were measured using the Center for Epidemiologic Studies Depression Scale. Exit from the labour market was defined as not working in the labour market when 61 years old or younger in 2010-2011. 

Results: Higher ERI OR=1.62 (95% CI 1.01 to 2.61, p=0.048) predicted exit from the labour market independent of age, sex, education, occupational class, allostatic load and depression. Job control OR=0.60 (95% CI 0.42 to 0.85, p=0.004) was associated with exit from the labour market independent of age, sex, education, occupation and depression. The association of higher effort OR=1.32 (95% CI 1.01 to 1.73, p=0.045) with exit from the labour market was independent of age, sex and depression but attenuated to non-significance when additionally controlling for socioeconomic measures. Reward was not related to exit from the labour market. 

Conclusions: Stressful work conditions can be a risk for exiting the labour market before the age of 61 years. Neither socioeconomic position nor allostatic load and depressive symptoms seem to explain this association.

Relevance:

30.00%

Publisher:

Abstract:

Recent research has shown that Lighthill–Ford spontaneous gravity wave generation theory, when applied to numerical model data, can help predict areas of clear-air turbulence. It is hypothesized that this is the case because spontaneously generated atmospheric gravity waves may initiate turbulence by locally modifying the stability and wind shear. As an improvement on the original research, this paper describes the creation of an ‘operational’ algorithm (ULTURB) with three modifications to the original method: (1) extending the altitude range for which the method is effective downward to the top of the boundary layer, (2) adding turbulent kinetic energy production from the environment to the locally produced turbulent kinetic energy production, and (3) transforming turbulent kinetic energy dissipation to eddy dissipation rate, the turbulence metric becoming the worldwide ‘standard’. In a comparison of ULTURB with the original method and with the Graphical Turbulence Guidance second version (GTG2) automated procedure for forecasting mid- and upper-level aircraft turbulence, ULTURB performed better for all turbulence intensities. Since ULTURB, unlike GTG2, is founded on a self-consistent dynamical theory, it may offer forecasters better insight into the causes of clear-air turbulence and may ultimately enhance its predictability.

Relevance:

30.00%

Publisher:

Abstract:

Combining satellite data, atmospheric reanalyses and climate model simulations, variability in the net downward radiative flux imbalance at the top of Earth's atmosphere (N) is reconstructed and linked to recent climate change. Over the 1985-1999 period mean N (0.34 ± 0.67 Wm–2) is lower than for the 2000-2012 period (0.62 ± 0.43 Wm–2, uncertainties at 90% confidence level) despite the slower rate of surface temperature rise since 2000. While the precise magnitude of N remains uncertain, the reconstruction captures interannual variability which is dominated by the eruption of Mt. Pinatubo in 1991 and the El Niño Southern Oscillation. Monthly deseasonalized interannual variability in N generated by an ensemble of 9 climate model simulations using prescribed sea surface temperature and radiative forcings and from the satellite-based reconstruction is significantly correlated (r ∼ 0.6) over the 1985-2012 period.

Relevance:

30.00%

Publisher:

Abstract:

Compared with conventional two-class learning schemes, one-class classification simply uses a single class in the classifier training phase. Applying one-class classification to learn from an unbalanced data set is regarded as recognition-based learning and has been shown to have the potential of achieving better performance. Similar to two-class learning, parameter selection is a significant issue, especially when the classifier is sensitive to the parameters. For one-class learning schemes with a kernel function, such as one-class Support Vector Machine and Support Vector Data Description, besides the parameters involved in the kernel, there is another one-class specific parameter: the rejection rate v. In this paper, we propose a general framework to involve the majority class in solving the parameter selection problem. In this framework, we first use the minority target class for training in the one-class classification stage; then we use both minority and majority classes for estimating the generalization performance of the constructed classifier. This generalization performance is set as the optimization criterion. We employed Grid search and Experiment Design search to attain various parameter settings. Experiments on UCI and Reuters text data show that the parameter-optimized one-class classifiers outperform all the standard one-class learning schemes we examined.
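The two-stage framework described above can be sketched as follows: fit a one-class model on minority samples only, then score each candidate rejection rate on validation data from both classes and keep the best one. This is a minimal sketch under stated assumptions. The centroid-and-radius model is a simple stand-in for one-class SVM or SVDD, balanced accuracy is an assumed choice of generalization measure (the abstract does not name one), and all function names and toy data are hypothetical. The rejection rate, written v in the abstract, is named `nu` here.

```python
import math


def fit_one_class(targets, nu):
    """Fit a centroid-and-radius description of the target class.
    The radius is chosen so that about a fraction `nu` of the training
    targets falls outside it (the rejection rate)."""
    dim = len(targets[0])
    centre = tuple(sum(p[i] for p in targets) / len(targets) for i in range(dim))
    dists = sorted(math.dist(p, centre) for p in targets)
    n_reject = round(nu * len(dists))
    keep = max(1, len(dists) - n_reject)
    return centre, dists[keep - 1]


def accepts(model, point):
    """True if the point falls inside the learned target region."""
    centre, radius = model
    return math.dist(point, centre) <= radius


def balanced_accuracy(model, minority_val, majority_val):
    """Mean of the acceptance rate on minority (target) samples and the
    rejection rate on majority samples."""
    tpr = sum(accepts(model, p) for p in minority_val) / len(minority_val)
    tnr = sum(not accepts(model, p) for p in majority_val) / len(majority_val)
    return (tpr + tnr) / 2


def select_nu(minority_train, minority_val, majority_val, grid):
    """Grid search: train on the minority class only, evaluate on both."""
    scored = [
        (balanced_accuracy(fit_one_class(minority_train, nu),
                           minority_val, majority_val), nu)
        for nu in grid
    ]
    return max(scored)  # (best score, best nu)


# Toy data (assumed). The last training point is a minority outlier.
minority_train = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2), (3.0, 3.0)]
minority_val = [(0.1, 0.1), (0.0, -0.1), (0.3, 0.2)]
majority_val = [(2.0, 2.0), (2.5, 1.5), (1.8, 2.6), (3.0, 2.0)]

best_score, best_nu = select_nu(minority_train, minority_val, majority_val,
                                grid=[0.0, 0.2, 0.4])
```

In this toy setting a rejection rate of 0 lets the outlier inflate the radius until every majority point is accepted, while too high a rate starts rejecting genuine minority points. The majority class is never used for fitting, only for choosing the parameter.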

Relevance:

30.00%

Publisher:

Abstract:

Compared with conventional two-class learning schemes, one-class classification simply uses a single class for training purposes. Applying one-class classification to the minority class in an imbalanced data set has been shown to achieve better performance than the two-class approach. In this paper, in order to make the best use of all the available information during the learning procedure, we propose a general framework which first uses the minority class for training in the one-class classification stage, and then uses both minority and majority classes for estimating the generalization performance of the constructed classifier. Based upon this generalization performance measurement, a parameter search algorithm selects the best parameter settings for this classifier. Experiments on UCI and Reuters text data show that one-class SVM embedded in this framework achieves much better performance than the standard one-class SVM alone and other learning schemes, such as one-class Naive Bayes, one-class nearest neighbour and neural network.

Relevance:

30.00%

Publisher:

Abstract:

Background
Medical and biological data commonly have small sample sizes, missing values, and most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and then combined with the minority class to form a balanced dataset.

Results
One important finding of this study is that different classifiers and metrics often provide different evaluation results. Nevertheless, the proposed hybrid system demonstrates consistent improvements over several alternative methods with three different metrics. The sampling results also demonstrate good generalization on different types of classification algorithms, indicating the advantage of information fusion applied in the hybrid system.

Conclusion
The experimental results demonstrate that unlike many currently available methods which often perform unevenly with different datasets the proposed hybrid system has a better generalization property which alleviates the method-data dependency problem. From the biological perspective, the system provides indication for further investigation of the highly ranked samples, which may result in the discovery of new conditions or disease subtypes.

Relevance:

30.00%

Publisher:

Abstract:

Data in many biological problems are often compounded by imbalanced class distribution. That is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms such as support vector machine (SVM) are sensitive to data with imbalanced class distribution, and result in suboptimal classification. It is desirable to compensate for the imbalance effect in model training for more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve a higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and those in widely used ensemble approaches such as bagging and boosting.
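To make the idea of "multiple roughly balanced base classifiers" concrete, the sketch below builds the training subsets such an ensemble could use: the negatives are split into chunks of about the size of the positive class, and each chunk is paired with all positives. Note the substitution: the chunks here are drawn at random, whereas the paper selects them with its sample subset optimization technique, which is not reproduced. The function name and toy data are assumptions.

```python
import random


def balanced_subsets(positives, negatives, seed=0):
    """Split the majority (negative) class into chunks of roughly the
    minority size and pair each chunk with all positives.
    Random partition used as a stand-in for an optimized selection."""
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    size = len(positives)  # assumes at least one positive sample
    subsets = []
    for start in range(0, len(shuffled), size):
        chunk = shuffled[start:start + size]
        subsets.append((positives[:], chunk))
    return subsets


# Toy data (assumed): 5 positives against 23 negatives.
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(23)]

subsets = balanced_subsets(positives, negatives)
```

Each `(positives, chunk)` pair would then train one base classifier, and the ensemble would combine their outputs, for example by averaging decision scores. Every negative sample is used exactly once across the subsets, so no majority-class information is thrown away.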

Relevance:

30.00%

Publisher:

Abstract:

With the arrival of the big data era, Internet traffic is growing exponentially. A wide variety of applications arise on the Internet, and traffic classification is introduced to help people manage the massive number of applications on the Internet for security monitoring and quality of service purposes. A large number of Machine Learning (ML) algorithms have been introduced to deal with traffic classification. A significant challenge to classification performance comes from the imbalanced distribution of data in traffic classification systems. In this paper, we propose an Optimised Distance-based Nearest Neighbor (ODNN), which is capable of improving the classification performance on imbalanced traffic data. We analyzed the proposed ODNN approach and its performance benefit from both theoretical and empirical perspectives. A large number of experiments were carried out on a real-world traffic dataset. The results show that the performance of “small classes” can be improved significantly even with only a small amount of training data, while the performance of “large classes” remains stable.

Relevance:

30.00%

Publisher:

Abstract:

In this paper we examine whether order imbalances can predict the Chinese stock market returns. We use intraday data, a panel data predictive regression model that accounts for persistent and endogenous order imbalances and cross-sectional dependence in returns, and show that order imbalances predict stock returns from 1-minute trading to 90-minute trading. On the basis of this predictability evidence using multiple trading strategies we show that profits persist during the day. These results imply that a source of Chinese market inefficiency is order imbalances.