1000 results for imbalanced learning


Relevance:

70.00%

Publisher:

Abstract:

The majority of multi-class pattern classification techniques are proposed for learning from balanced datasets. However, in several real-world domains, the datasets have an imbalanced data distribution, where some classes may have few training examples compared with other classes. In this paper we present our research in learning from imbalanced multi-class data and propose a new approach, named Multi-IM, to deal with this problem. Multi-IM derives its fundamentals from the probabilistic relational technique (PRMs-IM), designed for learning from imbalanced relational data for the two-class problem. Multi-IM extends PRMs-IM to a generalized framework for multi-class imbalanced learning for both relational and non-relational domains.

Relevance:

60.00%

Publisher:

Abstract:

Severely skewed class distributions, in which some data are underrepresented, strongly affect the performance of learning algorithms and remain a challenge in data mining and machine learning. Much current research focuses on experimental comparison of existing re-sampling approaches. We believe that new ways of constructing better algorithms are required to further balance and analyse the data set. This paper presents a Fuzzy-based Information Decomposition oversampling (FIDoS) algorithm for handling imbalanced data. Generally speaking, this is a new way of addressing imbalanced learning problems from a missing-data perspective. First, we assume that there are missing instances in the minority class that result in the imbalanced dataset. Then the proposed algorithm, which takes advantage of a fuzzy membership function, is used to transfer information to the missing minority class instances. Finally, the experimental results demonstrate that the proposed algorithm is more practical and applicable compared to sampling techniques.

Relevance:

40.00%

Publisher:

Abstract:

The problem of learning from imbalanced data is of critical importance in a large number of application domains and can be a bottleneck in the performance of various conventional learning methods that assume the data distribution to be balanced. The class imbalance problem corresponds to the situation where one class massively outnumbers the other. The imbalance between the majority and minority classes biases machine learning models and produces unreliable outcomes if the imbalanced data are used directly. There has been increasing interest in this research area and a number of algorithms have been developed. However, independent evaluation of the algorithms is limited. This paper aims at evaluating the performance of five representative data sampling methods, namely SMOTE, ADASYN, BorderlineSMOTE, SMOTETomek and RUSBoost, that deal with class imbalance problems. A comparative study is conducted and the performance of each method is critically analysed in terms of assessment metrics. © 2013 Springer-Verlag.
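
As a rough illustration of the idea behind SMOTE-style oversampling (new minority points are interpolated between an existing minority sample and one of its nearest minority neighbours), the following minimal sketch may help. It is not the evaluated implementation; the function name `smote_like`, the parameter `k`, and the toy data are assumptions made for illustration.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    Hypothetical helper, not taken from the paper."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new data never leave the region already spanned by the minority class.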

Relevance:

40.00%

Publisher:

Abstract:

Traditional learning techniques learn from flat data files with the assumption that each class has a similar number of examples. However, the majority of real-world data are stored as relational systems with imbalanced data distribution, where one class of data is over-represented as compared with other classes. We propose to extend a relational learning technique called Probabilistic Relational Models (PRMs) to deal with the imbalanced class problem. We address learning from imbalanced relational data using an ensemble of PRMs and propose a new model: the PRMs-IM. We show the performance of PRMs-IM on a real university relational database to identify students at risk.

Relevance:

40.00%

Publisher:

Abstract:

Learning from imbalanced data is a challenging task in a wide range of applications and attracts significant research effort from the machine learning and data mining communities. As a natural approach to this issue, oversampling balances the training samples by replicating existing samples or synthesizing new samples. In general, synthesis outperforms replication by supplying additional information on the minority class. However, the additional information needs to follow the same normal distribution as the training set, which further constrains the new samples to the predefined range of the training set. In this paper, we present the Wiener process oversampling (WPO) technique, which brings a physical phenomenon into sample synthesis. WPO constructs a robust decision region by expanding the attribute ranges in the training set while keeping the same normal distribution. WPO achieves satisfactory performance with much lower computational complexity. In addition, by integrating WPO with ensemble learning, the WPOBoost algorithm outperforms many prevalent imbalanced learning solutions.

Relevance:

30.00%

Publisher:

Abstract:

There is an increasing interest in the application of Evolutionary Algorithms (EAs) to induce classification rules. This hybrid approach can benefit areas where classical methods for rule induction have not been very successful. One example is the induction of classification rules in imbalanced domains. Imbalanced data occur when one or more classes heavily outnumber other classes. Frequently, classical machine learning (ML) classifiers are not able to learn in the presence of imbalanced data sets, inducing classification models that always predict the most numerous classes. In this work, we propose a novel hybrid approach to deal with this problem. We create several balanced data sets with all minority class cases and a random sample of majority class cases. These balanced data sets are fed to classical ML systems that produce rule sets. The rule sets are combined to create a pool of rules, and an EA is used to build a classifier from this pool. This hybrid approach has some advantages over undersampling, since it reduces the amount of discarded information, and some advantages over oversampling, since it avoids overfitting. The proposed approach was experimentally analysed and the experimental results show an improvement in the classification performance measured as the area under the receiver operating characteristic (ROC) curve.
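
The data-preparation step described above (several balanced data sets, each containing all minority cases plus a random sample of majority cases) can be sketched as follows. This is only an illustration: the abstract does not state the size of the majority sample, so drawing as many majority cases as there are minority cases is an assumption, as are the function name `balanced_subsets` and the toy data.

```python
import random

def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build n_subsets data sets, each with every minority case plus an
    equally sized random sample of majority cases (sample size is an
    assumption, not taken from the abstract)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        # Sample without replacement inside one subset; different subsets
        # may reuse majority cases, so less majority data is discarded overall.
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append(minority + sampled_majority)
    return subsets

# Toy data: 100 majority cases and 10 minority cases (made-up).
majority = [("maj", i) for i in range(100)]
minority = [("min", i) for i in range(10)]
subsets = balanced_subsets(majority, minority, n_subsets=5)
```

Each subset would then be passed to a rule learner; the rule induction and the evolutionary combination of rules are outside the scope of this sketch.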

Relevance:

30.00%

Publisher:

Abstract:

In this paper, a hybrid online learning model that combines the fuzzy min-max (FMM) neural network and the Classification and Regression Tree (CART) for motor fault detection and diagnosis tasks is described. The hybrid model, known as FMM-CART, incorporates the advantages of both FMM and CART for undertaking data classification (with FMM) and rule extraction (with CART) problems. In particular, the CART model is enhanced with an importance predictor-based feature selection measure. To evaluate the effectiveness of the proposed online FMM-CART model, a series of experiments using publicly available data sets containing motor bearing faults is first conducted. The results (primarily prediction accuracy and model complexity) are analyzed and compared with those reported in the literature. Then, an experimental study on detecting imbalanced voltage supply of an induction motor using a laboratory-scale test rig is performed. In addition to producing accurate results, a set of rules in the form of a decision tree is extracted from FMM-CART to provide explanations for its predictions. The results positively demonstrate the usefulness of FMM-CART with online learning capabilities in tackling real-world motor fault detection and diagnosis tasks. © 2014 Springer Science+Business Media New York.

Relevance:

30.00%

Publisher:

Abstract:

Being an important source for real-time information dissemination in recent years, Twitter is inevitably a prime target of spammers. It has been shown that the damage caused by Twitter spam can reach far beyond the social media platform itself. To mitigate the threat, many recent studies use machine learning techniques to classify Twitter spam and report very satisfactory results. However, most of the studies overlook a fundamental issue that is widely seen in real-world Twitter data, i.e., the class imbalance problem. In this paper, we show that the unequal distribution between spam and non-spam classes in the data has a great impact on the spam detection rate. To address the problem, we propose an ensemble learning approach, which involves three steps. In the first step, we adjust the class distribution in the imbalanced data set using various strategies, including random oversampling, random undersampling and fuzzy-based oversampling. In the next step, a classification model is built upon each of the redistributed data sets. In the final step, a majority voting scheme is introduced to combine all the classification models. Experimental results obtained using real-world Twitter data indicate that the proposed approach can significantly improve the spam detection rate in data sets with imbalanced class distribution.
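
The three steps above (redistribute the data, train one model per redistributed set, combine by majority vote) can be sketched roughly as follows. This is a toy illustration under several assumptions: a single numeric feature, exactly two classes, a nearest-centroid classifier standing in for the unspecified base learner, and a second random undersample in place of the fuzzy-based oversampling, which the abstract does not describe. All function names and data are hypothetical.

```python
import random
from collections import Counter

def split(data):
    """Group (x, label) pairs by label; returns [minority, majority].
    Assumes exactly two classes."""
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append((x, y))
    return sorted(by_label.values(), key=len)

def random_oversample(data, rng):
    """Replicate minority cases (with replacement) until the classes match."""
    minority, majority = split(data)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_undersample(data, rng):
    """Keep all minority cases and an equally sized sample of majority cases."""
    minority, majority = split(data)
    return minority + rng.sample(majority, len(minority))

def train_centroid(data):
    """Nearest-centroid classifier on one feature (stand-in base learner)."""
    sums = {}
    for x, y in data:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    centroids = {y: total / count for y, (total, count) in sums.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def majority_vote(models, x):
    """Return the label predicted by most models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

rng = random.Random(0)
# Toy data: 40 non-spam cases and 4 spam cases (made-up feature values).
data = [(i * 0.1, "ham") for i in range(40)] + [(5.0 + i * 0.5, "spam") for i in range(4)]

# Step 1: redistribute; step 2: one model per set; step 3: vote.
redistributed = [random_oversample(data, rng),
                 random_undersample(data, rng),
                 random_undersample(data, rng)]
models = [train_centroid(d) for d in redistributed]
prediction = majority_vote(models, 6.0)
```

Using an odd number of models avoids tied votes in the two-class case.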

Relevance:

30.00%

Publisher:

Abstract:

Multi-Task Transfer Learning (MTTL) is an efficient approach for learning from inter-related tasks with small sample size and imbalanced class distribution. Since the intensive care unit (ICU) data set (publicly available in Physionet) has subjects from four different ICU types, we hypothesize that there is an underlying relatedness amongst the various ICU types. Therefore, this study aims to explore an MTTL model for in-hospital mortality prediction of ICU patients. We applied a single-task learning (STL) approach to the augmented data as well as to individual ICU data and compared its performance with the proposed MTTL model. As performance metrics, we used sensitivity (Sens), positive predictivity (+Pred), and Score. MTTL with class balancing showed the best performance, with scores of 0.78, 0.73, 0.52 and 0.63 for ICU type 1 (Coronary care unit), 2 (Cardiac surgery unit), 3 (Medical ICU) and 4 (Surgical ICU), respectively. In contrast, the maximum score obtained using the STL approach was 0.40 for ICU types 1 and 2. These results indicate that in-hospital mortality prediction can be improved by using ICU type information and by balancing the 'non-survivor' class. The findings of the study may be useful for quantifying the quality of ICU care, managing ICU resources and selecting appropriate interventions.
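
For readers unfamiliar with the two named metrics, sensitivity is TP / (TP + FN) and positive predictivity is TP / (TP + FP), where the positive class here would be the non-survivors. The sketch below computes both from label lists; the function name and the toy labels are assumptions, and the Score metric is left out because the abstract does not define it.

```python
def sensitivity_and_ppv(y_true, y_pred, positive=1):
    """Return (sensitivity, positive predictivity) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against empty denominators (no positives, or no positive predictions).
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sens, ppv

# Toy labels: 1 = non-survivor, 0 = survivor (made-up values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
sens, ppv = sensitivity_and_ppv(y_true, y_pred)
```

Both metrics ignore true negatives, which is why they are often preferred over plain accuracy when the positive class is rare.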