13 results for class imbalance problems
in Archivo Digital para la Docencia y la Investigación - Repositorio Institucional de la Universidad del País Vasco
Abstract:
In recent years, the performance of semi-supervised learning has been theoretically investigated. However, most of this theoretical development has focussed on binary classification problems. In this paper, we take it a step further by extending the work of Castelli and Cover [1] [2] to the multi-class paradigm. Particularly, we consider the key problem in semi-supervised learning of classifying an unseen instance x into one of K different classes, using a training dataset sampled from a mixture density distribution and composed of l labelled records and u unlabelled examples. Even under the assumption of identifiability of the mixture and having infinite unlabelled examples, labelled records are needed to determine the K decision regions. Therefore, in this paper, we first investigate the minimum number of labelled examples needed to accomplish that task. Then, we propose an optimal multi-class learning algorithm which is a generalisation of the optimal procedure proposed in the literature for binary problems. Finally, we make use of this generalisation to study the probability of error when the binary class constraint is relaxed.
Abstract:
This document describes an update of the implementation of the J48Consolidated class within the WEKA platform. The J48Consolidated class implements the CTC algorithm [2][3], which builds a single decision tree based on a set of samples. The J48Consolidated class extends WEKA's J48 class, which implements the well-known C4.5 algorithm. This implementation was described in the technical report "J48Consolidated: An implementation of CTC algorithm for WEKA". The main, but not only, change in this update is the integration of the notion of coverage in order to determine the number of samples to be generated to build a consolidated tree. We define coverage as the percentage of examples of the training sample present in (or covered by) the set of generated subsamples. So, depending on the type of samples that we use, we will need more or fewer samples in order to achieve a specific value of coverage.
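The coverage definition above can be illustrated with a minimal Python sketch. It is not the J48Consolidated code: the function names `coverage` and `samples_needed` are hypothetical, and drawing each subsample uniformly without replacement is an assumption made only for illustration (the report considers several types of samples).

```python
import random


def coverage(n_train, subsamples):
    """Percentage of training indices present in at least one subsample."""
    covered = set()
    for sample in subsamples:
        covered.update(sample)
    return 100.0 * len(covered) / n_train


def samples_needed(n_train, sample_size, target_pct, seed=0):
    """Count how many subsamples are drawn before coverage reaches target_pct.

    Assumption: each subsample holds sample_size distinct indices drawn
    uniformly at random from the training sample (hypothetical scheme).
    """
    if not 0 < sample_size <= n_train:
        raise ValueError("sample_size must be in 1..n_train")
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be in 0..100")
    rng = random.Random(seed)
    covered = set()
    count = 0
    while 100.0 * len(covered) / n_train < target_pct:
        covered.update(rng.sample(range(n_train), sample_size))
        count += 1
    return count


if __name__ == "__main__":
    # Two subsamples over a 10-example training set cover indices {0, 1, 2, 3}.
    print(coverage(10, [[0, 1, 2], [2, 3]]))
    # Smaller subsamples need more draws to reach the same coverage.
    print(samples_needed(1000, 100, 90.0))
    print(samples_needed(1000, 500, 90.0))
```

Under these assumptions, smaller subsamples require more draws to reach a given coverage, which matches the abstract's remark that the number of samples depends on the type of sample used.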
Abstract:
This project analyses and compares the behaviour of the CTC algorithm, designed by the ALDAPA research group, on highly imbalanced databases. Specifically, it uses a set of databases available on the website associated with the KEEL project (http://sci2s.ugr.es/keel/index.php) that have already been used with different algorithms designed to address the class imbalance problem in the following work: A. Fernandez, S. García, J. Luengo, E. Bernadó-Mansilla, F. Herrera, "Genetics-Based Machine Learning for Rule Induction: State of the Art, Taxonomy and Comparative Study". IEEE Transactions on Evolutionary Computation 14:6 (2010) 913-941, http://dx.doi.org/10.1109/TEVC.2009.2039140. The databases (including the cross-validation samples), together with the results obtained in the experiments of that work, can be found on a website created for that purpose: http://sci2s.ugr.es/gbml/. This makes the CTC results obtained with these samples directly comparable with those obtained by all the algorithms evaluated in that work.
Abstract:
The CTC (Consolidated Tree Construction) algorithm is a machine learning paradigm that was designed to solve a class imbalance problem: a fraud detection problem in the area of car insurance [1] in which an explanation of the classification made was also required. The algorithm is based on a decision tree construction algorithm, in this case the well-known C4.5, but it extracts knowledge from data using a set of samples instead of a single one, as C4.5 does. In contrast to other methodologies that build a classifier from several samples, such as bagging, CTC builds a single tree and, as a consequence, obtains comprehensible classifiers. The main motivation of this work is to make an implementation of the CTC algorithm publicly available. For this purpose we have implemented the algorithm within the well-known WEKA data mining environment (http://www.cs.waikato.ac.nz/ml/weka/). WEKA is an open source project that contains a collection of machine learning algorithms written in Java for data mining tasks. J48 is the implementation of the C4.5 algorithm within the WEKA package. We named our implementation of the CTC algorithm J48Consolidated, as it is based on the J48 Java class.
Abstract:
Some results on fixed points related to the contractive compositions of bounded operators in a class of complete metric spaces, which can also be considered as Banach spaces, are discussed throughout the paper. The class of composite operators under study can include, in particular, sequences of projection operators under, in general, oblique projective operators. In this paper we are concerned with composite operators which include sequences of pairs of contractive operators involving, in general, oblique projection operators. The results are generalized to sequences of, in general, nonconstant bounded closed operators which can have bounded, closed, and compact limit operators, such that the relevant composite sequences are also compact operators. It is proven that, in both cases, the Banach contraction principle guarantees the existence of unique fixed points under contractive conditions.
Abstract:
In the problem of one-class classification (OCC), one of the classes, the target class, has to be distinguished from all other possible objects, considered as nontargets. This situation arises in many biomedical problems, for example in diagnosis, image-based tumor recognition, or analysis of electrocardiogram data. In this paper an approach to OCC based on a typicality test is experimentally compared with reference state-of-the-art OCC techniques (Gaussian, mixture of Gaussians, naive Parzen, Parzen, and support vector data description) using biomedical data sets. We evaluate the ability of the procedures using twelve experimental data sets with not necessarily continuous data. As there are few benchmark data sets for one-class classification, all data sets considered in the evaluation have multiple classes. Each class in turn is considered as the target class and the units in the other classes are considered as new units to be classified. The results of the comparison show the good performance of the typicality approach, which is available for high dimensional data; it is worth mentioning that it can be used for any kind of data (continuous, discrete, or nominal), whereas the application of state-of-the-art approaches is not straightforward when nominal variables are present.
Abstract:
The lack of stability in some matching problems suggests that alternative solution concepts to the core might be applied to find predictable matchings. We propose the absorbing sets as a solution for the class of roommate problems with strict preferences. This solution, which always exists, either gives the matchings in the core or predicts some other matchings when the core is empty. Furthermore, it satisfies an interesting property of outer stability. We also characterize the absorbing sets, determine their number and, in case of multiplicity, we find that they all share a similar structure.
Abstract:
The aim of this technical report is to present detailed explanations that help readers understand and use Message Passing Interface (MPI) parallel programming for solving several mixed integer optimization problems. We have developed a C++ experimental code that uses the IBM ILOG CPLEX optimizer within the COmputational INfrastructure for Operations Research (COIN-OR) and MPI parallel computing for solving the optimization models under UNIX-like systems. The computational experience illustrates how we can solve 44 optimization problems which are asymmetric with respect to the number of integer and continuous variables and the number of constraints. We also report a comparison of the speedup and efficiency of several strategies implemented for some available numbers of threads.
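As a reminder of how speedup and efficiency are usually computed, here is a short Python sketch using the standard textbook definitions (speedup = serial time / parallel time, efficiency = speedup / number of workers). The abstract does not state the exact formulas used in the report, so these definitions and the timing values in the example are assumptions for illustration only.

```python
def speedup(t_serial, t_parallel):
    """Standard speedup: serial wall-clock time divided by parallel time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_workers):
    """Standard efficiency: speedup divided by the number of workers."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return speedup(t_serial, t_parallel) / n_workers


if __name__ == "__main__":
    # Made-up timings in seconds, not results from the report.
    timings = {1: 100.0, 2: 55.0, 4: 30.0, 8: 25.0}
    for workers, t in timings.items():
        s = speedup(timings[1], t)
        e = efficiency(timings[1], t, workers)
        print(f"{workers} workers: speedup={s:.2f}, efficiency={e:.2f}")
```

An efficiency close to 1 means the extra workers are being used well; values that fall as workers are added indicate growing communication or idle-time overhead.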
Abstract:
Eguíluz, Federico; Merino, Raquel; Olsen, Vickie; Pajares, Eterio; Santamaría, José Miguel (eds.)
Abstract:
This paper investigates a class of self-adjoint compact operators in Hilbert spaces related to their truncated versions with finite-dimensional ranges. The comparisons are established in terms of worst-case norm errors of the composite operators generated from iterated computations. Some boundedness properties of the worst-case norms of the errors in their respective fixed points, when these exist, are also given. The iterated sequences are expanded in separable Hilbert spaces through the use of countable orthonormal bases.
Abstract:
This paper studies the asymptotic hyperstability of a class of continuous-time dynamic systems. The presence of a parallel connection of a strictly stable subsystem to an asymptotically hyperstable one in the feed-forward loop is allowed, and the generation of a finite or infinite number of impulsive control actions, which can be combined with a general form of nonimpulsive controls, is also admitted. The asymptotic hyperstability property is guaranteed under a set of sufficiency-type conditions for the impulsive controls.
Abstract:
353 pp.