2 results for WEKA
Abstract:
The CTC (Consolidated Tree Construction) algorithm is a machine learning paradigm designed to solve a class imbalance problem: fraud detection in car insurance [1], where an explanation of each classification was also required. The algorithm is based on a decision tree construction algorithm, in this case the well-known C4.5, but it extracts knowledge from data using a set of samples instead of the single sample that C4.5 uses. In contrast to other methodologies that build a classifier from several samples, such as bagging, CTC builds a single tree and therefore yields comprehensible classifiers. The main motivation of this work is to make an implementation of the CTC algorithm publicly available. To this end, we have implemented the algorithm within the well-known WEKA data mining environment (http://www.cs.waikato.ac.nz/ml/weka/). WEKA is an open source project that contains a collection of machine learning algorithms written in Java for data mining tasks. J48 is the implementation of the C4.5 algorithm within the WEKA package. We named our implementation of the CTC algorithm J48Consolidated, as it is based on the J48 Java class.
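To illustrate how several samples can still yield a single tree, the following is a minimal sketch that assumes the subsamples agree on each node's split by simple majority vote. The class name, the `NO_SPLIT` marker, and the tie-breaking rule (first proposal seen wins) are hypothetical and are not taken from the J48Consolidated source.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of WEKA or J48Consolidated.
final class ConsolidatedVoteSketch {

    // Assumed marker a subsample uses to propose "make this node a leaf".
    static final String NO_SPLIT = "<leaf>";

    // Returns the proposal with the most votes. Each element of
    // `proposals` is the split attribute preferred by one subsample.
    // Ties are broken in favour of the proposal seen first (an
    // arbitrary choice made for this sketch).
    static String consolidate(List<String> proposals) {
        if (proposals.isEmpty()) {
            throw new IllegalArgumentException("at least one proposal is required");
        }
        // LinkedHashMap keeps insertion order, which makes tie-breaking deterministic.
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (String proposal : proposals) {
            votes.merge(proposal, 1, Integer::sum);
        }
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                winner = entry.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // Three subsamples propose a split; two of them agree on "age".
        System.out.println(consolidate(Arrays.asList("age", "income", "age")));
    }
}
```

Under this assumption every subsample is then partitioned by the winning attribute, so all of them follow the same tree structure; when `NO_SPLIT` wins, the node becomes a leaf.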
Abstract:
This document describes an update to the implementation of the J48Consolidated class within the WEKA platform. The J48Consolidated class implements the CTC algorithm [2][3], which builds a single decision tree from a set of samples. It extends WEKA’s J48 class, which implements the well-known C4.5 algorithm. The original implementation was described in the technical report "J48Consolidated: An implementation of CTC algorithm for WEKA". The main change in this update, though not the only one, is the integration of the notion of coverage to determine the number of samples to generate when building a consolidated tree. We define coverage as the percentage of examples of the training sample that are present in (or covered by) the set of generated subsamples. Thus, depending on the type of samples used, more or fewer samples will be needed to achieve a specific coverage value.
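The relation between coverage and the number of samples can be sketched under a simplifying assumption: each subsample is drawn independently and contains a fixed fraction p of the training examples, chosen uniformly. An example is then missed by all n subsamples with probability (1 - p)^n, so the expected coverage is 1 - (1 - p)^n. The class and method names below are hypothetical, coverage is expressed as a fraction rather than a percentage, and the formula actually used by J48Consolidated may differ (for instance, for class-balanced subsamples).

```java
// Hypothetical illustration only; not part of WEKA or J48Consolidated.
final class CoverageSketch {

    // Expected fraction of training examples present in at least one of
    // n independent subsamples, each holding a fraction p of the examples.
    static double expectedCoverage(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    // Smallest n whose expected coverage reaches `target`,
    // obtained by solving 1 - (1 - p)^n >= target for n.
    static int samplesNeeded(double p, double target) {
        if (p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in (0, 1]");
        }
        if (target <= 0.0 || target >= 1.0) {
            throw new IllegalArgumentException("target must be in (0, 1)");
        }
        if (p == 1.0) {
            return 1; // a single subsample already contains every example
        }
        double n = Math.log(1.0 - target) / Math.log(1.0 - p);
        return Math.max(1, (int) Math.ceil(n));
    }

    public static void main(String[] args) {
        // Larger subsamples need fewer repetitions for the same coverage.
        System.out.println(samplesNeeded(0.5, 0.99)); // prints 7
        System.out.println(samplesNeeded(0.1, 0.99)); // prints 44
    }
}
```

In this sketch, subsamples holding half of the training examples reach 99% expected coverage with 7 samples, whereas subsamples holding a tenth of them need 44, which is the sense in which the sample type determines how many samples are required.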