971 resultados para Decision trees
Resumo:
The Prism family of algorithms induces modular classification rules which, in contrast to decision tree induction algorithms, do not necessarily fit together into a decision tree structure. Classifiers induced by Prism algorithms achieve a comparable accuracy compared with decision trees and in some cases even outperform decision trees. Both kinds of algorithms tend to overfit on large and noisy datasets and this has led to the development of pruning methods. Pruning methods use various metrics to truncate decision trees or to eliminate whole rules or single rule terms from a Prism rule set. For decision trees many pre-pruning and postpruning methods exist, however for Prism algorithms only one pre-pruning method has been developed, J-pruning. Recent work with Prism algorithms examined J-pruning in the context of very large datasets and found that the current method does not use its full potential. This paper revisits the J-pruning method for the Prism family of algorithms and develops a new pruning method Jmax-pruning, discusses it in theoretical terms and evaluates it empirically.
Resumo:
The Prism family of algorithms induces modular classification rules in contrast to the Top Down Induction of Decision Trees (TDIDT) approach which induces classification rules in the intermediate form of a tree structure. Both approaches achieve a comparable classification accuracy. However in some cases Prism outperforms TDIDT. For both approaches pre-pruning facilities have been developed in order to prevent the induced classifiers from overfitting on noisy datasets, by cutting rule terms or whole rules or by truncating decision trees according to certain metrics. There have been many pre-pruning mechanisms developed for the TDIDT approach, but for the Prism family the only existing pre-pruning facility is J-pruning. J-pruning not only works on Prism algorithms but also on TDIDT. Although it has been shown that J-pruning produces good results, this work points out that J-pruning does not use its full potential. The original J-pruning facility is examined and the use of a new pre-pruning facility, called Jmax-pruning, is proposed and evaluated empirically. A possible pre-pruning facility for TDIDT based on Jmax-pruning is also discussed.
Resumo:
Ensemble learning techniques generate multiple classifiers, so called base classifiers, whose combined classification results are used in order to increase the overall classification accuracy. In most ensemble classifiers the base classifiers are based on the Top Down Induction of Decision Trees (TDIDT) approach. However, an alternative approach for the induction of rule based classifiers is the Prism family of algorithms. Prism algorithms produce modular classification rules that do not necessarily fit into a decision tree structure. Prism classification rulesets achieve a comparable and sometimes higher classification accuracy compared with decision tree classifiers, if the data is noisy and large. Yet Prism still suffers from overfitting on noisy and large datasets. In practice ensemble techniques tend to reduce the overfitting, however there exists no ensemble learner for modular classification rule inducers such as the Prism family of algorithms. This article describes the first development of an ensemble learner based on the Prism family of algorithms in order to enhance Prism’s classification accuracy by reducing overfitting.
Resumo:
In order to gain knowledge from large databases, scalable data mining technologies are needed. Data are captured on a large scale and thus databases are increasing at a fast pace. This leads to the utilisation of parallel computing technologies in order to cope with large amounts of data. In the area of classification rule induction, parallelisation of classification rules has focused on the divide and conquer approach, also known as the Top Down Induction of Decision Trees (TDIDT). An alternative approach to classification rule induction is separate and conquer which has only recently been in the focus of parallelisation. This work introduces and evaluates empirically a framework for the parallel induction of classification rules, generated by members of the Prism family of algorithms. All members of the Prism family of algorithms follow the separate and conquer approach.
Resumo:
Advances in hardware and software in the past decade allow to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence from these advances in order to cope with the real time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data of continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, in some applications it is required to analyse the data in real time as soon as it is being captured. Such cases are for example if the data stream is infinite, fast changing, or simply too large in size to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new and removing old rules. It is different from the more popular decision tree based classifiers as it tends to leave data instances rather unclassified than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting which will also address the problem of changing feature domain values.
Resumo:
Generally classifiers tend to overfit if there is noise in the training data or there are missing values. Ensemble learning methods are often used to improve a classifier's classification accuracy. Most ensemble learning approaches aim to improve the classification accuracy of decision trees. However, alternative classifiers to decision trees exist. The recently developed Random Prism ensemble learner for classification aims to improve an alternative classification rule induction approach, the Prism family of algorithms, which addresses some of the limitations of decision trees. However, Random Prism suffers like any ensemble learner from a high computational overhead due to replication of the data and the induction of multiple base classifiers. Hence even modest sized datasets may impose a computational challenge to ensemble learners such as Random Prism. Parallelism is often used to scale up algorithms to deal with large datasets. This paper investigates parallelisation for Random Prism, implements a prototype and evaluates it empirically using a Hadoop computing cluster.
Resumo:
Ensemble learning can be used to increase the overall classification accuracy of a classifier by generating multiple base classifiers and combining their classification results. A frequently used family of base classifiers for ensemble learning are decision trees. However, alternative approaches can potentially be used, such as the Prism family of algorithms that also induces classification rules. Compared with decision trees, Prism algorithms generate modular classification rules that cannot necessarily be represented in the form of a decision tree. Prism algorithms produce a similar classification accuracy compared with decision trees. However, in some cases, for example, if there is noise in the training and test data, Prism algorithms can outperform decision trees by achieving a higher classification accuracy. However, Prism still tends to overfit on noisy data; hence, ensemble learners have been adopted in this work to reduce the overfitting. This paper describes the development of an ensemble learner using a member of the Prism family as the base classifier to reduce the overfitting of Prism algorithms on noisy datasets. The developed ensemble classifier is compared with a stand-alone Prism classifier in terms of classification accuracy and resistance to noise.
Resumo:
Prism is a modular classification rule generation method based on the ‘separate and conquer’ approach that is alternative to the rule induction approach using decision trees also known as ‘divide and conquer’. Prism often achieves a similar level of classification accuracy compared with decision trees, but tends to produce a more compact noise tolerant set of classification rules. As with other classification rule generation methods, a principle problem arising with Prism is that of overfitting due to over-specialised rules. In addition, over-specialised rules increase the associated computational complexity. These problems can be solved by pruning methods. For the Prism method, two pruning algorithms have been introduced recently for reducing overfitting of classification rules - J-pruning and Jmax-pruning. Both algorithms are based on the J-measure, an information theoretic means for quantifying the theoretical information content of a rule. Jmax-pruning attempts to exploit the J-measure to its full potential because J-pruning does not actually achieve this and may even lead to underfitting. A series of experiments have proved that Jmax-pruning may outperform J-pruning in reducing overfitting. However, Jmax-pruning is computationally relatively expensive and may also lead to underfitting. This paper reviews the Prism method and the two existing pruning algorithms above. It also proposes a novel pruning algorithm called Jmid-pruning. The latter is based on the J-measure and it reduces overfitting to a similar level as the other two algorithms but is better in avoiding underfitting and unnecessary computational effort. The authors conduct an experimental study on the performance of the Jmid-pruning algorithm in terms of classification accuracy and computational efficiency. The algorithm is also evaluated comparatively with the J-pruning and Jmax-pruning algorithms.
Resumo:
Land cover maps at different resolutions and mapping extents contribute to modeling and support decision making processes. Because land cover affects and is affected by climate change, it is listed among the 13 terrestrial essential climate variables. This paper describes the generation of a land cover map for Latin America and the Caribbean (LAC) for the year 2008. It was developed in the framework of the project Latin American Network for Monitoring and Studying of Natural Resources (SERENA), which has been developed within the GOFC-GOLD Latin American network of remote sensing and forest fires (RedLaTIF). The SERENA land cover map for LAC integrates: 1) the local expertise of SERENA network members to generate the training and validation data, 2) a methodology for land cover mapping based on decision trees using MODIS time series, and 3) class membership estimates to account for pixel heterogeneity issues. The discrete SERENA land cover product, derived from class memberships, yields an overall accuracy of 84% and includes an additional layer representing the estimated per-pixel confidence. The study demonstrates in detail the use of class memberships to better estimate the area of scarce classes with a scattered spatial distribution. The land cover map is already available as a printed wall map and will be released in digital format in the near future. The SERENA land cover map was produced with a legend and classification strategy similar to that used by the North American Land Change Monitoring System (NALCMS) to generate a land cover map of the North American continent, that will allow to combine both maps to generate consistent data across America facilitating continental monitoring and modeling
Resumo:
The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism also does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, novel theoretical analysis of the proposed technique and in-depth experimental study that show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. Expressiveness of decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user’s trust in the system.
Resumo:
In a global economy, manufacturers mainly compete with cost efficiency of production, as the price of raw materials are similar worldwide. Heavy industry has two big issues to deal with. On the one hand there is lots of data which needs to be analyzed in an effective manner, and on the other hand making big improvements via investments in cooperate structure or new machinery is neither economically nor physically viable. Machine learning offers a promising way for manufacturers to address both these problems as they are in an excellent position to employ learning techniques with their massive resource of historical production data. However, choosing modelling a strategy in this setting is far from trivial and this is the objective of this article. The article investigates characteristics of the most popular classifiers used in industry today. Support Vector Machines, Multilayer Perceptron, Decision Trees, Random Forests, and the meta-algorithms Bagging and Boosting are mainly investigated in this work. Lessons from real-world implementations of these learners are also provided together with future directions when different learners are expected to perform well. The importance of feature selection and relevant selection methods in an industrial setting are further investigated. Performance metrics have also been discussed for the sake of completion.
Resumo:
Este trabalho analisa, sob uma perspectiva quantitativa, a retenção de clientes durante o processo de renegociação de créditos inadimplentes. O foco principal é entender quais são as variáveis que explicam a retenção destes clientes e, portanto, aprimorar o processo de cobrança de uma instituição financeira no Brasil. O tema se torna relevante à medida em que vários fatores tornam a competitividade mais difícil no ambiente de crédito no país: a concentração bancária vivida na última década, o aumento da oferta de crédito nos últimos anos, a redução dos spreads bancários, e por fim a crise econômica global que afeta em especial o setor financeiro. A pesquisa procura investigar quais variáveis melhor explicam o fenômeno da retenção. Para tanto, foram segregados clientes projetados como rentáveis pela cadeia de Markov. Em seguida, testou-se a aderência de variáveis cadastrais e contratuais à variável-resposta retenção, por duas metodologias: o algoritmo CHAID da árvore de decisão e o método stepwise da regressão logística. Os resultados indicam que o método CHAID selecionou 7 e o stepwise 8 variáveis, sendo algumas de natureza cadastral e outras que vêm do próprio contrato de renegociação. Dado que as condições do contrato influenciam a retenção e portanto o valor do cliente, sugere-se que o processo de oferta incorpore operacionalmente a noção de retenção na atividade da cobrança.
Resumo:
O cenário de continuo aumento do consumo de derivados do petróleo aliado a conscientização de que é necessário existir um equilíbrio com relação a exploração de recursos naturais e preservação do meio ambiente, vem impulsionando a busca por fontes alternativas de energia. Esse crescente interesse vem se aplicando a geração de energia a partir de biomassa da cana de açúcar, que vem se tornando cada vez mais comuns no Brasil, porém ainda existe um imenso potencial a ser explorado. Dentro deste contexto, se torna relevante a tomada de decisão de investimentos em projetos de cogeração e este trabalho busca incrementar a analise e tomada de decisão com a utilização da Teoria das Opções Reais, uma ferramenta de agregação de valor às incertezas, cabendo perfeitamente ao modelo energético brasileiro, onde grandes volatilidades do preço de energia são observadas ao longo dos anos. O objetivo do trabalho é determinar o melhor momento para uma biorrefinaria investir em unidades de cogeração. A estrutura do trabalho foi dividida em três cenários de porte de biorrefinarias, as de 2 milhões de capacidade de moagem de cana-de-açúcar por ano, as de 4 milhões e as de 6 milhões, visando assim ter uma representação amostral das biorrefinarias do país. Além disso, analisaram-se três cenários de volatilidade atrelados ao preço futuro de energia, dado que a principal variável de viabilização deste tipo de projeto é o preço de energia. As volatilidades foram calculadas de acordo com histórico do ambiente regulado, o dobro do ambiente regulado e projeção de PLD, representando, respectivamente, níveis baixos, médios e altos, de volatilidade do preço de energia. Após isso, foram elaboradas as nove árvores de decisão, que demonstram para os gestores de investimento que em um cenário de baixa volatilidade cria-se valor estar posicionado e ter a opção real de investir ou adiar investimento para qualquer porte de usina. No cenário de média volatilidade de preço, aconselha-se ao gestor estar posicionado em usinas de médio a grande porte para viabilização do investimento. Por fim, quando o cenário de preços é de grande volatilidade, tem-se um maior risco e existe a maior probabilidade de viabilização do investimento em usinas de grande porte.
Resumo:
The industries are getting more and more rigorous, when security is in question, no matter is to avoid financial damages due to accidents and low productivity, or when it s related to the environment protection. It was thinking about great world accidents around the world involving aircrafts and industrial process (nuclear, petrochemical and so on) that we decided to invest in systems that could detect fault and diagnosis (FDD) them. The FDD systems can avoid eventual fault helping man on the maintenance and exchange of defective equipments. Nowadays, the issues that involve detection, isolation, diagnose and the controlling of tolerance fault are gathering strength in the academic and industrial environment. It is based on this fact, in this work, we discuss the importance of techniques that can assist in the development of systems for Fault Detection and Diagnosis (FDD) and propose a hybrid method for FDD in dynamic systems. We present a brief history to contextualize the techniques used in working environments. The detection of fault in the proposed system is based on state observers in conjunction with other statistical techniques. The principal idea is to use the observer himself, in addition to serving as an analytical redundancy, in allowing the creation of a residue. This residue is used in FDD. A signature database assists in the identification of system faults, which based on the signatures derived from trend analysis of the residue signal and its difference, performs the classification of the faults based purely on a decision tree. This FDD system is tested and validated in two plants: a simulated plant with coupled tanks and didactic plant with industrial instrumentation. All collected results of those tests will be discussed
Resumo:
Equipment maintenance is the major cost factor in industrial plants, it is very important the development of fault predict techniques. Three-phase induction motors are key electrical equipments used in industrial applications mainly because presents low cost and large robustness, however, it isn t protected from other fault types such as shorted winding and broken bars. Several acquisition ways, processing and signal analysis are applied to improve its diagnosis. More efficient techniques use current sensors and its signature analysis. In this dissertation, starting of these sensors, it is to make signal analysis through Park s vector that provides a good visualization capability. Faults data acquisition is an arduous task; in this way, it is developed a methodology for data base construction. Park s transformer is applied into stationary reference for machine modeling of the machine s differential equations solution. Faults detection needs a detailed analysis of variables and its influences that becomes the diagnosis more complex. The tasks of pattern recognition allow that systems are automatically generated, based in patterns and data concepts, in the majority cases undetectable for specialists, helping decision tasks. Classifiers algorithms with diverse learning paradigms: k-Neighborhood, Neural Networks, Decision Trees and Naïves Bayes are used to patterns recognition of machines faults. Multi-classifier systems are used to improve classification errors. It inspected the algorithms homogeneous: Bagging and Boosting and heterogeneous: Vote, Stacking and Stacking C. Results present the effectiveness of constructed model to faults modeling, such as the possibility of using multi-classifiers algorithm on faults classification