875 resultados para Rule induction
Resumo:
In a world where massive amounts of data are recorded on a large scale we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology to predict the classification of newly recorded data. However alternative technologies have been derived that often produce better rules but do not scale well on large datasets. Such an alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but does not scale well on large datasets. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and investigate its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale up results of a first prototype implementation.
Resumo:
In a world where data is captured on a large scale the major challenge for data mining algorithms is to be able to scale up to large datasets. There are two main approaches to inducing classification rules, one is the divide and conquer approach, also known as the top down induction of decision trees; the other approach is called the separate and conquer approach. A considerable amount of work has been done on scaling up the divide and conquer approach. However, very little work has been conducted on scaling up the separate and conquer approach.In this work we describe a parallel framework that allows the parallelisation of a certain family of separate and conquer algorithms, the Prism family. Parallelisation helps the Prism family of algorithms to harvest additional computer resources in a network of computers in order to make the induction of classification rules scale better on large datasets. Our framework also incorporates a pre-pruning facility for parallel Prism algorithms.
Resumo:
Top Down Induction of Decision Trees (TDIDT) is the most commonly used method of constructing a model from a dataset in the form of classification rules to classify previously unseen data. Alternative algorithms have been developed such as the Prism algorithm. Prism constructs modular rules which produce qualitatively better rules than rules induced by TDIDT. However, along with the increasing size of databases, many existing rule learning algorithms have proved to be computational expensive on large datasets. To tackle the problem of scalability, parallel classification rule induction algorithms have been introduced. As TDIDT is the most popular classifier, even though there are strongly competitive alternative algorithms, most parallel approaches to inducing classification rules are based on TDIDT. In this paper we describe work on a distributed classifier that induces classification rules in a parallel manner based on Prism.
Resumo:
Induction of classification rules is one of the most important technologies in data mining. Most of the work in this field has concentrated on the Top Down Induction of Decision Trees (TDIDT) approach. However, alternative approaches have been developed such as the Prism algorithm for inducing modular rules. Prism often produces qualitatively better rules than TDIDT but suffers from higher computational requirements. We investigate approaches that have been developed to minimize the computational requirements of TDIDT, in order to find analogous approaches that could reduce the computational requirements of Prism.
Resumo:
The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.
Resumo:
In order to gain insights into events and issues that may cause errors and outages in parts of IP networks, intelligent methods that capture and express causal relationships online (in real-time) are needed. Whereas generalised rule induction has been explored for non-streaming data applications, its application and adaptation on streaming data is mostly undeveloped or based on periodic and ad-hoc training with batch algorithms. Some association rule mining approaches for streaming data do exist, however, they can only express binary causal relationships. This paper presents the ongoing work on Online Generalised Rule Induction (OGRI) in order to create expressive and adaptive rule sets real-time that can be applied to a broad range of applications, including network telemetry data streams.
Resumo:
This paper describes how the statistical technique of cluster analysis and the machine learning technique of rule induction can be combined to explore a database. The ways in which such an approach alleviates the problems associated with other techniques for data analysis are discussed. We report the results of experiments carried out on a database from the medical diagnosis domain. Finally we describe the future developments which we plan to carry out to build on our current work.
Resumo:
Peer reviewed
Resumo:
The Distributed Rule Induction (DRI) project at the University of Portsmouth is concerned with distributed data mining algorithms for automatically generating rules of all kinds. In this paper we present a system architecture and its implementation for inducing modular classification rules in parallel in a local area network using a distributed blackboard system. We present initial results of a prototype implementation based on the Prism algorithm.
Resumo:
In order to gain knowledge from large databases, scalable data mining technologies are needed. Data are captured on a large scale and thus databases are increasing at a fast pace. This leads to the utilisation of parallel computing technologies in order to cope with large amounts of data. In the area of classification rule induction, parallelisation of classification rules has focused on the divide and conquer approach, also known as the Top Down Induction of Decision Trees (TDIDT). An alternative approach to classification rule induction is separate and conquer which has only recently been in the focus of parallelisation. This work introduces and evaluates empirically a framework for the parallel induction of classification rules, generated by members of the Prism family of algorithms. All members of the Prism family of algorithms follow the separate and conquer approach.
Resumo:
Advances in hardware and software technologies allow to capture streaming data. The area of Data Stream Mining (DSM) is concerned with the analysis of these vast amounts of data as it is generated in real-time. Data stream classification is one of the most important DSM techniques allowing to classify previously unseen data instances. Different to traditional classifiers for static data, data stream classifiers need to adapt to concept changes (concept drift) in the stream in real-time in order to reflect the most recent concept in the data as accurately as possible. A recent addition to the data stream classifier toolbox is eRules which induces and updates a set of expressive rules that can easily be interpreted by humans. However, like most rule-based data stream classifiers, eRules exhibits a poor computational performance when confronted with continuous attributes. In this work, we propose an approach to deal with continuous data effectively and accurately in rule-based classifiers by using the Gaussian distribution as heuristic for building rule terms on continuous attributes. We show on the example of eRules that incorporating our method for continuous attributes indeed speeds up the real-time rule induction process while maintaining a similar level of accuracy compared with the original eRules classifier. We termed this new version of eRules with our approach G-eRules.
Resumo:
Examples from the Murray-Darling basin in Australia are used to illustrate different methods of disaggregation of reconnaissance-scale maps. One approach for disaggregation revolves around the de-convolution of the soil-landscape paradigm elaborated during a soil survey. The descriptions of soil ma units and block diagrams in a soil survey report detail soil-landscape relationships or soil toposequences that can be used to disaggregate map units into component landscape elements. Toposequences can be visualised on a computer by combining soil maps with digital elevation data. Expert knowledge or statistics can be used to implement the disaggregation. Use of a restructuring element and k-means clustering are illustrated. Another approach to disaggregation uses training areas to develop rules to extrapolate detailed mapping into other, larger areas where detailed mapping is unavailable. A two-level decision tree example is presented. At one level, the decision tree method is used to capture mapping rules from the training area; at another level, it is used to define the domain over which those rules can be extrapolated. (C) 2001 Elsevier Science B.V. All rights reserved.
Resumo:
The principle of using induction rules based on spatial environmental data to model a soil map has previously been demonstrated Whilst the general pattern of classes of large spatial extent and those with close association with geology were delineated small classes and the detailed spatial pattern of the map were less well rendered Here we examine several strategies to improve the quality of the soil map models generated by rule induction Terrain attributes that are better suited to landscape description at a resolution of 250 m are introduced as predictors of soil type A map sampling strategy is developed Classification error is reduced by using boosting rather than cross validation to improve the model Further the benefit of incorporating the local spatial context for each environmental variable into the rule induction is examined The best model was achieved by sampling in proportion to the spatial extent of the mapped classes boosting the decision trees and using spatial contextual information extracted from the environmental variables.
Resumo:
We re-mapped the soils of the Murray-Darling Basin (MDB) in 1995-1998 with a minimum of new fieldwork, making the most out of existing data. We collated existing digital soil maps and used inductive spatial modelling to predict soil types from those maps combined with environmental predictor variables. Lithology, Landsat Multi Spectral Scanner (Landsat MSS), the 9-s digital elevation model (DEM) of Australia and derived terrain attributes, all gridded to 250-m pixels, were the predictor variables. Because the basin-wide datasets were very large data mining software was used for modelling. Rule induction by data mining was also used to define the spatial domain of extrapolation for the extension of soil-landscape models from existing soil maps. Procedures to estimate the uncertainty associated with the predictions and quality of information for the new soil-landforms map of the MDB are described. (C) 2002 Elsevier Science B.V. All rights reserved.
Resumo:
This thesis proposes a framework for identifying the root-cause of a voltage disturbance, as well as, its source location (upstream/downstream) from the monitoring place. The framework works with three-phase voltage and current waveforms collected in radial distribution networks without distributed generation. Real-world and synthetic waveforms are used to test it. The framework involves features that are conceived based on electrical principles, and assuming some hypothesis on the analyzed phenomena. Features considered are based on waveforms and timestamp information. Multivariate analysis of variance and rule induction algorithms are applied to assess the amount of meaningful information explained by each feature, according to the root-cause of the disturbance and its source location. The obtained classification rates show that the proposed framework could be used for automatic diagnosis of voltage disturbances collected in radial distribution networks. Furthermore, the diagnostic results can be subsequently used for supporting power network operation, maintenance and planning.