58 results for "stopping rule"


Abstract:

In a world where massive amounts of data are recorded on a large scale, we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology to predict the classification of newly recorded data. However, alternative technologies have been derived that often produce better rules but do not scale well on large datasets. One such alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but does not scale well on large datasets. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and investigate its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale-up results of a first prototype implementation.

Abstract:

In a world where data is captured on a large scale, the major challenge for data mining algorithms is to be able to scale up to large datasets. There are two main approaches to inducing classification rules: one is the divide and conquer approach, also known as the top down induction of decision trees; the other is the separate and conquer approach. A considerable amount of work has been done on scaling up the divide and conquer approach. However, very little work has been conducted on scaling up the separate and conquer approach. In this work we describe a parallel framework that allows the parallelisation of a certain family of separate and conquer algorithms, the Prism family. Parallelisation helps the Prism family of algorithms to harvest additional computer resources in a network of computers in order to make the induction of classification rules scale better on large datasets. Our framework also incorporates a pre-pruning facility for parallel Prism algorithms.
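
The separate and conquer idea mentioned above can be illustrated with a small serial sketch. This is not the parallel framework the abstract describes, and it is not taken from the paper: it is a minimal Prism-style loop under the assumptions that all attributes are categorical and that every instance carries the same attributes. The names `covers` and `learn_rules_for_class` are hypothetical.

```python
from typing import Dict, List, Tuple

Instance = Tuple[Dict[str, str], str]  # (attribute values, class label)
Rule = Tuple[Dict[str, str], str]      # (conditions, predicted class)


def covers(conditions: Dict[str, str], attrs: Dict[str, str]) -> bool:
    """True if every condition attribute=value holds for the instance."""
    return all(attrs.get(a) == v for a, v in conditions.items())


def learn_rules_for_class(data: List[Instance], target: str) -> List[Rule]:
    """Separate and conquer: learn one rule, remove what it covers, repeat."""
    rules: List[Rule] = []
    remaining = list(data)
    while any(label == target for _, label in remaining):
        conditions: Dict[str, str] = {}
        subset = remaining
        # Conquer: specialise the rule until it covers only the target class
        # or no unused attribute is left.
        while any(label != target for _, label in subset):
            best, best_score = None, -1.0
            for attrs, _ in subset:
                for a, v in attrs.items():
                    if a in conditions:
                        continue
                    matched = [lab for at, lab in subset if at.get(a) == v]
                    # Fraction of matching instances that belong to the target class.
                    score = sum(lab == target for lab in matched) / len(matched)
                    if score > best_score:
                        best, best_score = (a, v), score
            if best is None:
                break  # attributes exhausted; accept an impure rule
            conditions[best[0]] = best[1]
            subset = [(at, lab) for at, lab in subset if covers(conditions, at)]
        rules.append((conditions, target))
        # Separate: drop the instances covered by the new rule.
        remaining = [(at, lab) for at, lab in remaining if not covers(conditions, at)]
    return rules
```

Calling `learn_rules_for_class` once per class, each time on the full training set, yields one modular rule set per class. The inner search over attribute-value pairs is the part that grows with dataset size, which is presumably where a parallel version would spend its effort.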

Abstract:

Top Down Induction of Decision Trees (TDIDT) is the most commonly used method of constructing a model from a dataset in the form of classification rules to classify previously unseen data. Alternative algorithms have been developed, such as the Prism algorithm. Prism constructs modular rules that are qualitatively better than rules induced by TDIDT. However, along with the increasing size of databases, many existing rule learning algorithms have proved to be computationally expensive on large datasets. To tackle the problem of scalability, parallel classification rule induction algorithms have been introduced. As TDIDT is the most popular classifier, even though there are strongly competitive alternative algorithms, most parallel approaches to inducing classification rules are based on TDIDT. In this paper we describe work on a distributed classifier that induces classification rules in a parallel manner based on Prism.

Abstract:

Induction of classification rules is one of the most important technologies in data mining. Most of the work in this field has concentrated on the Top Down Induction of Decision Trees (TDIDT) approach. However, alternative approaches have been developed such as the Prism algorithm for inducing modular rules. Prism often produces qualitatively better rules than TDIDT but suffers from higher computational requirements. We investigate approaches that have been developed to minimize the computational requirements of TDIDT, in order to find analogous approaches that could reduce the computational requirements of Prism.

Abstract:

The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.

Abstract:

Advances in hardware and software in the past decade make it possible to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence of these advances in order to cope with the real-time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data of continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, some applications require the data to be analysed in real time, as soon as it is captured. This is the case, for example, when the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new rules and removing old ones. It differs from the more popular decision tree based classifiers in that it tends to leave data instances unclassified rather than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.
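
Two behaviours described above, abstaining when no rule applies and discarding rules that stop being accurate, can be sketched as follows. This is not the eRules algorithm itself: the class names, the first-match policy, and the `min_accuracy` and `min_fired` thresholds are all assumptions made for illustration, and the induction of new rules is left out.

```python
from typing import Callable, Dict, List, Optional

Attrs = Dict[str, str]


class TrackedRule:
    """A rule plus running counts of how often it fired and was right."""

    def __init__(self, condition: Callable[[Attrs], bool], label: str) -> None:
        self.condition = condition
        self.label = label
        self.fired = 0
        self.correct = 0

    def accuracy(self) -> float:
        return self.correct / self.fired if self.fired else 1.0


class AbstainingRuleSet:
    def __init__(self, min_accuracy: float = 0.5, min_fired: int = 5) -> None:
        self.rules: List[TrackedRule] = []
        self.min_accuracy = min_accuracy  # hypothetical removal threshold
        self.min_fired = min_fired        # evidence needed before removal

    def predict(self, x: Attrs) -> Optional[str]:
        """Return the label of the first rule that fires, or None to abstain."""
        for rule in self.rules:
            if rule.condition(x):
                return rule.label
        return None

    def update(self, x: Attrs, true_label: str) -> None:
        """Score the rule that fired on a labelled instance, then prune."""
        for rule in self.rules:
            if rule.condition(x):
                rule.fired += 1
                rule.correct += int(rule.label == true_label)
                break
        # Keep rules that are still unproven or still accurate enough.
        self.rules = [
            r for r in self.rules
            if r.fired < self.min_fired or r.accuracy() >= self.min_accuracy
        ]
```

Returning `None` instead of a default class is the design choice that matters here: the caller can count abstentions separately from errors, which is how a "rather unclassified than wrong" classifier would typically be evaluated.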

Abstract:

Seamless phase II/III clinical trials combine traditional phases II and III into a single trial that is conducted in two stages, with stage 1 used to answer phase II objectives such as treatment selection and stage 2 used for the confirmatory analysis, which is a phase III objective. Although seamless phase II/III clinical trials are efficient because the confirmatory analysis includes phase II data from stage 1, inference can pose statistical challenges. In this paper, we consider point estimation following seamless phase II/III clinical trials in which stage 1 is used to select the most effective experimental treatment and to decide if, compared with a control, the trial should stop at stage 1 for futility. If the trial is not stopped, then the phase III confirmatory part of the trial involves evaluation of the selected most effective experimental treatment and the control. We have developed two new estimators for the treatment difference between these two treatments with the aim of reducing bias conditional on the treatment selection made and on the fact that the trial continues to stage 2. We have demonstrated the properties of these estimators using simulations.
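
Why point estimation is difficult in this design can be seen in a small simulation of the naive estimator. The sketch below does not reproduce the paper's two new estimators. It assumes unit-variance outcomes, equal sample sizes in both stages, true treatment differences of zero for every arm, and a hypothetical futility rule that stops the trial when the best stage 1 arm does not beat control; the function name and all parameter values are illustrative.

```python
import random
import statistics


def simulate_naive_bias(n_trials: int = 20000, n_arms: int = 3,
                        n_per_stage: int = 50, seed: int = 1) -> float:
    """Average naive estimate when every true treatment difference is zero."""
    rng = random.Random(seed)
    se = 1.0 / n_per_stage ** 0.5  # standard error of an arm mean, unit variance
    estimates = []
    for _ in range(n_trials):
        control_1 = rng.gauss(0.0, se)
        arms_1 = [rng.gauss(0.0, se) for _ in range(n_arms)]
        best_1 = max(arms_1)             # stage 1: pick the best-looking arm
        if best_1 - control_1 <= 0.0:    # assumed futility rule: stop at stage 1
            continue
        best_2 = rng.gauss(0.0, se)      # stage 2: selected arm and control only
        control_2 = rng.gauss(0.0, se)
        # Naive estimate pools both stages with equal weight (equal sample sizes).
        estimates.append((best_1 + best_2) / 2 - (control_1 + control_2) / 2)
    return statistics.mean(estimates)
```

Because the true difference is zero by construction, any positive value returned by `simulate_naive_bias()` is bias introduced by selecting the best arm and by conditioning on the trial continuing, the two sources the abstract names.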

Abstract:

This paper considers the use of Association Rule Mining (ARM) and our proposed Transaction based Rule Change Mining (TRCM) to identify the rule types present in tweets' hashtags over a specific consecutive period of time and their linkage to real-life occurrences. Our novel algorithm is termed TRCM-RTI, in reference to Rule Type Identification. We created Time Frame Windows (TFWs) to detect evolvement statuses and calculate the lifespan of hashtags in online tweets. We link RTI to real-life events by monitoring and recording rule evolvement patterns in TFWs on the Twitter network.
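
The general idea of mining association rules per time window and then comparing windows can be sketched as below. This is not TRCM-RTI: the restriction to single-hashtag rules, the function names, and the labels "appeared", "persisted" and "disappeared" are assumptions for illustration and are not the rule types defined in the paper. Support and confidence follow their standard ARM definitions.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Rule = Tuple[str, str]  # (antecedent hashtag, consequent hashtag)


def mine_rules(tweets: List[Set[str]], min_support: float,
               min_confidence: float) -> Dict[Rule, Tuple[float, float]]:
    """Return single-hashtag rules a -> b with their (support, confidence)."""
    n = len(tweets)
    tags = set().union(*tweets)
    rules: Dict[Rule, Tuple[float, float]] = {}
    for a, b in permutations(sorted(tags), 2):
        both = sum(1 for t in tweets if a in t and b in t)
        with_a = sum(1 for t in tweets if a in t)
        support = both / n
        confidence = both / with_a  # with_a > 0 because a occurs in some tweet
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


def compare_windows(old: Dict[Rule, Tuple[float, float]],
                    new: Dict[Rule, Tuple[float, float]]) -> Dict[str, Set[Rule]]:
    """Label rules by presence in two consecutive windows (illustrative labels)."""
    return {
        "appeared": set(new) - set(old),
        "persisted": set(new) & set(old),
        "disappeared": set(old) - set(new),
    }
```

Each element of `tweets` is the set of hashtags in one tweet, and each call to `mine_rules` covers one time window. Tracking how long a rule stays in the "persisted" set across successive windows gives a rough analogue of the hashtag lifespan the abstract refers to.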

Abstract:

Automatic generation of classification rules has been an increasingly popular technique in commercial applications such as Big Data analytics, rule based expert systems and decision making systems. However, a principal problem that arises with most methods for generation of classification rules is the overfitting of training data. With Big Data, this may result in the generation of a large number of complex rules. This may not only increase computational cost but also lower the accuracy in predicting further unseen instances. This has led to the necessity of developing pruning methods for the simplification of rules. In addition, once generated, classification rules are used to make predictions. As far as efficiency is concerned, the first rule that fires should be found as quickly as possible when searching through a rule set. Thus a suitable structure is required to represent the rule set effectively. In this chapter, the authors introduce a unified framework for construction of rule based classification systems consisting of three operations on Big Data: rule generation, rule simplification and rule representation. The authors also review some existing methods and techniques used for each of the three operations and highlight their limitations. They introduce some novel methods and techniques that they have developed recently. These methods and techniques are also discussed in comparison to existing ones with respect to efficient processing of Big Data.

Abstract:

Expert systems have become increasingly popular and commercially important. A rule based system is a special type of expert system, which consists of a set of 'if-then' rules and can be applied as a decision support system in many areas such as healthcare, transportation and security. Rule based systems can be constructed based on both expert knowledge and data. This paper aims to introduce the theory of rule based systems, especially the categorization and construction of such systems, from a conceptual point of view. This paper also introduces rule based systems for classification tasks in detail.

Abstract:

According to dual-system accounts of English past-tense processing, regular forms are decomposed into their stem and affix (played = play + ed) based on an implicit linguistic rule, whereas irregular forms (kept) are retrieved directly from the mental lexicon. In second language (L2) processing research, it has been suggested that L2 learners do not have rule-based decomposing abilities, so they process regular past-tense forms similarly to irregular ones (Silva & Clahsen 2008), without applying the morphological rule. The present study investigates morphological processing of regular and irregular verbs in Greek-English L2 learners and native English speakers. In a masked-priming experiment with regular and irregular prime-target verb pairs (played-play/kept-keep), native speakers showed priming effects for regular pairs, compared to unrelated pairs, indicating decomposition; conversely, L2 learners showed inhibitory effects. At the same time, both groups revealed priming effects for irregular pairs. We discuss these findings in the light of available theories on L2 morphological processing.

Abstract:

This chapter considers the possible use in armed conflict of low-yield (also known as tactical) nuclear weapons. The Legality of the Threat or Use of Nuclear Weapons Advisory Opinion maintained that it is a cardinal principle that a State must never make civilians an object of attack and must consequently never use weapons that are incapable of distinguishing between civilian and military targets. As international humanitarian law applies equally to any use of nuclear weapons, it is argued that there is no use of nuclear weapons that could spare civilian casualties, particularly in view of the long-term health and environmental effects of such weaponry.