749 results for Hydrological classification
Abstract:
The Distributed Rule Induction (DRI) project at the University of Portsmouth is concerned with distributed data mining algorithms for automatically generating rules of all kinds. In this paper we present a system architecture and its implementation for inducing modular classification rules in parallel in a local area network using a distributed blackboard system. We present initial results of a prototype implementation based on the Prism algorithm.
Abstract:
Top Down Induction of Decision Trees (TDIDT) is the most commonly used method of constructing a model from a dataset in the form of classification rules to classify previously unseen data. Alternative algorithms have been developed, such as the Prism algorithm. Prism constructs modular rules that are qualitatively better than the rules induced by TDIDT. However, with the increasing size of databases, many existing rule learning algorithms have proved to be computationally expensive on large datasets. To tackle the problem of scalability, parallel classification rule induction algorithms have been introduced. As TDIDT is the most popular classifier, even though there are strongly competitive alternative algorithms, most parallel approaches to inducing classification rules are based on TDIDT. In this paper we describe work on a distributed classifier that induces classification rules in a parallel manner based on Prism.
Abstract:
Induction of classification rules is one of the most important technologies in data mining. Most of the work in this field has concentrated on the Top Down Induction of Decision Trees (TDIDT) approach. However, alternative approaches have been developed such as the Prism algorithm for inducing modular rules. Prism often produces qualitatively better rules than TDIDT but suffers from higher computational requirements. We investigate approaches that have been developed to minimize the computational requirements of TDIDT, in order to find analogous approaches that could reduce the computational requirements of Prism.
Abstract:
Inducing rules from very large datasets is one of the most challenging areas in data mining. Several approaches exist for scaling up classification rule induction to large datasets, namely data reduction and the parallelisation of classification rule induction algorithms. In the area of parallelisation of classification rule induction algorithms, most of the work has concentrated on the Top Down Induction of Decision Trees (TDIDT), also known as the ‘divide and conquer’ approach. However, powerful alternative algorithms exist that induce modular rules. Most of these alternative algorithms follow the ‘separate and conquer’ approach of inducing rules, but very little work has been done to make the ‘separate and conquer’ approach scale better on large training data. This paper examines the potential of the recently developed blackboard based J-PMCRI methodology for parallelising modular classification rule induction algorithms that follow the ‘separate and conquer’ approach. A concrete implementation of the methodology is evaluated empirically on very large datasets.
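Illustrative sketch (not taken from the paper): the abstract names the ‘separate and conquer’ approach without describing it. The Python below is a minimal, serial sketch of a basic Prism-style loop for categorical data, assuming rows are given as dictionaries. The function name `prism`, the tie-breaking rule and the toy weather data are assumptions made for illustration; nothing here reflects the J-PMCRI parallelisation itself.

```python
from collections import Counter


def prism(rows, target):
    """Induce modular rules with a basic Prism-style separate-and-conquer loop.

    rows:   list of dicts mapping attribute name -> categorical value
    target: name of the class attribute
    Returns a list of (conditions, class_label) pairs.
    """
    rules = []
    attributes = [a for a in rows[0] if a != target]
    for label in sorted({r[target] for r in rows}):
        remaining = list(rows)  # each class starts from the full training set
        while any(r[target] == label for r in remaining):
            covered, conditions = remaining, {}
            # Conquer: specialise the rule until it covers only the target
            # class or every attribute has been used.
            while (any(r[target] != label for r in covered)
                   and len(conditions) < len(attributes)):
                best = None
                for attr in attributes:
                    if attr in conditions:
                        continue
                    totals = Counter(r[attr] for r in covered)
                    hits = Counter(r[attr] for r in covered
                                   if r[target] == label)
                    for value, n in totals.items():
                        # Rank by p(label | attr = value), then by coverage.
                        score = (hits[value] / n, hits[value])
                        if best is None or score > best[0]:
                            best = (score, attr, value)
                _, attr, value = best
                conditions[attr] = value
                covered = [r for r in covered if r[attr] == value]
            rules.append((conditions, label))
            # Separate: drop the instances covered by the new rule.
            remaining = [r for r in remaining
                         if not all(r[a] == v for a, v in conditions.items())]
    return rules


# Hypothetical toy data, for illustration only.
weather = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
rules = prism(weather, "play")
for conditions, label in rules:
    print(conditions, "->", label)
```

Each rule is induced independently of the others, which is why the resulting rule set need not fit into a single tree structure.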
Abstract:
The Prism family of algorithms induces modular classification rules which, in contrast to decision tree induction algorithms, do not necessarily fit together into a decision tree structure. Classifiers induced by Prism algorithms achieve accuracy comparable to that of decision trees and in some cases even outperform decision trees. Both kinds of algorithms tend to overfit on large and noisy datasets and this has led to the development of pruning methods. Pruning methods use various metrics to truncate decision trees or to eliminate whole rules or single rule terms from a Prism rule set. For decision trees many pre-pruning and post-pruning methods exist; however, for Prism algorithms only one pre-pruning method has been developed, J-pruning. Recent work with Prism algorithms examined J-pruning in the context of very large datasets and found that the current method does not use its full potential. This paper revisits the J-pruning method for the Prism family of algorithms and develops a new pruning method, Jmax-pruning, discusses it in theoretical terms and evaluates it empirically.
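Illustrative sketch (not taken from the paper): the abstract does not define the metric behind J-pruning. Assuming it is the J-measure of Smyth and Goodman, the information content of a rule "IF x THEN y" can be computed as below. The function name, the argument names and the example probabilities are assumptions, and the sketch does not attempt to reproduce the Jmax-pruning procedure itself.

```python
import math


def j_measure(p_x, p_y, p_y_given_x):
    """J-measure of a rule 'IF x THEN y'.

    p_x:         probability that the rule's left-hand side fires
    p_y:         prior probability of the class y (must satisfy 0 < p_y < 1)
    p_y_given_x: probability of y among the instances the rule covers
    """
    def term(posterior, prior):
        # By convention 0 * log(0) is taken to be 0.
        if posterior == 0:
            return 0.0
        return posterior * math.log2(posterior / prior)

    return p_x * (term(p_y_given_x, p_y) + term(1 - p_y_given_x, 1 - p_y))


# Hypothetical probabilities for a rule as terms are appended: coverage
# p_x shrinks while the rule becomes more accurate on what it covers.
for p_x, p_y_given_x in [(0.6, 0.6), (0.4, 0.8), (0.1, 1.0)]:
    print(p_x, p_y_given_x, round(j_measure(p_x, 0.3, p_y_given_x), 4))
```

A pre-pruning scheme built on this metric would compare the J-value before and after a term is appended, since a more specific rule can gain accuracy yet lose so much coverage that its J-value falls.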
Abstract:
The Prism family of algorithms induces modular classification rules, in contrast to the Top Down Induction of Decision Trees (TDIDT) approach, which induces classification rules in the intermediate form of a tree structure. Both approaches achieve a comparable classification accuracy. However, in some cases Prism outperforms TDIDT. For both approaches pre-pruning facilities have been developed in order to prevent the induced classifiers from overfitting on noisy datasets, by cutting rule terms or whole rules or by truncating decision trees according to certain metrics. There have been many pre-pruning mechanisms developed for the TDIDT approach, but for the Prism family the only existing pre-pruning facility is J-pruning. J-pruning not only works on Prism algorithms but also on TDIDT. Although it has been shown that J-pruning produces good results, this work points out that J-pruning does not use its full potential. The original J-pruning facility is examined and the use of a new pre-pruning facility, called Jmax-pruning, is proposed and evaluated empirically. A possible pre-pruning facility for TDIDT based on Jmax-pruning is also discussed.
Abstract:
Distributed and collaborative data stream mining in a mobile computing environment is referred to as Pocket Data Mining (PDM). The large number of available data streams that smart phones can subscribe to or sense, coupled with the increasing computational power of handheld devices, motivates the development of PDM as a decision making system. This emerging area of study has been shown to be feasible in an earlier study using the technological enablers of mobile software agents and stream mining techniques [1]. A typical PDM process would start by having mobile agents roam the network to discover relevant data streams and resources. Then other (mobile) agents encapsulating stream mining techniques visit the relevant nodes in the network in order to build evolving data mining models. Finally, a third type of mobile agent roams the network consulting the mining agents for a final collaborative decision, when required by one or more users. In this paper, we propose the use of distributed Hoeffding trees and Naive Bayes classifiers in the PDM framework over vertically partitioned data streams. Mobile policing, health monitoring and stock market analysis are among the possible applications of PDM. An extensive experimental study is reported showing the effectiveness of collaborative data mining with the two classifiers.
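Illustrative sketch (not taken from the paper): the abstract mentions Hoeffding trees without explaining how they decide when to split. A minimal sketch of the standard Hoeffding bound follows. The function names, the confidence parameter `delta` and the example gain values are assumptions for illustration, and nothing here is specific to the PDM framework.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon after n observations.

    With probability 1 - delta, the true mean of a variable with the given
    range lies within epsilon of the mean observed over n samples.
    """
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best attribute leads the runner-up by more than epsilon."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)


# Hypothetical numbers: information gain of the two best attributes at a
# leaf that has seen 1000 stream instances of a two-class problem.
print(hoeffding_bound(1.0, 1e-7, 1000))
print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))
```

The bound shrinks as more instances arrive, so a leaf that cannot yet separate its two best attributes simply waits for more data.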
Abstract:
The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.
Abstract:
Pocket Data Mining (PDM) describes the full process of analysing data streams in mobile ad hoc distributed environments. Advances in mobile devices like smart phones and tablet computers have made it possible for a wide range of applications to run in such an environment. In this paper, we propose the adoption of data stream classification techniques for PDM. A thorough experimental study provides evidence that running heterogeneous (different) or homogeneous (similar) data stream classification techniques over vertically partitioned data (data partitioned according to the feature space) results in performance comparable to that of batch and centralised learning techniques.
Abstract:
In order to gain knowledge from large databases, scalable data mining technologies are needed. Data are captured on a large scale and thus databases are growing at a fast pace. This leads to the utilisation of parallel computing technologies in order to cope with large amounts of data. In the area of classification rule induction, parallelisation has focused on the divide and conquer approach, also known as the Top Down Induction of Decision Trees (TDIDT). An alternative approach to classification rule induction is separate and conquer, which has only recently become a focus of parallelisation. This work introduces and evaluates empirically a framework for the parallel induction of classification rules generated by members of the Prism family of algorithms. All members of the Prism family of algorithms follow the separate and conquer approach.
Abstract:
Advances in hardware and software in the past decade make it possible to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence of these advances in order to cope with the real time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data of continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, some applications require the data to be analysed in real time, as soon as it is captured. This is the case, for example, when the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new and removing old rules. It differs from the more popular decision tree based classifiers in that it tends to leave data instances unclassified rather than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.
Abstract:
Generally classifiers tend to overfit if there is noise in the training data or there are missing values. Ensemble learning methods are often used to improve a classifier's classification accuracy. Most ensemble learning approaches aim to improve the classification accuracy of decision trees. However, alternative classifiers to decision trees exist. The recently developed Random Prism ensemble learner for classification aims to improve an alternative classification rule induction approach, the Prism family of algorithms, which addresses some of the limitations of decision trees. However, Random Prism suffers, like any ensemble learner, from a high computational overhead due to replication of the data and the induction of multiple base classifiers. Hence even modestly sized datasets may pose a computational challenge for ensemble learners such as Random Prism. Parallelism is often used to scale up algorithms to deal with large datasets. This paper investigates parallelisation for Random Prism, implements a prototype and evaluates it empirically using a Hadoop computing cluster.
Abstract:
The authors estimate climate warming–related twenty-first-century changes of moisture transports from the descending into the ascending regions in the tropics. Unlike previous studies that employ time and space averaging, here homogeneous high horizontal and vertical resolution data from an Intergovernmental Panel on Climate Change Fourth Assessment Report (IPCC AR4) climate model are used. This allows for estimating changes in much greater detail (e.g., the estimation of the distribution of ascending and descending regions, changes in the vertical profile, and separating changes of the inward and outward transports). Low-level inward and midlevel outward moisture transports of the convective regions in the tropics are found to increase in a simulated anthropogenically warmed climate as compared to a simulated twentieth-century atmosphere, indicating an intensification of the hydrological cycle. Since the absolute increase of inward transport exceeds the absolute increase of outward transport, the resulting budget is positive, meaning that more water is projected to converge in the moist tropics. The intensification is found to be due mainly to the higher amount of water in the atmosphere, while the contribution of weakening wind counteracts this response marginally. In addition, the changing statistical properties of the vertical profile of the moisture transport are investigated and the importance of the substantial outflow of moisture from the moist tropics at midlevels is demonstrated.
Abstract:
Robust and physically understandable responses of the global atmospheric water cycle to a warming climate are presented. By considering interannual responses to changes in surface temperature (T), observations and AMIP5 simulations agree on an increase in column integrated water vapor at the rate 7 %/K (in line with the Clausius-Clapeyron equation) and of precipitation at the rate 2-3 %/K (in line with energetic constraints). Using simple and complex climate models, we demonstrate that radiative forcing by greenhouse gases is currently suppressing global precipitation (P) at ~ -0.15 %/decade. Along with natural variability, this can explain why observed trends in global P over the period 1988-2008 are close to zero. Regional responses in the global water cycle are strongly constrained by changes in moisture fluxes. Model simulations show an increased moisture flux into the tropical wet region at 900 hPa and an enhanced outflow (of smaller magnitude) at around 600 hPa with warming. Moisture transport explains an increase in P in the wet tropical regions and small or negative changes in the dry regions of the subtropics in CMIP5 simulations of a warming climate. For AMIP5 simulations and satellite observations, the heaviest 5-day rainfall totals increase in intensity at ~15 %/K over the ocean with reductions at all percentiles over land. The climate change response in CMIP5 simulations shows consistent increases in P over ocean and land for the highest intensities, close to the Clausius-Clapeyron scaling of 7 %/K, while P declines for the lowest percentiles, indicating that interannual variability over land may not be a good proxy for climate change. The local changes in precipitation and its extremes are highly dependent upon small shifts in the large-scale atmospheric circulation and regional feedbacks.
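Illustrative sketch (not taken from the paper): the roughly 7 %/K figure quoted for the Clausius-Clapeyron scaling can be checked with a short calculation. The sketch below assumes the common approximation d(ln e_s)/dT = L_v / (R_v T^2) with textbook constants. The constant values, the function names and the 3 K warming example are assumptions, not values from the study.

```python
L_V = 2.5e6  # latent heat of vaporisation of water, J/kg (textbook value)
R_V = 461.5  # specific gas constant of water vapour, J/(kg K)


def cc_fractional_rate(temperature_k):
    """Fractional change of saturation vapour pressure per kelvin.

    Uses the approximation d(ln e_s)/dT = L_v / (R_v * T^2).
    """
    return L_V / (R_V * temperature_k ** 2)


def compound_increase(rate_per_k, warming_k):
    """Total fractional increase when a per-kelvin rate compounds."""
    return (1 + rate_per_k) ** warming_k - 1


for t in (273.15, 288.0, 300.0):
    print(f"T = {t:6.2f} K: {100 * cc_fractional_rate(t):.1f} %/K")

# A hypothetical 3 K warming at the quoted 7 %/K.
print(f"{100 * compound_increase(0.07, 3):.1f} % more column water vapour")
```

The rate falls slowly with temperature, from about 7 %/K near freezing to about 6 %/K in the warm tropics, so a single 7 %/K figure is a convenient round number rather than an exact constant.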
Abstract:
Whilst hydrological systems can show resilience to short-term streamflow deficiencies during within-year droughts, prolonged deficits during multi-year droughts are a significant threat to water resources security in Europe. This study uses a threshold-based objective classification of regional hydrological drought to qualitatively examine the characteristics, spatio-temporal evolution and synoptic climatic drivers of multi-year drought events in 1962–64, 1975–76 and 1995–97, on a European scale but with particular focus on the UK. Whilst all three events are multi-year, pan-European phenomena, their development and causes can be contrasted. The critical factor in explaining the unprecedented severity of the 1975–76 event is the consecutive occurrence of winter and summer drought. In contrast, 1962–64 was a succession of dry winters, mitigated by quiescent summers, whilst 1995–97 lacked spatial coherence and was interrupted by wet interludes. Synoptic climatic conditions vary within and between multi-year droughts, suggesting that regional factors modulate the climate signal in streamflow drought occurrence. Despite being underpinned by qualitatively similar climatic conditions and commonalities in evolution and characteristics, each of the three droughts has a unique spatio-temporal signature. An improved understanding of the spatio-temporal evolution and characteristics of multi-year droughts has much to contribute to monitoring and forecasting capability, and to improved mitigation strategies.
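Illustrative sketch (not taken from the paper): the abstract does not give the details of its threshold-based classification. One common way to build such an index, assumed here purely for illustration, is to flag each station as in deficit when its flow falls to or below a low-flow threshold and to call a regional drought when a large enough share of stations is in deficit. All function names, the exceedance level, the minimum fraction and the toy flow series are hypothetical.

```python
def low_flow_threshold(flows, exceedance=0.9):
    """Flow value that is equalled or exceeded roughly `exceedance` of the time."""
    ranked = sorted(flows)
    index = int(round((1 - exceedance) * (len(ranked) - 1)))
    return ranked[index]


def regional_deficit_index(station_flows, exceedance=0.9):
    """Fraction of stations at or below their own threshold at each time step."""
    thresholds = [low_flow_threshold(f, exceedance) for f in station_flows]
    n_steps = len(station_flows[0])
    return [
        sum(f[t] <= th for f, th in zip(station_flows, thresholds))
        / len(station_flows)
        for t in range(n_steps)
    ]


def drought_periods(index, min_fraction=0.5):
    """Return (start, end) time-step pairs where the index >= min_fraction."""
    periods, start = [], None
    for t, value in enumerate(index):
        if value >= min_fraction and start is None:
            start = t
        elif value < min_fraction and start is not None:
            periods.append((start, t - 1))
            start = None
    if start is not None:
        periods.append((start, len(index) - 1))
    return periods


# Two hypothetical stations with ten time steps each.
flows = [
    [5, 4, 1, 1, 5, 6, 7, 8, 9, 10],
    [6, 5, 2, 1, 6, 7, 8, 9, 10, 11],
]
index = regional_deficit_index(flows, exceedance=0.8)
print(index)
print(drought_periods(index, min_fraction=0.5))
```

Using a per-station threshold keeps rivers of very different size comparable, which is what allows a single regional index to be formed.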