775 resultados para Data Mining, Rough Sets, Multi-Dimension, Association Rules, Constraint
Resumo:
This is a review of progress in the Chess Endgame field. It includes news of the promulgation of Endgame Tables, their use, non-use and potential runtime creation. It includes news of data-mining achievements related to 7-man chess and to the field of Chess Studies. It includes news of an algorithm to create Endgame Tables for variants of the normal game of chess.
Resumo:
In real world applications sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high-dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is the necessity to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.
Resumo:
This work analyzes the use of linear discriminant models, multi-layer perceptron neural networks and wavelet networks for corporate financial distress prediction. Although simple and easy to interpret, linear models require statistical assumptions that may be unrealistic. Neural networks are able to discriminate patterns that are not linearly separable, but the large number of parameters involved in a neural model often causes generalization problems. Wavelet networks are classification models that implement nonlinear discriminant surfaces as the superposition of dilated and translated versions of a single "mother wavelet" function. In this paper, an algorithm is proposed to select dilation and translation parameters that yield a wavelet network classifier with good parsimony characteristics. The models are compared in a case study involving failed and continuing British firms in the period 1997-2000. Problems associated with over-parameterized neural networks are illustrated and the Optimal Brain Damage pruning technique is employed to obtain a parsimonious neural model. The results, supported by a re-sampling study, show that both neural and wavelet networks may be a valid alternative to classical linear discriminant models.
Resumo:
This review of recent developments starts with the publication of Harold van der Heijden's Study Database Edition IV, John Nunn's second trilogy on the endgame, and a range of endgame tables (EGTs) to the DTC, DTZ and DTZ50 metrics. It then summarises data-mining work by Eiko Bleicher and Guy Haworth in 2010. This used CQL and pgn2fen to find some 3,000 EGT-faulted studies in the database above, and the Type A (value-critical) and Type B-DTM (DTM-depth-critical) zugzwangs in the mainlines of those studies. The same technique was used to mine Chessbase's BIG DATABASE 2010 to identify Type A/B zugzwangs, and to identify the pattern of value-concession and DTM-depth concession in sub-7-man play.
Resumo:
In a world of almost permanent and rapidly increasing electronic data availability, techniques of filtering, compressing, and interpreting this data to transform it into valuable and easily comprehensible information is of utmost importance. One key topic in this area is the capability to deduce future system behavior from a given data input. This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data-based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. Chris Harris and his group have carried out pioneering work which has tied together the fields of neural networks and linguistic rule-based algortihms. This book is aimed at researchers and scientists in time series modeling, empirical data modeling, knowledge discovery, data mining, and data fusion.
Resumo:
Van der Heijden’s ENDGAME STUDY DATABASE IV, HHDBIV, is the definitive collection of 76,132 chess studies. The zugzwang position or zug, one in which the side to move would prefer not to, is a frequent theme in the literature of chess studies. In this third data-mining of HHDBIV, we report on the occurrence of sub-7-man zugs there as discovered by the use of CQL and Nalimov endgame tables (EGTs). We also mine those Zugzwang Studies in which a zug more significantly appears in both its White-to-move (wtm) and Black-to-move (btm) forms. We provide some illustrative and extreme examples of zugzwangs in studies.
Resumo:
Aircraft Maintenance, Repair and Overhaul (MRO) feedback commonly includes an engineer’s complex text-based inspection report. Capturing and normalizing the content of these textual descriptions is vital to cost and quality benchmarking, and provides information to facilitate continuous improvement of MRO process and analytics. As data analysis and mining tools requires highly normalized data, raw textual data is inadequate. This paper offers a textual-mining solution to efficiently analyse bulk textual feedback data. Despite replacement of the same parts and/or sub-parts, the actual service cost for the same repair is often distinctly different from similar previously jobs. Regular expression algorithms were incorporated with an aircraft MRO glossary dictionary in order to help provide additional information concerning the reason for cost variation. Professional terms and conventions were included within the dictionary to avoid ambiguity and improve the outcome of the result. Testing results show that most descriptive inspection reports can be appropriately interpreted, allowing extraction of highly normalized data. This additional normalized data strongly supports data analysis and data mining, whilst also increasing the accuracy of future quotation costing. This solution has been effectively used by a large aircraft MRO agency with positive results.
Resumo:
In a world where massive amounts of data are recorded on a large scale we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology to predict the classification of newly recorded data. However alternative technologies have been derived that often produce better rules but do not scale well on large datasets. Such an alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but does not scale well on large datasets. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and investigate its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale up results of a first prototype implementation.
Resumo:
In a world where data is captured on a large scale the major challenge for data mining algorithms is to be able to scale up to large datasets. There are two main approaches to inducing classification rules, one is the divide and conquer approach, also known as the top down induction of decision trees; the other approach is called the separate and conquer approach. A considerable amount of work has been done on scaling up the divide and conquer approach. However, very little work has been conducted on scaling up the separate and conquer approach.In this work we describe a parallel framework that allows the parallelisation of a certain family of separate and conquer algorithms, the Prism family. Parallelisation helps the Prism family of algorithms to harvest additional computer resources in a network of computers in order to make the induction of classification rules scale better on large datasets. Our framework also incorporates a pre-pruning facility for parallel Prism algorithms.
Resumo:
Induction of classification rules is one of the most important technologies in data mining. Most of the work in this field has concentrated on the Top Down Induction of Decision Trees (TDIDT) approach. However, alternative approaches have been developed such as the Prism algorithm for inducing modular rules. Prism often produces qualitatively better rules than TDIDT but suffers from higher computational requirements. We investigate approaches that have been developed to minimize the computational requirements of TDIDT, in order to find analogous approaches that could reduce the computational requirements of Prism.
Resumo:
Global communicationrequirements andloadimbalanceof someparalleldataminingalgorithms arethe major obstacles to exploitthe computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication costin parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operationwhichhinders thescalabilityoftheapproach.Thisworkstudiesadifferentparallelformulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature ofthe centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means ofmulti-dimensional binary searchtrees. The approachcanalso be extended to accommodate an approximation error which allows a further reduction ofthe communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing element
Resumo:
The link between the Pacific/North American pattern (PNA) and the North Atlantic Oscillation (NAO) is investigated in reanalysis data (NCEP, ERA40) and multi-century CGCM runs for present day climate using three versions of the ECHAM model. PNA and NAO patterns and indices are determined via rotated principal component analysis on monthly mean 500 hPa geopotential height fields using the varimax criteria. On average, the multi-century CGCM simulations show a significant anti-correlation between PNA and NAO. Further, multi-decadal periods with significantly enhanced (high anti-correlation, active phase) or weakened (low correlations, inactive phase) coupling are found in all CGCMs. In the simulated active phases, the storm track activity near Newfoundland has a stronger link with the PNA variability than during the inactive phases. On average, the reanalysis datasets show no significant anti-correlation between PNA and NAO indices, but during the sub-period 1973–1994 a significant anti-correlation is detected, suggesting that the present climate could correspond to an inactive period as detected in the CGCMs. An analysis of possible physical mechanisms suggests that the link between the patterns is established by the baroclinic waves forming the North Atlantic storm track. The geopotential height anomalies associated with negative PNA phases induce an increased advection of warm and moist air from the Gulf of Mexico and cold air from Canada. Both types of advection contribute to increase baroclinicity over eastern North America and also to increase the low level latent heat content of the warm air masses. Thus, growth conditions for eddies at the entrance of the North Atlantic storm track are enhanced. Considering the average temporal development during winter for the CGCM, results show an enhanced Newfoundland storm track maximum in the early winter for negative PNA, followed by a downstream enhancement of the Atlantic storm track in the subsequent months. In active (passive) phases, this seasonal development is enhanced (suppressed). As the storm track over the central and eastern Atlantic is closely related to the NAO variability, this development can be explained by the shift of the NAO index to more positive values.
Resumo:
Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploit the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in iterative parallel data mining algorithms. In particular, the analysis focuses on one of the most influential and popular data mining methods, the k-means algorithm for cluster analysis. The straightforward parallel formulation of the k-means algorithm requires a global reduction operation at each iteration step, which hinders its scalability. This work studies a different parallel formulation of the algorithm where the requirement of global communication can be relaxed while still providing the exact solution of the centralised k-means algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs.
Resumo:
Advances in hardware technologies allow to capture and process data in real-time and the resulting high throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real-time. The creation and real-time adaption of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even so these algorithms are fast, they are challenged by high velocity data streams, where data instances are incoming at a fast rate. This is problematic if the applications desire that there is no or only a very little delay between changes in the patterns of the stream and absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.