77 results for data Mining


Relevance:

70.00% 70.00%

Publisher:

Abstract:

Recently, two approaches have been introduced that distribute the molecular fragment mining problem. The first approach applies a master/worker topology. The second approach, a completely distributed peer-to-peer system, solves the scalability problem caused by the bottleneck at the master node. However, in many real-world scenarios the participating computing nodes cannot communicate directly due to administrative policies such as security restrictions. Thus, potential computing power is not accessible to accelerate the mining run. To overcome this shortcoming, this work introduces a hierarchical topology of computing resources, which distributes the management over several levels and adapts to the natural structure of those multi-domain architectures. The most important aspect is the load balancing scheme, which has been designed and optimized for the hierarchical structure. The approach allows dynamic aggregation of heterogeneous computing resources and is applied to wide area network scenarios.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

In real-world applications, sequential algorithms for data mining and data exploration are often unsuitable for datasets of enormous size, high dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is a need to develop high-performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large-scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well-known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

In a world of almost permanent and rapidly increasing electronic data availability, techniques for filtering, compressing, and interpreting this data to transform it into valuable and easily comprehensible information are of utmost importance. One key topic in this area is the capability to deduce future system behavior from a given data input. This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data-based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. Chris Harris and his group have carried out pioneering work which has tied together the fields of neural networks and linguistic rule-based algorithms. This book is aimed at researchers and scientists in time series modeling, empirical data modeling, knowledge discovery, data mining, and data fusion.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Aircraft Maintenance, Repair and Overhaul (MRO) feedback commonly includes an engineer’s complex text-based inspection report. Capturing and normalizing the content of these textual descriptions is vital to cost and quality benchmarking, and provides information to facilitate continuous improvement of the MRO process and analytics. As data analysis and mining tools require highly normalized data, raw textual data is inadequate. This paper offers a text-mining solution to efficiently analyse bulk textual feedback data. Despite replacement of the same parts and/or sub-parts, the actual service cost for the same repair is often distinctly different from that of similar previous jobs. Regular expression algorithms were combined with an aircraft MRO glossary dictionary in order to help provide additional information concerning the reason for cost variation. Professional terms and conventions were included within the dictionary to avoid ambiguity and improve the outcome. Testing results show that most descriptive inspection reports can be appropriately interpreted, allowing extraction of highly normalized data. This additional normalized data strongly supports data analysis and data mining, whilst also increasing the accuracy of future quotation costing. This solution has been effectively used by a large aircraft MRO agency with positive results.
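
The combination of regular expressions and a glossary dictionary described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the glossary entries, the part-number pattern and the function name normalize_report are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical glossary: maps shorthand seen in reports to normalized terms.
GLOSSARY = {
    "corr": "corrosion",
    "crk": "crack",
    "repl": "replaced",
    "insp": "inspected",
}

# Hypothetical pattern for a part reference such as "P/N 1234-56".
PART_NUMBER = re.compile(r"\bP/N\s*([A-Z0-9-]+)", re.IGNORECASE)


def normalize_report(text):
    """Extract a normalized record from one free-text inspection report."""
    part_numbers = PART_NUMBER.findall(text)
    # Split the lower-cased report into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Keep only tokens the glossary knows, mapped to their normalized form.
    findings = sorted({GLOSSARY[t] for t in tokens if t in GLOSSARY})
    return {"part_numbers": part_numbers, "findings": findings}


record = normalize_report("Insp. blade P/N 1234-56, corr found; repl seal")
print(record)
```

A real glossary would need to handle multi-word terms and ambiguous abbreviations, which is the role the abstract assigns to professional terms and conventions in the dictionary.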

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Advances in hardware and software in the past decade make it possible to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence of these advances in order to cope with the real-time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data of continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, some applications require the data to be analysed in real time as soon as it is captured, for example when the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule-based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new and removing old rules. It differs from the more popular decision tree based classifiers in that it tends to leave data instances unclassified rather than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.
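
The two behaviours highlighted in the abstract above, abstaining when no rule applies and removing rules that no longer fit the stream, can be sketched as follows. This is an illustration only and not the eRules algorithm: the classes Rule and AbstainingRuleSet, the thresholds max_error_rate and min_hits, and the pruning criterion are all assumptions, and the induction of new rules is omitted.

```python
class Rule:
    """A conjunction of feature == value tests that predicts one label."""

    def __init__(self, conditions, label):
        self.conditions = conditions  # {feature name: required value}
        self.label = label
        self.hits = 0    # labelled instances this rule has been scored on
        self.errors = 0  # how many of those it got wrong

    def covers(self, instance):
        return all(instance.get(f) == v for f, v in self.conditions.items())


class AbstainingRuleSet:
    def __init__(self, rules, max_error_rate=0.5, min_hits=2):
        self.rules = list(rules)
        self.max_error_rate = max_error_rate  # assumed pruning threshold
        self.min_hits = min_hits              # evidence needed before pruning

    def classify(self, instance):
        """Return the label of the first covering rule, or None to abstain."""
        for rule in self.rules:
            if rule.covers(instance):
                return rule.label
        return None

    def update(self, instance, true_label):
        """Score the first covering rule, then drop rules that err too often."""
        for rule in self.rules:
            if rule.covers(instance):
                rule.hits += 1
                if rule.label != true_label:
                    rule.errors += 1
                break
        self.rules = [
            r for r in self.rules
            if r.hits < self.min_hits
            or r.errors / r.hits <= self.max_error_rate
        ]
```

An instance that no rule covers yields None, so the caller can count it as unclassified instead of receiving a forced guess.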

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploit the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operation which hinders the scalability of the approach. This work studies a different parallel formulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature of the centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.
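
To make the global reduction step concrete, here is a minimal sketch of the straightforward parallel formulation mentioned in the abstract above (not the proposed communication-free variant). The processing nodes are simulated as plain lists of points in one process, and the function names local_step, global_reduce and kmeans_iteration are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def local_step(points, centroids):
    """Work done on one node: per-cluster coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for pt in points:
        k = nearest(pt, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += pt[d]
    return sums, counts


def global_reduce(partials, centroids):
    """Combine every node's partial result; this step needs global communication."""
    dim = len(centroids[0])
    new_centroids = []
    for k, old in enumerate(centroids):
        n = sum(counts[k] for _, counts in partials)
        if n == 0:
            new_centroids.append(list(old))  # empty cluster: keep old centroid
            continue
        new_centroids.append(
            [sum(sums[k][d] for sums, _ in partials) / n for d in range(dim)]
        )
    return new_centroids


def kmeans_iteration(partitions, centroids):
    """One k-means iteration over data split across simulated nodes."""
    partials = [local_step(part, centroids) for part in partitions]
    return global_reduce(partials, centroids)
```

In a real deployment global_reduce would be a collective operation across all nodes at every iteration, which is the scalability obstacle the abstract describes.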

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Advances in hardware technologies make it possible to capture and process data in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, where data instances arrive at a fast rate. This is problematic if an application requires little or no delay between changes in the patterns of the stream and absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non-streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
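
One common way to make KNN adaptive on a stream is to keep only the most recent labelled instances in a bounded window. The sketch below illustrates that idea under stated assumptions; the abstract above does not say that the paper uses a sliding window, and the class name SlidingWindowKNN and its parameters are hypothetical.

```python
from collections import Counter, deque


class SlidingWindowKNN:
    """KNN over the most recent window_size labelled instances of a stream."""

    def __init__(self, k=3, window_size=100):
        self.k = k
        # A bounded deque drops the oldest instance automatically when full.
        self.window = deque(maxlen=window_size)

    def learn(self, features, label):
        self.window.append((tuple(features), label))

    def predict(self, features):
        if not self.window:
            return None  # nothing seen yet
        by_distance = sorted(
            self.window,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], features)),
        )
        votes = Counter(label for _, label in by_distance[: self.k])
        return votes.most_common(1)[0][0]
```

Because each distance computation is independent of the others, the window could be split across workers and the local nearest neighbours merged, which is one reason KNN is a natural basis for parallelisation.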

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Human brain imaging techniques, such as Magnetic Resonance Imaging (MRI) or Diffusion Tensor Imaging (DTI), have been established as scientific and diagnostic tools and their adoption is growing. Statistical methods, machine learning and data mining algorithms have successfully been adopted to extract predictive and descriptive models from neuroimage data. However, the knowledge discovery process typically also requires the adoption of pre-processing, post-processing and visualisation techniques in complex data workflows. Currently, a main problem for the integrated preprocessing and mining of MRI data is the lack of comprehensive platforms able to avoid the manual invocation of preprocessing and mining tools, which leads to an error-prone and inefficient process. In this work we present K-Surfer, a novel plug-in of the Konstanz Information Miner (KNIME) workbench that automates the preprocessing of brain images and leverages the mining capabilities of KNIME in an integrated way. K-Surfer supports the importing, filtering, merging and pre-processing of neuroimage data from FreeSurfer, a tool for human brain MRI feature extraction and interpretation. K-Surfer automates the steps for importing FreeSurfer data, reducing time costs, eliminating human errors and enabling the design of complex analytics workflows for neuroimage data by leveraging the rich functionalities available in the KNIME workbench.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data imposes unique challenges on data mining algorithms, such as concept drifts, the need to analyse the data on the fly due to unbounded data streams, and the need for scalable algorithms due to potentially high throughput of data. Real-time classification algorithms that are fast and adaptive to concept drifts exist. However, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN is based on an adaptive statistical data summary built from Micro-Clusters. MC-NN is very fast and adaptive to concept drift whilst maintaining the parallel properties of the base KNN classifier. MC-NN is also competitive with existing data stream classifiers in terms of accuracy and speed.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Systems Engineering often involves computer modelling of the behaviour of proposed systems and their components. Where a component is human, fallibility must be modelled by a stochastic agent. The identification of a model of decision-making over quantifiable options is investigated using the game domain of Chess. Bayesian methods are used to infer the distribution of players’ skill levels from the moves they play rather than from their competitive results. The approach is used on large sets of games by players across a broad FIDE Elo range, and is in principle applicable to any scenario where high-value decisions are being made under pressure.
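
Inferring skill from moves rather than results amounts to a Bayes update over candidate skill values after each observed move. The sketch below illustrates the idea only. The likelihood model, a softmax over the evaluation loss of each legal move, is an assumption and is not claimed to be the model used in the paper; the function names and the discrete set of skill values are likewise hypothetical.

```python
import math


def move_likelihood(chosen, losses, skill):
    """P(chosen move | skill) under an assumed softmax model.

    losses[i] is how much worse move i is than the best move (0 for the best).
    A higher skill value concentrates probability on low-loss moves.
    """
    weights = [math.exp(-skill * loss) for loss in losses]
    return weights[chosen] / sum(weights)


def update_posterior(prior, chosen, losses):
    """One Bayes update; prior maps candidate skill value -> probability."""
    unnormalized = {
        s: p * move_likelihood(chosen, losses, s) for s, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {s: u / total for s, u in unnormalized.items()}


# Two candidate skill levels, equally likely before any move is seen.
prior = {0.0: 0.5, 2.0: 0.5}
# The player picks the best of two moves, so the higher skill gains weight.
posterior = update_posterior(prior, chosen=0, losses=[0.0, 1.0])
print(posterior)
```

Repeating the update over every move in a set of games yields a distribution over skill that does not depend on the game results.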

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Facilitating the visual exploration of scientific data has received increasing attention in the past decade or so. Especially in life science related application areas, the amount of available data has grown at a breathtaking pace. In this paper we describe an approach that allows for visual inspection of large collections of molecular compounds. In contrast to classical visualizations of such spaces, we incorporate a specific focus of analysis, for example the outcome of a biological experiment such as high-throughput screening results. The presented method uses this experimental data to select molecular fragments of the underlying molecules that have interesting properties and uses the resulting space to generate a two-dimensional map based on a singular value decomposition algorithm and a self-organizing map. Experiments on real datasets show that the resulting visual landscape groups molecules of similar chemical properties in densely connected regions.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of its size and dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K-clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

The existence of endgame databases challenges us to extract higher-grade information and knowledge from their basic data content. Chess players, for example, would like simple and usable endgame theories, if such a holy grail exists; endgame experts would like to provide such insights and be inspired by computers to do so. Here, we investigate the use of artificial neural networks (NNs) to mine these databases and we report on a first use of NNs on KPK. The results encourage us to suggest further work on chess applications of neural networks and other data-mining techniques.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

The van der Heijden Studies Database has been reviewed to identify 'Draw Studies' with sub-7-man positions in the main line which are not draws. The data-mining method is described. Some 1,500 studies were faulted, 700 for the first time: 14 of the more interesting faults are highlighted and discussed.