90 results for Multi-relational data mining
Abstract:
Human brain imaging techniques, such as Magnetic Resonance Imaging (MRI) or Diffusion Tensor Imaging (DTI), have been established as scientific and diagnostic tools and their adoption is growing in popularity. Statistical methods, machine learning and data mining algorithms have been successfully adopted to extract predictive and descriptive models from neuroimaging data. However, the knowledge discovery process typically also requires pre-processing, post-processing and visualisation techniques in complex data workflows. A major obstacle to the integrated pre-processing and mining of MRI data is currently the lack of comprehensive platforms that avoid the manual invocation of pre-processing and mining tools, which leads to an error-prone and inefficient process. In this work we present K-Surfer, a novel plug-in for the Konstanz Information Miner (KNIME) workbench that automates the pre-processing of brain images and leverages the mining capabilities of KNIME in an integrated way. K-Surfer supports the importing, filtering, merging and pre-processing of neuroimaging data from FreeSurfer, a tool for human brain MRI feature extraction and interpretation. By automating the steps for importing FreeSurfer data, K-Surfer reduces time costs, eliminates human errors and enables the design of complex analytics workflows for neuroimaging data by leveraging the rich functionality available in the KNIME workbench.
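K-Surfer itself is a KNIME plug-in, but the kind of import step it automates can be sketched in a few lines of Python. The sketch below assumes FreeSurfer's standard aseg.stats layout, in which comment lines start with '#' and a '# ColHeaders' line names the whitespace-delimited columns; the subject path is hypothetical.

    def read_aseg_stats(path):
        """Parse a FreeSurfer aseg.stats file into a list of row dictionaries."""
        headers, rows = None, []
        with open(path) as fh:
            for line in fh:
                if line.startswith("# ColHeaders"):
                    headers = line.split()[2:]          # column names follow the marker
                elif headers and not line.startswith("#") and line.strip():
                    rows.append(dict(zip(headers, line.split())))
        return rows

    # e.g. map each segmented structure to its volume in mm^3 (path is hypothetical)
    stats = read_aseg_stats("subject01/stats/aseg.stats")
    volumes = {r["StructName"]: float(r["Volume_mm3"]) for r in stats}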
Abstract:
An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data poses unique challenges for data mining algorithms, such as concept drift, the need to analyse data on the fly because streams are unbounded, and the need for algorithms that scale to potentially high data throughput. Fast, real-time classification algorithms that adapt to concept drift do exist; however, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN is built on an adaptive statistical data summary in the form of Micro-Clusters. It is very fast and adaptive to concept drift whilst retaining the parallel properties of the underlying KNN classifier, and it is competitive with existing data stream classifiers in terms of accuracy and speed.
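The core idea can be conveyed in a much-simplified, illustrative Python sketch: micro-clusters keep a linear sum, a count and an error counter; a correctly matched example is absorbed, a mistake increments the error counter, and a cluster whose errors exceed a threshold is split. The published MC-NN also tracks attribute variances to decide how to split; that detail, the threshold value and the jitter-based split below are simplifications, not the authors' exact algorithm.

    import numpy as np

    class MicroCluster:
        def __init__(self, x, label):
            self.lin_sum = np.asarray(x, dtype=float)   # linear sum of absorbed examples
            self.n = 1                                  # number of absorbed examples
            self.label = label
            self.errors = 0                             # error counter that triggers splits

        @property
        def centroid(self):
            return self.lin_sum / self.n

    class MCNN:
        def __init__(self, max_errors=3):
            self.clusters, self.max_errors = [], max_errors

        def _nearest(self, x, clusters):
            return min(clusters, key=lambda c: np.linalg.norm(c.centroid - x))

        def predict(self, x):
            return self._nearest(x, self.clusters).label

        def train(self, x, label):
            same = [c for c in self.clusters if c.label == label]
            if not same:                                # first example of this class
                self.clusters.append(MicroCluster(x, label))
                return
            nearest = self._nearest(x, self.clusters)
            target = self._nearest(x, same)
            target.lin_sum += np.asarray(x, dtype=float)  # absorb into correct-class cluster
            target.n += 1
            if nearest.label != label:                  # misclassification: penalise
                nearest.errors += 1
                if nearest.errors > self.max_errors:    # split the struggling cluster
                    self.clusters.remove(nearest)
                    c = nearest.centroid
                    jitter = 0.01 * (np.abs(c) + 1)
                    self.clusters.append(MicroCluster(c - jitter, nearest.label))
                    self.clusters.append(MicroCluster(c + jitter, nearest.label))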
Abstract:
Systems Engineering often involves computer modelling of the behaviour of proposed systems and their components. Where a component is human, fallibility must be modelled by a stochastic agent. The identification of a model of decision-making over quantifiable options is investigated using the game domain of chess. Bayesian methods are used to infer the distribution of players' skill levels from the moves they play rather than from their competitive results. The approach is applied to large sets of games by players across a broad FIDE Elo range, and is in principle applicable to any scenario in which high-value decisions are made under pressure.
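A minimal sketch of the Bayesian machinery involved: a discrete prior over candidate Elo levels is updated, move by move, with the likelihood that a player of each level would have chosen the observed move. The softmax likelihood over engine evaluations below, and its sharpness schedule, are purely illustrative; the paper's actual move-choice model and parameters are not reproduced here.

    import numpy as np

    skills = np.linspace(1200, 2800, 9)                   # candidate Elo levels
    posterior = np.full(skills.size, 1.0 / skills.size)   # uniform prior

    def move_likelihood(evals, chosen, elo):
        """P(chosen move | skill): softmax over move evaluations,
        sharper (more optimal) for stronger players. Illustrative only."""
        beta = (elo - 1000.0) / 400.0                     # invented sharpness schedule
        w = np.exp(beta * (np.asarray(evals) - max(evals)))
        return w[chosen] / w.sum()

    def update(post, evals, chosen):
        like = np.array([move_likelihood(evals, chosen, s) for s in skills])
        post = post * like
        return post / post.sum()

    # One observed decision: engine scores (in pawns) for four legal moves,
    # of which the player chose the second-best.
    posterior = update(posterior, evals=[0.4, 0.3, -0.1, -0.8], chosen=1)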
Abstract:
Facilitating the visual exploration of scientific data has received increasing attention in the past decade or so. Especially in life-science application areas, the amount of available data has grown at a breathtaking pace. In this paper we describe an approach that allows for the visual inspection of large collections of molecular compounds. In contrast to classical visualizations of such spaces, we incorporate a specific focus of analysis, for example the outcome of a biological experiment such as high-throughput screening results. The presented method uses this experimental data to select molecular fragments of the underlying molecules that have interesting properties, and uses the resulting space to generate a two-dimensional map based on a singular value decomposition algorithm and a self-organizing map. Experiments on real datasets show that the resulting visual landscape groups molecules of similar chemical properties in densely connected regions.
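The two-stage mapping can be sketched as follows: a molecule-by-fragment count matrix is reduced by SVD, and a small self-organizing map then arranges the reduced vectors on a 2-D grid. The matrix below is random stand-in data, and the fragment-selection step driven by experimental outcomes is not shown.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(200, 50)).astype(float)  # molecules x fragment counts (stand-in)

    # Stage 1: SVD-based reduction of the fragment space to 10 latent dimensions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :10] * s[:10]

    # Stage 2: a tiny self-organizing map arranging molecules on an 8x8 grid.
    gw, gh, dim = 8, 8, Z.shape[1]
    w = rng.normal(size=(gw, gh, dim))
    grid = np.stack(np.meshgrid(np.arange(gw), np.arange(gh), indexing="ij"), axis=-1)
    for t in range(2000):
        x = Z[rng.integers(len(Z))]
        d = np.linalg.norm(w - x, axis=-1)
        bmu = np.unravel_index(d.argmin(), d.shape)      # best-matching unit
        lr = 0.5 * np.exp(-t / 1000)                     # decaying learning rate
        sigma = 3.0 * np.exp(-t / 1000)                  # decaying neighbourhood radius
        h = np.exp(-((grid - bmu) ** 2).sum(-1) / (2 * sigma ** 2))
        w += lr * h[..., None] * (x - w)

    # Each molecule's 2-D map position is the grid cell of its best-matching unit.
    coords = [np.unravel_index(np.linalg.norm(w - z, axis=-1).argmin(), (gw, gh)) for z in Z]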
Abstract:
The existence of endgame databases challenges us to extract higher-grade information and knowledge from their basic data content. Chess players, for example, would like simple and usable endgame theories if such a holy grail exists: endgame experts would like to provide such insights and to be inspired by computers to do so. Here, we investigate the use of artificial neural networks (NNs) to mine these databases and report on a first use of NNs on the KPK endgame. The results encourage us to suggest further work on chess applications of neural networks and other data-mining techniques.
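As an illustration of the kind of experiment meant, the sketch below trains a small feed-forward network to predict the value of KPK positions from piece coordinates. It assumes scikit-learn is available and that labelled positions (win or draw) have already been extracted from an endgame table; the two hand-coded examples are placeholders for that data, not real tablebase values.

    import numpy as np
    from sklearn.neural_network import MLPClassifier  # assumes scikit-learn

    def encode(wk, wp, bk):
        """Encode a KPK position (white king, pawn, black king on squares 0-63,
        White to move) as normalised file/rank coordinates."""
        return [c for sq in (wk, wp, bk) for c in (sq % 8 / 7.0, sq // 8 / 7.0)]

    # Placeholder training data; in practice X and y come from the KPK database.
    X = np.array([encode(12, 20, 28), encode(0, 8, 16)])
    y = np.array([1, 0])                              # 1 = win, 0 = draw

    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    net.fit(X, y)
    print(net.predict([encode(12, 20, 28)]))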
Abstract:
The van der Heijden Studies Database has been reviewed to identify 'draw' studies whose main lines contain sub-7-man positions that are not in fact draws. The data-mining method is described. Some 1,500 studies were faulted, 700 of them for the first time; 14 of the more interesting faults are highlighted and discussed.
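In outline, the check amounts to probing every sub-7-man position in a study's main line against an endgame table and flagging disagreements with the claimed result. The sketch below is not the authors' tooling: it uses the python-chess package and Syzygy WDL tables as stand-ins, and the path and FEN inputs are illustrative.

    import chess
    import chess.syzygy  # assumes python-chess and local Syzygy tables

    def faulted_draw_positions(mainline_fens, tb_path="./syzygy"):
        """Return main-line positions of a claimed 'draw' study whose
        tablebase value is not a draw (WDL != 0)."""
        faults = []
        with chess.syzygy.open_tablebase(tb_path) as tb:
            for fen in mainline_fens:
                board = chess.Board(fen)
                if len(board.piece_map()) <= 6 and tb.probe_wdl(board) != 0:
                    faults.append(fen)
        return faults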
Abstract:
The ground-surface net solar radiation is the energy that drives physical and chemical processes at the ground surface. In this paper, multi-spectral data from Landsat-5 TM, topographic data from a gridded digital elevation model, field measurements, and the atmospheric model LOWTRAN 7 are used to estimate surface net solar radiation over the FIFE site. First, an improved method is presented and used to calculate the total surface incoming radiation. Then, surface albedo is integrated from surface reflectance factors derived from the Landsat-5 TM remotely sensed data. Finally, surface net solar radiation is calculated by subtracting the surface upwelling radiation from the total surface incoming radiation.
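The final step is a simple shortwave energy balance: expressing the upwelling component through the integrated albedo, net radiation follows directly. A minimal sketch, assuming upwelling shortwave radiation is albedo times incoming (units W/m^2; the numbers are illustrative):

    def net_solar_radiation(incoming, albedo):
        """R_net = R_in - R_up, with shortwave upwelling R_up = albedo * R_in."""
        return (1.0 - albedo) * incoming

    net_solar_radiation(800.0, 0.23)   # e.g. 800 W/m^2 incoming, albedo 0.23 -> 616.0 W/m^2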
Abstract:
Many recent inverse scattering techniques have been designed for single-frequency scattered fields in the frequency domain. In practice, however, the data are collected in the time domain. Frequency-domain inverse scattering algorithms obviously apply to time-harmonic scattering, or nearly time-harmonic scattering, through application of the Fourier transform. Fourier transform techniques can also be applied to non-time-harmonic scattering from pulses. Our goal here is twofold: first, to establish conditions on the time-dependent waves that provide a correspondence between time-domain and frequency-domain inverse scattering via Fourier transforms, without recourse to the conventional limiting amplitude principle; second, to apply the analysis of the first part of this work to the extension of a particular scattering technique, namely the point source method, to scattering from the requisite pulses. Numerical examples illustrate the method and suggest that reconstructions from admissible pulses are superior to those obtained by straight averaging of multi-frequency data. Copyright (C) 2006 John Wiley & Sons, Ltd.
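Schematically, the correspondence in question is the following: if the scattered field is recorded in time and the incident pulse has spectrum \hat{\chi}, then each frequency component of the time-domain data yields a single-frequency scattering problem wherever the pulse spectrum does not vanish. In illustrative notation (not the paper's exact statement or conditions):

    \hat{u}^{s}(x,\omega) = \int_{-\infty}^{\infty} u^{s}(x,t)\, e^{i\omega t}\, dt,
    \qquad
    u^{s}_{\omega}(x) = \frac{\hat{u}^{s}(x,\omega)}{\hat{\chi}(\omega)}
    \quad \text{for } \hat{\chi}(\omega) \neq 0.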
Abstract:
This is a review of progress in the chess endgame field. It includes news of the promulgation of Endgame Tables (EGTs), their use, non-use and potential runtime creation; of data-mining achievements related to 7-man chess and to the field of Chess Studies; and of an algorithm to create EGTs for variants of the normal game of chess.
Abstract:
The k-Means algorithm for cluster analysis is among the most influential and popular data mining methods. Techniques for improving the efficiency of k-Means have largely been explored in two main directions. First, the amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree); these techniques reduce the number of distance computations the algorithm performs at each iteration. The second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance, an issue that has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed-memory systems and address its load-balancing issue. Three solutions have been developed and tested: two are based on a static partitioning of the data set, and a third incorporates a dynamic load-balancing policy.
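The flavour of the KD-Tree acceleration can be conveyed in a few lines. In the sketch below the tree indexes the current centroids, so each assignment step is a single nearest-neighbour query rather than a full distance matrix; the paper's sequential technique builds the tree over the data points instead, and its parallel, load-balanced formulation is not shown. Assumes SciPy is available.

    import numpy as np
    from scipy.spatial import cKDTree

    def kmeans_kdtree(X, k, iters=20, seed=0):
        """Lloyd-style k-Means with KD-Tree nearest-centroid assignment."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            _, labels = cKDTree(centroids).query(X)     # nearest centroid per point
            for j in range(k):                          # recompute centroids
                members = X[labels == j]
                if len(members):
                    centroids[j] = members.mean(axis=0)
        return centroids, labels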
Abstract:
In many data mining applications, automated retrieval of text and image information is needed. This becomes essential with the growth of the Internet and digital libraries. Our approach is based on latent semantic indexing (LSI) and the corresponding term-by-document matrix suggested by Berry and his co-authors. Instead of using deterministic methods to find the required number of first "k" singular triplets, we propose a stochastic approach. First, we use a Monte Carlo method to sample and build a much smaller term-by-document matrix (e.g. a k x k matrix), from which we then find the first "k" triplets using standard deterministic methods. Second, we investigate how to reduce the problem to finding the "k" largest eigenvalues using parallel Monte Carlo methods. We apply these methods to the initial matrix and also to the reduced one. The algorithms run on a cluster of workstations under MPI; results of experiments in the textual retrieval of Web documents, together with a comparison of the proposed stochastic methods, are presented. (C) 2003 IMACS. Published by Elsevier Science B.V. All rights reserved.
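A minimal sketch of the sampling idea, in the spirit of the paper rather than a reproduction of its method: columns of the term-by-document matrix are sampled with probability proportional to their squared norms and rescaled, and the small matrix's SVD then approximates the leading singular triplets of the original.

    import numpy as np

    def monte_carlo_svd(A, k, c, seed=0):
        """Approximate the top-k left singular triplets of A by
        length-squared sampling of c columns (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        p = (A ** 2).sum(axis=0)
        p = p / p.sum()                               # column sampling probabilities
        idx = rng.choice(A.shape[1], size=c, p=p)
        C = A[:, idx] / np.sqrt(c * p[idx])           # rescaled sampled columns
        U, s, _ = np.linalg.svd(C, full_matrices=False)
        return U[:, :k], s[:k]

    # Toy term-by-document matrix: 1000 terms x 5000 documents.
    A = np.random.default_rng(1).random((1000, 5000))
    U_k, s_k = monte_carlo_svd(A, k=10, c=200)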
Abstract:
This paper is concerned with the selection of inputs for classification models based on ratios of measured quantities. For this purpose, all possible ratios are built from the quantities involved and variable selection techniques are used to choose a convenient subset of ratios. In this context, two selection techniques are proposed: one based on a pre-selection procedure and another based on a genetic algorithm. In an example involving the financial distress prediction of companies, the models obtained from ratios selected by the proposed techniques compare favorably to a model using ratios usually found in the financial distress literature.
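The construction is easy to sketch: form all ordered pairwise ratios of the measured quantities, then select a subset. The filter-style pre-selection below, which ranks ratios by correlation with the class label, is a simple stand-in for the paper's two techniques; the genetic-algorithm variant is omitted, and the data are synthetic.

    import numpy as np
    from itertools import permutations

    def build_ratios(X, names):
        """All ordered pairwise ratios x_i / x_j of the measured quantities."""
        cols, labels = [], []
        for i, j in permutations(range(X.shape[1]), 2):
            cols.append(X[:, i] / X[:, j])
            labels.append(f"{names[i]}/{names[j]}")
        return np.column_stack(cols), labels

    def preselect(R, y, m):
        """Keep the m ratios most correlated with the class label."""
        corr = [abs(np.corrcoef(R[:, j], y)[0, 1]) for j in range(R.shape[1])]
        return np.argsort(corr)[::-1][:m]

    rng = np.random.default_rng(0)
    X = rng.random((100, 5)) + 0.1           # five measured quantities, kept positive
    y = (X[:, 0] / X[:, 1] > 1).astype(int)  # toy distress label
    R, labels = build_ratios(X, ["q1", "q2", "q3", "q4", "q5"])
    best = preselect(R, y, m=3)
    print([labels[j] for j in best])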
Abstract:
Technology-enhanced or Computer Aided Learning (e-learning) can be institutionally integrated and supported by learning management systems or Virtual Learning Environments (VLEs) to offer efficiency gains, effectiveness and scalability for the e-learning paradigm. However, this can only be achieved by integrating pedagogically intelligent approaches and lesson-preparation tools into a VLE that is well accepted by both students and teachers. This paper critically explores some of the issues relevant to the scalable routinisation of e-learning at the tertiary level, typically for first-year university undergraduates, taking the teaching of Relational Data Analysis (RDA), as supported by multimedia authoring, as a case study. The paper concludes that blended learning approaches, which balance the deployment of e-learning with other modalities of learning delivery such as instructor-mediated group learning, offer the most flexible and scalable route to e-learning. This, however, requires the graceful integration of platforms for multimedia production, distribution and delivery through advanced interactive spaces that provoke learner engagement and promote learning autonomy and group learning, facilitated by a cooperative-creative learning environment that remains open to personal exploration of constructivist-constructionist pathways to learning.
Abstract:
This review of recent developments starts with the publication of Harold van der Heijden's Study Database Edition IV, John Nunn's second trilogy on the endgame, and a range of endgame tables (EGTs) to the DTC, DTZ and DTZ50 metrics. It then summarises data-mining work by Eiko Bleicher and Guy Haworth in 2010. This used CQL and pgn2fen to find some 3,000 EGT-faulted studies in the database above, and the Type A (value-critical) and Type B-DTM (DTM-depth-critical) zugzwangs in the main lines of those studies. The same technique was used to mine Chessbase's BIG DATABASE 2010 to identify Type A/B zugzwangs, and to identify the pattern of value concession and DTM-depth concession in sub-7-man play.
Abstract:
Peak picking is a key early step in the analysis of mass spectrometry (MS) data. We compare three commonly used approaches to peak picking and discuss their merits by means of statistical analysis. The methods investigated encompass a signal-to-noise-ratio criterion, the continuous wavelet transform, and a correlation-based approach using a Gaussian template. The functionality of the three methods is illustrated and discussed in a practical context using a mass spectral data set created with MALDI-TOF technology. Sensitivity and specificity are investigated using a manually defined reference set of peaks. As an additional criterion, the robustness of the three methods is assessed by a perturbation analysis and illustrated using ROC curves.
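For illustration, all three families of methods can be exercised on a synthetic one-peak spectrum in a few lines of Python. This assumes SciPy is available; the MAD-based noise estimate, the thresholds and the peak width are arbitrary choices, not those of the paper.

    import numpy as np
    from scipy.signal import find_peaks_cwt

    rng = np.random.default_rng(0)
    x = np.arange(1000.0)
    spectrum = rng.normal(0, 1, x.size) + 40 * np.exp(-0.5 * ((x - 300) / 3) ** 2)

    # 1) Signal-to-noise ratio: points exceeding a multiple of a robust noise estimate.
    noise = np.median(np.abs(spectrum - np.median(spectrum))) / 0.6745   # MAD-based sigma
    snr_peaks = np.where(spectrum > 5 * noise)[0]

    # 2) Continuous wavelet transform peak picking.
    cwt_peaks = find_peaks_cwt(spectrum, widths=np.arange(1, 10))

    # 3) Correlation with a Gaussian template, then thresholding the response.
    t = np.arange(-10, 11)
    template = np.exp(-0.5 * (t / 3) ** 2)
    response = np.correlate(spectrum, template, mode="same") / template.sum()
    corr_peaks = np.where(response > 5 * noise)[0]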