838 resultados para text and data mining
Resumo:
GODIVA2 is a dynamic website that provides visual access to several terabytes of physically distributed, four-dimensional environmental data. It allows users to explore large datasets interactively without the need to install new software or download and understand complex data. Through the use of open international standards, GODIVA2 maintains a high level of interoperability with third-party systems, allowing diverse datasets to be mutually compared. Scientists can use the system to search for features in large datasets and to diagnose the output from numerical simulations and data processing algorithms. Data providers around Europe have adopted GODIVA2 as an INSPIRE-compliant dynamic quick-view system for providing visual access to their data.
Resumo:
Trace elements may present an environmental hazard in the vicinity of mining and smelting activities. However, the factors controlling trace element distribution in soils around ancient and modem mining and smelting areas are not always clear. Tharsis, Riotinto and Huelva are located in the Iberian Pyrite Belt in SW Spain. Tharsis and Riotinto mines have been exploited since 2500 B.C., with intensive smelting taking place. Huelva, established in 1970 and using the Flash Furnace Outokumpu process, is currently one of the largest smelter in the world. Pyrite and chalcopyrite ore have been intensively smelted for Cu. However, unusually for smelters and mines of a similar size, the elevated trace element concentrations in soils were found to be restricted to the immediate vicinity of the mines and smelters, being found up to a maximum of 2 kin from the mines and smelters at Tharsis, Riotinto and Huelva. Trace element partitioning (over 2/3 of trace elements found in the residual immobile fraction of soils at Tharsis) and soil particles examination by SEM-EDX showed that trace elements were not adsorbed onto soil particles, but were included within the matrix of large trace element-rich Fe silicate slag particles (i.e. 1 min circle divide at least 1 wt.% As, Cu and Zn, and 2 wt.% Pb). Slag particle large size (I mm 0) was found to control the geographically restricted trace element distribution in soils at Tharsis, Riotinto and Huelva, since large heavy particles could not have been transported long distances. Distribution and partitioning indicated that impacts to the environment as a result of mining and smelting should remain minimal in the region. (c) 2006 Elsevier B.V. All rights reserved.
Resumo:
Facilitating the visual exploration of scientific data has received increasing attention in the past decade or so. Especially in life science related application areas the amount of available data has grown at a breath taking pace. In this paper we describe an approach that allows for visual inspection of large collections of molecular compounds. In contrast to classical visualizations of such spaces we incorporate a specific focus of analysis, for example the outcome of a biological experiment such as high throughout screening results. The presented method uses this experimental data to select molecular fragments of the underlying molecules that have interesting properties and uses the resulting space to generate a two dimensional map based on a singular value decomposition algorithm and a self organizing map. Experiments on real datasets show that the resulting visual landscape groups molecules of similar chemical properties in densely connected regions.
Resumo:
Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of its size and dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K-clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
Resumo:
The van der Heijden Studies Database has been reviewed to identify 'Draw Studies' with sub-7-man positions in the main line which are not draws. The data-mining method is described. Some 1,500 studies were faulted, 700 for the first time: 14 of the more interesting faults are highlighted and discussed.
Resumo:
This is a review of progress in the Chess Endgame field. It includes news of the promulgation of Endgame Tables, their use, non-use and potential runtime creation. It includes news of data-mining achievements related to 7-man chess and to the field of Chess Studies. It includes news of an algorithm to create Endgame Tables for variants of the normal game of chess.
Progress on “Changing coastlines: data assimilation for morphodynamic prediction and predictability”
Resumo:
The task of assessing the likelihood and extent of coastal flooding is hampered by the lack of detailed information on near-shore bathymetry. This is required as an input for coastal inundation models, and in some cases the variability in the bathymetry can impact the prediction of those areas likely to be affected by flooding in a storm. The constant monitoring and data collection that would be required to characterise the near-shore bathymetry over large coastal areas is impractical, leaving the option of running morphodynamic models to predict the likely bathymetry at any given time. However, if the models are inaccurate the errors may be significant if incorrect bathymetry is used to predict possible flood risks. This project is assessing the use of data assimilation techniques to improve the predictions from a simple model, by rigorously incorporating observations of the bathymetry into the model, to bring the model closer to the actual situation. Currently we are concentrating on Morecambe Bay as a primary study site, as it has a highly dynamic inter-tidal zone, with changes in the course of channels in this zone impacting the likely locations of flooding from storms. We are working with SAR images, LiDAR, and swath bathymetry to give us the observations over a 2.5 year period running from May 2003 – November 2005. We have a LiDAR image of the entire inter-tidal zone for November 2005 to use as validation data. We have implemented a 3D-Var data assimilation scheme, to investigate the improvements in performance of the data assimilation compared to the previous scheme which was based on the optimal interpolation method. We are currently evaluating these different data assimilation techniques, using 22 SAR data observations. We will also include the LiDAR data and swath bathymetry to improve the observational coverage, and investigate the impact of different types of observation on the predictive ability of the model. We are also assessing the ability of the data assimilation scheme to recover the correct bathymetry after storm events, which can dramatically change the bathymetry in a short period of time.
Resumo:
One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.
Resumo:
Event-related functional magnetic resonance imaging (efMRI) has emerged as a powerful technique for detecting brains' responses to presented stimuli. A primary goal in efMRI data analysis is to estimate the Hemodynamic Response Function (HRF) and to locate activated regions in human brains when specific tasks are performed. This paper develops new methodologies that are important improvements not only to parametric but also to nonparametric estimation and hypothesis testing of the HRF. First, an effective and computationally fast scheme for estimating the error covariance matrix for efMRI is proposed. Second, methodologies for estimation and hypothesis testing of the HRF are developed. Simulations support the effectiveness of our proposed methods. When applied to an efMRI dataset from an emotional control study, our method reveals more meaningful findings than the popular methods offered by AFNI and FSL. (C) 2008 Elsevier B.V. All rights reserved.
Resumo:
In any data mining applications, automated text and text and image retrieval of information is needed. This becomes essential with the growth of the Internet and digital libraries. Our approach is based on the latent semantic indexing (LSI) and the corresponding term-by-document matrix suggested by Berry and his co-authors. Instead of using deterministic methods to find the required number of first "k" singular triplets, we propose a stochastic approach. First, we use Monte Carlo method to sample and to build much smaller size term-by-document matrix (e.g. we build k x k matrix) from where we then find the first "k" triplets using standard deterministic methods. Second, we investigate how we can reduce the problem to finding the "k"-largest eigenvalues using parallel Monte Carlo methods. We apply these methods to the initial matrix and also to the reduced one. The algorithms are running on a cluster of workstations under MPI and results of the experiments arising in textual retrieval of Web documents as well as comparison of the stochastic methods proposed are presented. (C) 2003 IMACS. Published by Elsevier Science B.V. All rights reserved.
Resumo:
This work analyzes the use of linear discriminant models, multi-layer perceptron neural networks and wavelet networks for corporate financial distress prediction. Although simple and easy to interpret, linear models require statistical assumptions that may be unrealistic. Neural networks are able to discriminate patterns that are not linearly separable, but the large number of parameters involved in a neural model often causes generalization problems. Wavelet networks are classification models that implement nonlinear discriminant surfaces as the superposition of dilated and translated versions of a single "mother wavelet" function. In this paper, an algorithm is proposed to select dilation and translation parameters that yield a wavelet network classifier with good parsimony characteristics. The models are compared in a case study involving failed and continuing British firms in the period 1997-2000. Problems associated with over-parameterized neural networks are illustrated and the Optimal Brain Damage pruning technique is employed to obtain a parsimonious neural model. The results, supported by a re-sampling study, show that both neural and wavelet networks may be a valid alternative to classical linear discriminant models.
Resumo:
This review of recent developments starts with the publication of Harold van der Heijden's Study Database Edition IV, John Nunn's second trilogy on the endgame, and a range of endgame tables (EGTs) to the DTC, DTZ and DTZ50 metrics. It then summarises data-mining work by Eiko Bleicher and Guy Haworth in 2010. This used CQL and pgn2fen to find some 3,000 EGT-faulted studies in the database above, and the Type A (value-critical) and Type B-DTM (DTM-depth-critical) zugzwangs in the mainlines of those studies. The same technique was used to mine Chessbase's BIG DATABASE 2010 to identify Type A/B zugzwangs, and to identify the pattern of value-concession and DTM-depth concession in sub-7-man play.
Resumo:
Peak picking is an early key step in MS data analysis. We compare three commonly used approaches to peak picking and discuss their merits by means of statistical analysis. Methods investigated encompass signal-to-noise ratio, continuous wavelet transform, and a correlation-based approach using a Gaussian template. Functionality of the three methods is illustrated and discussed in a practical context using a mass spectral data set created with MALDI-TOF technology. Sensitivity and specificity are investigated using a manually defined reference set of peaks. As an additional criterion, the robustness of the three methods is assessed by a perturbation analysis and illustrated using ROC curves.
Resumo:
Depreciation is a key element of understanding the returns from and price of commercial real estate. Understanding its impact is important for asset allocation models and asset management decisions. It is a key input into well-constructed pricing models and its impact on indices of commercial real estate prices needs to be recognised. There have been a number of previous studies of the impact of depreciation on real estate, particularly in the UK. Law (2004) analysed all of these studies and found that the seemingly consistent results were an illusion as they all used a variety of measurement methods and data. In addition, none of these studies examined impact on total returns; they examined either rental value depreciation alone or rental and capital value depreciation. This study seeks to rectify this omission, adopting the best practice measurement framework set out by Law (2004). Using individual property data from the UK Investment Property Databank for the 10-year period between 1994 and 2003, rental and capital depreciation, capital expenditure rates, and total return series for the data sample and for a benchmark are calculated for 10 market segments. The results are complicated by the period of analysis which started in the aftermath of the major UK real estate recession of the early 1990s, but they give important insights into the impact of depreciation in different segments of the UK real estate investment market.