62 resultados para Data mining methods
Resumo:
In any data mining applications, automated text and text and image retrieval of information is needed. This becomes essential with the growth of the Internet and digital libraries. Our approach is based on the latent semantic indexing (LSI) and the corresponding term-by-document matrix suggested by Berry and his co-authors. Instead of using deterministic methods to find the required number of first "k" singular triplets, we propose a stochastic approach. First, we use Monte Carlo method to sample and to build much smaller size term-by-document matrix (e.g. we build k x k matrix) from where we then find the first "k" triplets using standard deterministic methods. Second, we investigate how we can reduce the problem to finding the "k"-largest eigenvalues using parallel Monte Carlo methods. We apply these methods to the initial matrix and also to the reduced one. The algorithms are running on a cluster of workstations under MPI and results of the experiments arising in textual retrieval of Web documents as well as comparison of the stochastic methods proposed are presented. (C) 2003 IMACS. Published by Elsevier Science B.V. All rights reserved.
Resumo:
Peak picking is an early key step in MS data analysis. We compare three commonly used approaches to peak picking and discuss their merits by means of statistical analysis. Methods investigated encompass signal-to-noise ratio, continuous wavelet transform, and a correlation-based approach using a Gaussian template. Functionality of the three methods is illustrated and discussed in a practical context using a mass spectral data set created with MALDI-TOF technology. Sensitivity and specificity are investigated using a manually defined reference set of peaks. As an additional criterion, the robustness of the three methods is assessed by a perturbation analysis and illustrated using ROC curves.
Resumo:
In a world of almost permanent and rapidly increasing electronic data availability, techniques of filtering, compressing, and interpreting this data to transform it into valuable and easily comprehensible information is of utmost importance. One key topic in this area is the capability to deduce future system behavior from a given data input. This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data-based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. Chris Harris and his group have carried out pioneering work which has tied together the fields of neural networks and linguistic rule-based algortihms. This book is aimed at researchers and scientists in time series modeling, empirical data modeling, knowledge discovery, data mining, and data fusion.
Resumo:
Aircraft Maintenance, Repair and Overhaul (MRO) feedback commonly includes an engineer’s complex text-based inspection report. Capturing and normalizing the content of these textual descriptions is vital to cost and quality benchmarking, and provides information to facilitate continuous improvement of MRO process and analytics. As data analysis and mining tools requires highly normalized data, raw textual data is inadequate. This paper offers a textual-mining solution to efficiently analyse bulk textual feedback data. Despite replacement of the same parts and/or sub-parts, the actual service cost for the same repair is often distinctly different from similar previously jobs. Regular expression algorithms were incorporated with an aircraft MRO glossary dictionary in order to help provide additional information concerning the reason for cost variation. Professional terms and conventions were included within the dictionary to avoid ambiguity and improve the outcome of the result. Testing results show that most descriptive inspection reports can be appropriately interpreted, allowing extraction of highly normalized data. This additional normalized data strongly supports data analysis and data mining, whilst also increasing the accuracy of future quotation costing. This solution has been effectively used by a large aircraft MRO agency with positive results.
Resumo:
Data assimilation is predominantly used for state estimation; combining observational data with model predictions to produce an updated model state that most accurately approximates the true system state whilst keeping the model parameters fixed. This updated model state is then used to initiate the next model forecast. Even with perfect initial data, inaccurate model parameters will lead to the growth of prediction errors. To generate reliable forecasts we need good estimates of both the current system state and the model parameters. This paper presents research into data assimilation methods for morphodynamic model state and parameter estimation. First, we focus on state estimation and describe implementation of a three dimensional variational(3D-Var) data assimilation scheme in a simple 2D morphodynamic model of Morecambe Bay, UK. The assimilation of observations of bathymetry derived from SAR satellite imagery and a ship-borne survey is shown to significantly improve the predictive capability of the model over a 2 year run. Here, the model parameters are set by manual calibration; this is laborious and is found to produce different parameter values depending on the type and coverage of the validation dataset. The second part of this paper considers the problem of model parameter estimation in more detail. We explain how, by employing the technique of state augmentation, it is possible to use data assimilation to estimate uncertain model parameters concurrently with the model state. This approach removes inefficiencies associated with manual calibration and enables more effective use of observational data. We outline the development of a novel hybrid sequential 3D-Var data assimilation algorithm for joint state-parameter estimation and demonstrate its efficacy using an idealised 1D sediment transport model. The results of this study are extremely positive and suggest that there is great potential for the use of data assimilation-based state-parameter estimation in coastal morphodynamic modelling.
Resumo:
Data assimilation algorithms are a crucial part of operational systems in numerical weather prediction, hydrology and climate science, but are also important for dynamical reconstruction in medical applications and quality control for manufacturing processes. Usually, a variety of diverse measurement data are employed to determine the state of the atmosphere or to a wider system including land and oceans. Modern data assimilation systems use more and more remote sensing data, in particular radiances measured by satellites, radar data and integrated water vapor measurements via GPS/GNSS signals. The inversion of some of these measurements are ill-posed in the classical sense, i.e. the inverse of the operator H which maps the state onto the data is unbounded. In this case, the use of such data can lead to significant instabilities of data assimilation algorithms. The goal of this work is to provide a rigorous mathematical analysis of the instability of well-known data assimilation methods. Here, we will restrict our attention to particular linear systems, in which the instability can be explicitly analyzed. We investigate the three-dimensional variational assimilation and four-dimensional variational assimilation. A theory for the instability is developed using the classical theory of ill-posed problems in a Banach space framework. Further, we demonstrate by numerical examples that instabilities can and will occur, including an example from dynamic magnetic tomography.
Resumo:
Advances in hardware and software in the past decade allow to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence from these advances in order to cope with the real time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data of continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, in some applications it is required to analyse the data in real time as soon as it is being captured. Such cases are for example if the data stream is infinite, fast changing, or simply too large in size to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new and removing old rules. It is different from the more popular decision tree based classifiers as it tends to leave data instances rather unclassified than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting which will also address the problem of changing feature domain values.
Resumo:
The purpose of this lecture is to review recent development in data analysis, initialization and data assimilation. The development of 3-dimensional multivariate schemes has been very timely because of its suitability to handle the many different types of observations during FGGE. Great progress has taken place in the initialization of global models by the aid of non-linear normal mode technique. However, in spite of great progress, several fundamental problems are still unsatisfactorily solved. Of particular importance is the question of the initialization of the divergent wind fields in the Tropics and to find proper ways to initialize weather systems driven by non-adiabatic processes. The unsatisfactory ways in which such processes are being initialized are leading to excessively long spin-up times.
Resumo:
Background: Since their inception, Twitter and related microblogging systems have provided a rich source of information for researchers and have attracted interest in their affordances and use. Since 2009 PubMed has included 123 journal articles on medicine and Twitter, but no overview exists as to how the field uses Twitter in research. // Objective: This paper aims to identify published work relating to Twitter indexed by PubMed, and then to classify it. This classification will provide a framework in which future researchers will be able to position their work, and to provide an understanding of the current reach of research using Twitter in medical disciplines. Limiting the study to papers indexed by PubMed ensures the work provides a reproducible benchmark. // Methods: Papers, indexed by PubMed, on Twitter and related topics were identified and reviewed. The papers were then qualitatively classified based on the paper’s title and abstract to determine their focus. The work that was Twitter focused was studied in detail to determine what data, if any, it was based on, and from this a categorization of the data set size used in the studies was developed. Using open coded content analysis additional important categories were also identified, relating to the primary methodology, domain and aspect. // Results: As of 2012, PubMed comprises more than 21 million citations from biomedical literature, and from these a corpus of 134 potentially Twitter related papers were identified, eleven of which were subsequently found not to be relevant. There were no papers prior to 2009 relating to microblogging, a term first used in 2006. Of the remaining 123 papers which mentioned Twitter, thirty were focussed on Twitter (the others referring to it tangentially). The early Twitter focussed papers introduced the topic and highlighted the potential, not carrying out any form of data analysis. The majority of published papers used analytic techniques to sort through thousands, if not millions, of individual tweets, often depending on automated tools to do so. Our analysis demonstrates that researchers are starting to use knowledge discovery methods and data mining techniques to understand vast quantities of tweets: the study of Twitter is becoming quantitative research. // Conclusions: This work is to the best of our knowledge the first overview study of medical related research based on Twitter and related microblogging. We have used five dimensions to categorise published medical related research on Twitter. This classification provides a framework within which researchers studying development and use of Twitter within medical related research, and those undertaking comparative studies of research relating to Twitter in the area of medicine and beyond, can position and ground their work.
Resumo:
Remote sensing observations often have correlated errors, but the correlations are typically ignored in data assimilation for numerical weather prediction. The assumption of zero correlations is often used with data thinning methods, resulting in a loss of information. As operational centres move towards higher-resolution forecasting, there is a requirement to retain data providing detail on appropriate scales. Thus an alternative approach to dealing with observation error correlations is needed. In this article, we consider several approaches to approximating observation error correlation matrices: diagonal approximations, eigendecomposition approximations and Markov matrices. These approximations are applied in incremental variational assimilation experiments with a 1-D shallow water model using synthetic observations. Our experiments quantify analysis accuracy in comparison with a reference or ‘truth’ trajectory, as well as with analyses using the ‘true’ observation error covariance matrix. We show that it is often better to include an approximate correlation structure in the observation error covariance matrix than to incorrectly assume error independence. Furthermore, by choosing a suitable matrix approximation, it is feasible and computationally cheap to include error correlation structure in a variational data assimilation algorithm.
Resumo:
Data assimilation methods which avoid the assumption of Gaussian error statistics are being developed for geoscience applications. We investigate how the relaxation of the Gaussian assumption affects the impact observations have within the assimilation process. The effect of non-Gaussian observation error (described by the likelihood) is compared to previously published work studying the effect of a non-Gaussian prior. The observation impact is measured in three ways: the sensitivity of the analysis to the observations, the mutual information, and the relative entropy. These three measures have all been studied in the case of Gaussian data assimilation and, in this case, have a known analytical form. It is shown that the analysis sensitivity can also be derived analytically when at least one of the prior or likelihood is Gaussian. This derivation shows an interesting asymmetry in the relationship between analysis sensitivity and analysis error covariance when the two different sources of non-Gaussian structure are considered (likelihood vs. prior). This is illustrated for a simple scalar case and used to infer the effect of the non-Gaussian structure on mutual information and relative entropy, which are more natural choices of metric in non-Gaussian data assimilation. It is concluded that approximating non-Gaussian error distributions as Gaussian can give significantly erroneous estimates of observation impact. The degree of the error depends not only on the nature of the non-Gaussian structure, but also on the metric used to measure the observation impact and the source of the non-Gaussian structure.
Resumo:
Advances in hardware technologies allow to capture and process data in real-time and the resulting high throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real-time. The creation and real-time adaption of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even so these algorithms are fast, they are challenged by high velocity data streams, where data instances are incoming at a fast rate. This is problematic if the applications desire that there is no or only a very little delay between changes in the patterns of the stream and absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
Resumo:
Classical regression methods take vectors as covariates and estimate the corresponding vectors of regression parameters. When addressing regression problems on covariates of more complex form such as multi-dimensional arrays (i.e. tensors), traditional computational models can be severely compromised by ultrahigh dimensionality as well as complex structure. By exploiting the special structure of tensor covariates, the tensor regression model provides a promising solution to reduce the model’s dimensionality to a manageable level, thus leading to efficient estimation. Most of the existing tensor-based methods independently estimate each individual regression problem based on tensor decomposition which allows the simultaneous projections of an input tensor to more than one direction along each mode. As a matter of fact, multi-dimensional data are collected under the same or very similar conditions, so that data share some common latent components but can also have their own independent parameters for each regression task. Therefore, it is beneficial to analyse regression parameters among all the regressions in a linked way. In this paper, we propose a tensor regression model based on Tucker Decomposition, which identifies not only the common components of parameters across all the regression tasks, but also independent factors contributing to each particular regression task simultaneously. Under this paradigm, the number of independent parameters along each mode is constrained by a sparsity-preserving regulariser. Linked multiway parameter analysis and sparsity modeling further reduce the total number of parameters, with lower memory cost than their tensor-based counterparts. The effectiveness of the new method is demonstrated on real data sets.
Resumo:
An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data imposes unique challenges to data mining algorithms, such as concept drifts, the need to analyse the data on the fly due to unbounded data streams and scalable algorithms due to potentially high throughput of data. Real-time classification algorithms that are adaptive to concept drifts and fast exist, however, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN is based on an adaptive statistical data summary based on Micro-Clusters. MC-NN is very fast and adaptive to concept drift whilst maintaining the parallel properties of the base KNN classifier. Also MC-NN is competitive compared with existing data stream classifiers in terms of accuracy and speed.