991 resultados para outlier detection


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Outlier detection in high dimensional categorical data has been a problem of much interest due to the extensive use of qualitative features for describing the data across various application areas. Though there exist various established methods for dealing with the dimensionality aspect through feature selection on numerical data, the categorical domain is actively being explored. As outlier detection is generally considered as an unsupervised learning problem due to lack of knowledge about the nature of various types of outliers, the related feature selection task also needs to be handled in a similar manner. This motivates the need to develop an unsupervised feature selection algorithm for efficient detection of outliers in categorical data. Addressing this aspect, we propose a novel feature selection algorithm based on the mutual information measure and the entropy computation. The redundancy among the features is characterized using the mutual information measure for identifying a suitable feature subset with less redundancy. The performance of the proposed algorithm in comparison with the information gain based feature selection shows its effectiveness for outlier detection. The efficacy of the proposed algorithm is demonstrated on various high-dimensional benchmark data sets employing two existing outlier detection methods.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This study discusses structural damage diagnosis of real steel truss bridges by measuring trafficinduced vibration of bridges and utilizing a damage indicator derived from linear system parameters of a time series model. On-site damage experiments were carried out on real steel truss bridges. Artificial damage was applied to the bridge by severing a truss member with a cutting machine.Vehicle-induced vibrations of the bridges before and after applying damagewere measured and used in structural damage diagnosis of the bridges. Changes in the damage indicator are detected by Mahalanobis-Taguchi system (MTS) which is one of multivariate outlier analyses. The damage indicator and outlier detection was successfully applied to detect anomalies in the steel truss bridges utilizing vehicle-induced vibrations. Observations through this study demonstrate feasibility of the proposed approach for real world applications.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper, addresses the problem of novelty detection in the case that the observed data is a mixture of a known 'background' process contaminated with an unknown other process, which generates the outliers, or novel observations. The framework we describe here is quite general, employing univariate classification with incomplete information, based on knowledge of the distribution (the 'probability density function', 'pdf') of the data generated by the 'background' process. The relative proportion of this 'background' component (the 'prior' 'background' 'probability), the 'pdf' and the 'prior' probabilities of all other components are all assumed unknown. The main contribution is a new classification scheme that identifies the maximum proportion of observed data following the known 'background' distribution. The method exploits the Kolmogorov-Smirnov test to estimate the proportions, and afterwards data are Bayes optimally separated. Results, demonstrated with synthetic data, show that this approach can produce more reliable results than a standard novelty detection scheme. The classification algorithm is then applied to the problem of identifying outliers in the SIC2004 data set, in order to detect the radioactive release simulated in the 'oker' data set. We propose this method as a reliable means of novelty detection in the emergency situation which can also be used to identify outliers prior to the application of a more general automatic mapping algorithm. © Springer-Verlag 2007.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Satellite-borne scatterometers are used to measure backscattered micro-wave radiation from the ocean surface. This data may be used to infer surface wind vectors where no direct measurements exist. Inherent in this data are outliers owing to aberrations on the water surface and measurement errors within the equipment. We present two techniques for identifying outliers using neural networks; the outliers may then be removed to improve models derived from the data. Firstly the generative topographic mapping (GTM) is used to create a probability density model; data with low probability under the model may be classed as outliers. In the second part of the paper, a sensor model with input-dependent noise is used and outliers are identified based on their probability under this model. GTM was successfully modified to incorporate prior knowledge of the shape of the observation manifold; however, GTM could not learn the double skinned nature of the observation manifold. To learn this double skinned manifold necessitated the use of a sensor model which imposes strong constraints on the mapping. The results using GTM with a fixed noise level suggested the noise level may vary as a function of wind speed. This was confirmed by experiments using a sensor model with input-dependent noise, where the variation in noise is most sensitive to the wind speed input. Both models successfully identified gross outliers with the largest differences between models occurring at low wind speeds. © 2003 Elsevier Science Ltd. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

DUE TO COPYRIGHT RESTRICTIONS ONLY AVAILABLE FOR CONSULTATION AT ASTON UNIVERSITY LIBRARY AND INFORMATION SERVICES WITH PRIOR ARRANGEMENT

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper explains some drawbacks on previous approaches for detecting influential observations in deterministic nonparametric data envelopment analysis models as developed by Yang et al. (Annals of Operations Research 173:89-103, 2010). For example efficiency scores and relative entropies obtained in this model are unimportant to outlier detection and the empirical distribution of all estimated relative entropies is not a Monte-Carlo approximation. In this paper we developed a new method to detect whether a specific DMU is truly influential and a statistical test has been applied to determine the significance level. An application for measuring efficiency of hospitals is used to show the superiority of this method that leads to significant advancements in outlier detection. © 2014 Springer Science+Business Media New York.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

This paper presents a technique for the automated removal of noise from process execution logs. Noise is the result of data quality issues such as logging errors and manifests itself in the form of infrequent process behavior. The proposed technique generates an abstract representation of an event log as an automaton capturing the direct follows relations between event labels. This automaton is then pruned from arcs with low relative frequency and used to remove from the log those events not fitting the automaton, which are identified as outliers. The technique has been extensively evaluated on top of various auto- mated process discovery algorithms using both artificial logs with different levels of noise, as well as a variety of real-life logs. The results show that the technique significantly improves the quality of the discovered process model along fitness, appropriateness and simplicity, without negative effects on generalization. Further, the technique scales well to large and complex logs.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

In questa tesi vengono analizzati gli algoritmi DistributedSolvingSet e LazyDistributedSolvingSet e verranno mostrati dei risultati sperimentali relativi al secondo.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Outliers are objects that show abnormal behavior with respect to their context or that have unexpected values in some of their parameters. In decision-making processes, information quality is of the utmost importance. In specific applications, an outlying data element may represent an important deviation in a production process or a damaged sensor. Therefore, the ability to detect these elements could make the difference between making a correct and an incorrect decision. This task is complicated by the large sizes of typical databases. Due to their importance in search processes in large volumes of data, researchers pay special attention to the development of efficient outlier detection techniques. This article presents a computationally efficient algorithm for the detection of outliers in large volumes of information. This proposal is based on an extension of the mathematical framework upon which the basic theory of detection of outliers, founded on Rough Set Theory, has been constructed. From this starting point, current problems are analyzed; a detection method is proposed, along with a computational algorithm that allows the performance of outlier detection tasks with an almost-linear complexity. To illustrate its viability, the results of the application of the outlier-detection algorithm to the concrete example of a large database are presented.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

High-dimensional problem domains pose significant challenges for anomaly detection. The presence of irrelevant features can conceal the presence of anomalies. This problem, known as the '. curse of dimensionality', is an obstacle for many anomaly detection techniques. Building a robust anomaly detection model for use in high-dimensional spaces requires the combination of an unsupervised feature extractor and an anomaly detector. While one-class support vector machines are effective at producing decision surfaces from well-behaved feature vectors, they can be inefficient at modelling the variation in large, high-dimensional datasets. Architectures such as deep belief networks (DBNs) are a promising technique for learning robust features. We present a hybrid model where an unsupervised DBN is trained to extract generic underlying features, and a one-class SVM is trained from the features learned by the DBN. Since a linear kernel can be substituted for nonlinear ones in our hybrid model without loss of accuracy, our model is scalable and computationally efficient. The experimental results show that our proposed model yields comparable anomaly detection performance with a deep autoencoder, while reducing its training and testing time by a factor of 3 and 1000, respectively.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The problem of unsupervised anomaly detection arises in a wide variety of practical applications. While one-class support vector machines have demonstrated their effectiveness as an anomaly detection technique, their ability to model large datasets is limited due to their memory and time complexity for training. To address this issue for supervised learning of kernel machines, there has been growing interest in random projection methods as an alternative to the computationally expensive problems of kernel matrix construction and support vector optimisation. In this paper we leverage the theory of nonlinear random projections and propose the Randomised One-class SVM (R1SVM), which is an efficient and scalable anomaly detection technique that can be trained on large-scale datasets. Our empirical analysis on several real-life and synthetic datasets shows that our randomised 1SVM algorithm achieves comparable or better accuracy to deep autoen-coder and traditional kernelised approaches for anomaly detection, while being approximately 100 times faster in training and testing.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Anomaly detection in a WSN is an important aspect of data analysis in order to identify data items that significantly differ from normal data. A characteristic of the data generated by a WSN is that the data distribution may alter over the lifetime of the network due to the changing nature of the phenomenon being observed. Anomaly detection techniques must be able to adapt to a non-stationary data distribution in order to perform optimally. In this survey, we provide a comprehensive overview of approaches to anomaly detection in a WSN and their operation in a non-stationary environment.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Background Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset. Results We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours. Conclusions We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This thesis describes the development of a robust and novel prototype to address the data quality problems that relate to the dimension of outlier data. It thoroughly investigates the associated problems with regards to detecting, assessing and determining the severity of the problem of outlier data; and proposes granule-mining based alternative techniques to significantly improve the effectiveness of mining and assessing outlier data.