991 resultados para outlier detection


Relevância:

60.00% 60.00%

Publicador:

Resumo:

We have investigated the use of hierarchical clustering of flow cytometry data to classify samples of conventional central chondrosarcoma, a malignant cartilage forming tumor of uncertain cellular origin, according to similarities with surface marker profiles of several known cell types. Human primary chondrosarcoma cells, articular chondrocytes, mesenchymal stem cells, fibroblasts, and a panel of tumor cell lines from chondrocytic or epithelial origin were clustered based on the expression profile of eleven surface markers. For clustering, eight hierarchical clustering algorithms, three distance metrics, as well as several approaches for data preprocessing, including multivariate outlier detection, logarithmic transformation, and z-score normalization, were systematically evaluated. By selecting clustering approaches shown to give reproducible results for cluster recovery of known cell types, primary conventional central chondrosacoma cells could be grouped in two main clusters with distinctive marker expression signatures: one group clustering together with mesenchymal stem cells (CD49b-high/CD10-low/CD221-high) and a second group clustering close to fibroblasts (CD49b-low/CD10-high/CD221-low). Hierarchical clustering also revealed substantial differences between primary conventional central chondrosarcoma cells and established chondrosarcoma cell lines, with the latter not only segregating apart from primary tumor cells and normal tissue cells, but clustering together with cell lines from epithelial lineage. Our study provides a foundation for the use of hierarchical clustering applied to flow cytometry data as a powerful tool to classify samples according to marker expression patterns, which could lead to uncover new cancer subtypes.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Model-based calibration of steady-state engine operation is commonly performed with highly parameterized empirical models that are accurate but not very robust, particularly when predicting highly nonlinear responses such as diesel smoke emissions. To address this problem, and to boost the accuracy of more robust non-parametric methods to the same level, GT-Power was used to transform the empirical model input space into multiple input spaces that simplified the input-output relationship and improved the accuracy and robustness of smoke predictions made by three commonly used empirical modeling methods: Multivariate Regression, Neural Networks and the k-Nearest Neighbor method. The availability of multiple input spaces allowed the development of two committee techniques: a 'Simple Committee' technique that used averaged predictions from a set of 10 pre-selected input spaces chosen by the training data and the "Minimum Variance Committee" technique where the input spaces for each prediction were chosen on the basis of disagreement between the three modeling methods. This latter technique equalized the performance of the three modeling methods. The successively increasing improvements resulting from the use of a single best transformed input space (Best Combination Technique), Simple Committee Technique and Minimum Variance Committee Technique were verified with hypothesis testing. The transformed input spaces were also shown to improve outlier detection and to improve k-Nearest Neighbor performance when predicting dynamic emissions with steady-state training data. An unexpected finding was that the benefits of input space transformation were unaffected by changes in the hardware or the calibration of the underlying GT-Power model.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

When considering data from many trials, it is likely that some of them present a markedly different intervention effect or exert an undue influence on the summary results. We develop a forward search algorithm for identifying outlying and influential studies in meta-analysis models. The forward search algorithm starts by fitting the hypothesized model to a small subset of likely outlier-free studies and proceeds by adding studies into the set one-by-one that are determined to be closest to the fitted model of the existing set. As each study is added to the set, plots of estimated parameters and measures of fit are monitored to identify outliers by sharp changes in the forward plots. We apply the proposed outlier detection method to two real data sets; a meta-analysis of 26 studies that examines the effect of writing-to-learn interventions on academic achievement adjusting for three possible effect modifiers, and a meta-analysis of 70 studies that compares a fluoride toothpaste treatment to placebo for preventing dental caries in children. A simple simulated example is used to illustrate the steps of the proposed methodology, and a small-scale simulation study is conducted to evaluate the performance of the proposed method. Copyright © 2016 John Wiley & Sons, Ltd.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In the framework of the European Project for Ice Coring in Antarctica (EPICA), a comprehensive glaciological pre-site survey has been carried out on Amundsenisen, Dronning Maud Land, East Antarctica, in the past decade. Within this survey, four intermediate-depth ice cores and 13 snow pits were analyzed for their ionic composition and interpreted with respect to the spatial and temporal variability of volcanic sulphate deposition. The comparison of the non-sea-salt (nss)-sulphate peaks that are related to the well-known eruptions of Pinatubo and Cerro Hudson in AD 1991 revealed sulphate depositions of comparable size (15.8 ± 3.4 kg/km**2) in 11 snow pits. There is a tendency to higher annual concentrations for smaller snow-accumulation rates. The combination of seasonal sodium and annually resolved nss-sulphate records allowed the establishment of a time-scale derived by annual-layer counting over the last 2000 years and thus a detailed chronology of annual volcanic sulphate deposition. Using a robust outlier detection algorithm, 49 volcanic eruptions were identified between AD 165 and 1997. The dating uncertainty is ±3 years between AD 1997 and 1601, around ±5 years between AD 1601 and 1257, and increasing to ±24 years at AD 165, improving the accuracy of the volcanic chronology during the penultimate millennium considerably.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

En este Trabajo de Fin de Máster se desarrollará un sistema de detección de fraude en pagos con tarjeta de crédito en tiempo real utilizando tecnologías de procesamiento distribuido. Concretamente se considerarán dos tecnologías: TIBCO, un conjunto de herramientas comerciales diseñadas para el procesamiento de eventos complejos, y Apache Spark, un sistema abierto para el procesamiento de datos en tiempo real. Además de implementar el sistema utilizando las dos tecnologías propuestas, un objetivo, otro objetivo de este Trabajo de Fin de Máster consiste en analizar y comparar estos dos sistemas implementados usados para procesamiento en tiempo real. Para la detección de fraude en pagos con tarjeta de crédito se aplicarán técnicas de aprendizaje máquina, concretamente del campo de anomaly/outlier detection. Como fuentes de datos que alimenten los sistemas, haremos uso de tecnologías de colas de mensajes como TIBCO EMS y Kafka. Los datos generados son enviados a estas colas para que los respectivos sistemas puedan procesarlos y aplicar el algoritmo de aprendizaje máquina, determinando si una nueva instancia es fraude o no. Ambos sistemas hacen uso de una base de datos MongoDB para almacenar los datos generados de forma pseudoaleatoria por los generadores de mensajes, correspondientes a movimientos de tarjetas de crédito. Estos movimientos posteriormente serán usados como conjunto de entrenamiento para el algoritmo de aprendizaje máquina.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Mathematical models are increasingly used in environmental science thus increasing the importance of uncertainty and sensitivity analyses. In the present study, an iterative parameter estimation and identifiability analysis methodology is applied to an atmospheric model – the Operational Street Pollution Model (OSPMr). To assess the predictive validity of the model, the data is split into an estimation and a prediction data set using two data splitting approaches and data preparation techniques (clustering and outlier detection) are analysed. The sensitivity analysis, being part of the identifiability analysis, showed that some model parameters were significantly more sensitive than others. The application of the determined optimal parameter values was shown to succesfully equilibrate the model biases among the individual streets and species. It was as well shown that the frequentist approach applied for the uncertainty calculations underestimated the parameter uncertainties. The model parameter uncertainty was qualitatively assessed to be significant, and reduction strategies were identified.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper provides a novel Exceptional Object Analysis for Finding Rare Environmental Events (EOAFREE). The major contribution of our EOAFREE method is that it proposes a general Improved Exceptional Object Analysis based on Noises (IEOAN) algorithm to efficiently detect and rank exceptional objects. Our IEOAN algorithm is more general than already known outlier detection algorithms to find exceptional objects that may be not on the border; and experimental study shows that our IEOAN algorithm is far more efficient than directly recursively using already known clustering algorithms that may not force every data instance to belong to a cluster to detect rare events. Another contribution is that it provides an approach to preprocess heterogeneous real world data through exploring domain knowledge, based on which it defines changes instead of the water data value itself as the input of the IEOAN algorithm to remove the geographical differences between any two sites and the temporal differences between any two years. The effectiveness of our EOAFREE method is demonstrated by a real world application - that is, to detect water pollution events from the water quality datasets of 93 sites distributed in 10 river basins in Victoria, Australia between 1975 and 2010. © 2012 Elsevier B.V..

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Although the hyper-plane based One-Class Support Vector Machine (OCSVM) and the hyper-spherical based Support Vector Data Description (SVDD) algorithms have been shown to be very effective in detecting outliers, their performance on noisy and unlabeled training data has not been widely studied. Moreover, only a few heuristic approaches have been proposed to set the different parameters of these methods in an unsupervised manner. In this paper, we propose two unsupervised methods for estimating the optimal parameter settings to train OCSVM and SVDD models, based on analysing the structure of the data. We show that our heuristic is substantially faster than existing parameter estimation approaches while its accuracy is comparable with supervised parameter learning methods, such as grid-search with crossvalidation on labeled data. In addition, our proposed approaches can be used to prepare a labeled data set for a OCSVM or a SVDD from unlabeled data.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

An outlier removal based data cleaning technique is proposed to
clean manually pre-segmented human skin data in colour images.
The 3-dimensional colour data is projected onto three 2-dimensional
planes, from which outliers are removed. The cleaned 2 dimensional
data projections are merged to yield a 3D clean RGB data. This data
is finally used to build a look up table and a single Gaussian classifier
for the purpose of human skin detection in colour images.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Heterogeneity in lifetime data may be modelled by multiplying an individual's hazard by an unobserved frailty. We test for the presence of frailty of this kind in univariate and bivariate data with Weibull distributed lifetimes, using statistics based on the ordered Cox-Snell residuals from the null model of no frailty. The form of the statistics is suggested by outlier testing in the gamma distribution. We find through simulation that the sum of the k largest or k smallest order statistics, for suitably chosen k , provides a powerful test when the frailty distribution is assumed to be gamma or positive stable, respectively. We provide recommended values of k for sample sizes up to 100 and simple formulae for estimated critical values for tests at the 5% level.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Anomaly detection in resource constrained wireless networks is an important challenge for tasks such as intrusion detection, quality assurance and event monitoring applications. The challenge is to detect these interesting events or anomalies in a timely manner, while minimising energy consumption in the network. We propose a distributed anomaly detection architecture, which uses multiple hyperellipsoidal clusters to model the data at each sensor node, and identify global and local anomalies in the network. In particular, a novel anomaly scoring method is proposed to provide a score for each hyperellipsoidal model, based on how remote the ellipsoid is relative to their neighbours. We demonstrate using several synthetic and real datasets that our proposed scheme achieves a higher detection performance with a significant reduction in communication overhead in the network compared to centralised and existing schemes. © 2014 Elsevier Ltd.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The removal of noise and outliers from measurement signals is a major problem in jet engine health monitoring. Topical measurement signals found in most jet engines include low rotor speed, high rotor speed. fuel flow and exhaust gas temperature. Deviations in these measurements from a baseline 'good' engine are often called measurement deltas and the health signals used for fault detection, isolation, trending and data mining. Linear filters such as the FIR moving average filter and IIR exponential average filter are used in the industry to remove noise and outliers from the jet engine measurement deltas. However, the use of linear filters can lead to loss of critical features in the signal that can contain information about maintenance and repair events that could be used by fault isolation algorithms to determine engine condition or by data mining algorithms to learn valuable patterns in the data, Non-linear filters such as the median and weighted median hybrid filters offer the opportunity to remove noise and gross outliers from signals while preserving features. In this study. a comparison of traditional linear filters popular in the jet engine industry is made with the median filter and the subfilter weighted FIR median hybrid (SWFMH) filter. Results using simulated data with implanted faults shows that the SWFMH filter results in a noise reduction of over 60 per cent compared to only 20 per cent for FIR filters and 30 per cent for IIR filters. Preprocessing jet engine health signals using the SWFMH filter would greatly improve the accuracy of diagnostic systems. (C) 2002 Published by Elsevier Science Ltd.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Measured process data normally contain inaccuracies because the measurements are obtained using imperfect instruments. As well as random errors one can expect systematic bias caused by miscalibrated instruments or outliers caused by process peaks such as sudden power fluctuations. Data reconciliation is the adjustment of a set of process data based on a model of the process so that the derived estimates conform to natural laws. In this paper, techniques for the detection and identification of both systematic bias and outliers in dynamic process data are presented. A novel technique for the detection and identification of systematic bias is formulated and presented. The problem of detection, identification and elimination of outliers is also treated using a modified version of a previously available clustering technique. These techniques are also combined to provide a global dynamic data reconciliation (DDR) strategy. The algorithms presented are tested in isolation and in combination using dynamic simulations of two continuous stirred tank reactors (CSTR).

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Detection of lane boundaries of a road based on the images or video taken by a video capturing device in a suburban environment is a challenging task. In this paper, a novel lane detection algorithm is proposed without considering camera parameters; which robustly detects lane boundaries in real-time especially for sub-urban roads. Initially, the proposed method fits the CIE L*a*b* transformed road chromaticity values (that is a* and b* values) to a bi-variate Gaussian model followed by the classification of road area based on Mahalanobis distance. Secondly, the classified road area acts as an arbitrary shaped region of interest (AROI) in order to extract blobs resulting from the filtered image by a two dimensional Gabor filter. This is considered as the first cue of images. Thirdly, another cue of images was employed in order to obtain an entropy image. Moreover, results from the color based image cue and entropy image cue were integrated following an outlier removing process. Finally, the correct road lane points are fitted with Bezier splines which act as control points that can form arbitrary shapes. The algorithm was implemented and experiments were carried out on sub-urban roads. The results show the effectiveness of the algorithm in producing more accurate lane boundaries on curvatures and other objects on the road.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The detection of lane boundaries on suburban streets using images obtained from video constitutes a challenging task. This is mainly due to the difficulties associated with estimating the complex geometric structure of lane boundaries, the quality of lane markings as a result of wear, occlusions by traffic, and shadows caused by road-side trees and structures. Most of the existing techniques for lane boundary detection employ a single visual cue and will only work under certain conditions and where there are clear lane markings. Also, better results are achieved when there are no other onroad objects present. This paper extends our previous work and discusses a novel lane boundary detection algorithm specifically addressing the abovementioned issues through the integration of two visual cues. The first visual cue is based on stripe-like features found on lane lines extracted using a two-dimensional symmetric Gabor filter. The second visual cue is based on a texture characteristic determined using the entropy measure of the predefined neighbourhood around a lane boundary line. The visual cues are then integrated using a rulebased classifier which incorporates a modified sequential covering algorithm to improve robustness. To separate lane boundary lines from other similar features, a road mask is generated using road chromaticity values estimated from CIE L*a*b* colour transformation. Extraneous points around lane boundary lines are then removed by an outlier removal procedure based on studentized residuals. The lane boundary lines are then modelled with Bezier spline curves. To validate the algorithm, extensive experimental evaluation was carried out on suburban streets and the results are presented.