337 resultados para outliers


Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper, we propose a semi-supervised approach of anomaly detection in Online Social Networks. The social network is modeled as a graph and its features are extracted to detect anomaly. A clustering algorithm is then used to group users based on these features and fuzzy logic is applied to assign degree of anomalous behavior to the users of these clusters. Empirical analysis shows effectiveness of this method.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper presents a technique for the automated removal of noise from process execution logs. Noise is the result of data quality issues such as logging errors and manifests itself in the form of infrequent process behavior. The proposed technique generates an abstract representation of an event log as an automaton capturing the direct follows relations between event labels. This automaton is then pruned from arcs with low relative frequency and used to remove from the log those events not fitting the automaton, which are identified as outliers. The technique has been extensively evaluated on top of various auto- mated process discovery algorithms using both artificial logs with different levels of noise, as well as a variety of real-life logs. The results show that the technique significantly improves the quality of the discovered process model along fitness, appropriateness and simplicity, without negative effects on generalization. Further, the technique scales well to large and complex logs.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Most information in linkage analysis for quantitative traits comes from pairs of relatives that are phenotypically most discordant or concordant. Confounding this, within-family outliers from non-genetic causes may create false positives and negatives. We investigated the influence of within-family outliers empirically, using one of the largest genome-wide linkage scans for height. The subjects were drawn from Australian twin cohorts consisting of 8447 individuals in 2861 families, providing a total of 5815 possible pairs of siblings in sibships. A variance component linkage analysis was performed, either including or excluding the within-family outliers. Using the entire dataset, the largest LOD scores were on chromosome 15q (LOD 2.3) and 11q (1.5). Excluding within-family outliers increased the LOD score for most regions, but the LOD score on chromosome 15 decreased from 2.3 to 1.2, suggesting that the outliers may create false negatives and false positives, although rare alleles of large effect may also be an explanation. Several regions suggestive of linkage to height were found after removing the outliers, including 1q23.1 (2.0), 3q22.1 (1.9) and 5q32 (2.3). We conclude that the investigation of the effect of within-family outliers, which is usually neglected, should be a standard quality control measure in linkage analysis for complex traits and may reduce the noise for the search of common variants of modest effect size as well as help identify rare variants of large effect and clinical significance. We suggest that the effect of within-family outliers deserves further investigation via theoretical and simulation studies.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The rapid growth in the field of data mining has lead to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Learning an input-output mapping from a set of examples can be regarded as synthesizing an approximation of a multi-dimensional function. From this point of view, this form of learning is closely related to regularization theory. In this note, we extend the theory by introducing ways of dealing with two aspects of learning: learning in the presence of unreliable examples and learning from positive and negative examples. The first extension corresponds to dealing with outliers among the sparse data. The second one corresponds to exploiting information about points or regions in the range of the function that are forbidden.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Liu, Yonghuai, Liu, Honghai, Li, Longzhuang, Wei, Baogang. Accurate Range Image Registration: Eliminating or Modelling Outliers. Proceedings of 12th IEEE Conference on Emerging Technologies and Factory Automation, 2007, pp. 1316-1323. Sponsorship: IEEE

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This thesis Entitled Bayesian inference in Exponential and pareto populations in the presence of outliers. The main theme of the present thesis is focussed on various estimation problems using the Bayesian appraoch, falling under the general category of accommodation procedures for analysing Pareto data containing outlier. In Chapter II. the problem of estimation of parameters in the classical Pareto distribution specified by the density function. In Chapter IV. we discuss the estimation of (1.19) when the sample contain a known number of outliers under three different data generating mechanisms, viz. the exchangeable model. Chapter V the prediction of a future observation based on a random sample that contains one contaminant. Chapter VI is devoted to the study of estimation problems concerning the exponential parameters under a k-outlier model.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Asymmetry in a distribution can arise from a long tail of values in the underlying process or from outliers that belong to another population that contaminate the primary process. The first paper of this series examined the effects of the former on the variogram and this paper examines the effects of asymmetry arising from outliers. Simulated annealing was used to create normally distributed random fields of different size that are realizations of known processes described by variograms with different nugget:sill ratios. These primary data sets were then contaminated with randomly located and spatially aggregated outliers from a secondary process to produce different degrees of asymmetry. Experimental variograms were computed from these data by Matheron's estimator and by three robust estimators. The effects of standard data transformations on the coefficient of skewness and on the variogram were also investigated. Cross-validation was used to assess the performance of models fitted to experimental variograms computed from a range of data contaminated by outliers for kriging. The results showed that where skewness was caused by outliers the variograms retained their general shape, but showed an increase in the nugget and sill variances and nugget:sill ratios. This effect was only slightly more for the smallest data set than for the two larger data sets and there was little difference between the results for the latter. Overall, the effect of size of data set was small for all analyses. The nugget:sill ratio showed a consistent decrease after transformation to both square roots and logarithms; the decrease was generally larger for the latter, however. Aggregated outliers had different effects on the variogram shape from those that were randomly located, and this also depended on whether they were aggregated near to the edge or the centre of the field. The results of cross-validation showed that the robust estimators and the removal of outliers were the most effective ways of dealing with outliers for variogram estimation and kriging. (C) 2007 Elsevier Ltd. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

An ensemble forecast is a collection of runs of a numerical dynamical model, initialized with perturbed initial conditions. In modern weather prediction for example, ensembles are used to retrieve probabilistic information about future weather conditions. In this contribution, we are concerned with ensemble forecasts of a scalar quantity (say, the temperature at a specific location). We consider the event that the verification is smaller than the smallest, or larger than the largest ensemble member. We call these events outliers. If a K-member ensemble accurately reflected the variability of the verification, outliers should occur with a base rate of 2/(K + 1). In operational forecast ensembles though, this frequency is often found to be higher. We study the predictability of outliers and find that, exploiting information available from the ensemble, forecast probabilities for outlier events can be calculated which are more skilful than the unconditional base rate. We prove this analytically for statistically consistent forecast ensembles. Further, the analytical results are compared to the predictability of outliers in an operational forecast ensemble by means of model output statistics. We find the analytical and empirical results to agree both qualitatively and quantitatively.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Multibiometrics aims at improving biometric security in presence of spoofing attempts, but exposes a larger availability of points of attack. Standard fusion rules have been shown to be highly sensitive to spoofing attempts – even in case of a single fake instance only. This paper presents a novel spoofing-resistant fusion scheme proposing the detection and elimination of anomalous fusion input in an ensemble of evidence with liveness information. This approach aims at making multibiometric systems more resistant to presentation attacks by modeling the typical behaviour of human surveillance operators detecting anomalies as employed in many decision support systems. It is shown to improve security, while retaining the high accuracy level of standard fusion approaches on the latest Fingerprint Liveness Detection Competition (LivDet) 2013 dataset.