337 results for outliers


Abstract:

Averaging is ubiquitous in many sciences, in engineering, and in everyday practice. The notions of the arithmetic, geometric, and harmonic means developed by the ancient Greeks are in widespread use today. When thinking of an average, most people would use the arithmetic mean, "the average", or perhaps its weighted version in order to associate the inputs with degrees of importance. While this is certainly the simplest and most intuitive averaging function, its use is often not warranted. For example, when averaging interest rates, it is the geometric and not the arithmetic mean that is the right method. On the other hand, the arithmetic mean can be heavily biased by a few extreme inputs and can therefore convey a misleading picture. This is why real estate markets report the median rather than the average price (which could be skewed by one or a few outliers), and why judges' marks in some Olympic sports are trimmed of the smallest and largest values.
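
A minimal Python sketch of these points (the numbers below are invented for illustration, not taken from the text): the geometric mean is the appropriate average of growth factors, while the median and a trimmed mean resist a single extreme value that distorts the arithmetic mean.

```python
import numpy as np
from scipy import stats

# Annual growth factors for three years: +10%, +20%, -15% (illustrative numbers).
factors = np.array([1.10, 1.20, 0.85])

arithmetic = factors.mean()        # overstates the true average growth (~1.050)
geometric = stats.gmean(factors)   # the correct "average" growth factor (~1.039)
print(arithmetic, geometric)

# A single outlier distorts the mean but barely moves the median or a trimmed mean.
prices = np.array([200_000, 210_000, 205_000, 215_000, 2_000_000])  # one extreme sale
print(prices.mean())                 # pulled far above the typical price
print(np.median(prices))             # robust to the outlier
print(stats.trim_mean(prices, 0.2))  # drop 20% from each tail, like trimmed judges' scores
```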

Abstract:

Although the hyperplane-based One-Class Support Vector Machine (OCSVM) and the hypersphere-based Support Vector Data Description (SVDD) algorithms have been shown to be very effective in detecting outliers, their performance on noisy and unlabeled training data has not been widely studied. Moreover, only a few heuristic approaches have been proposed for setting the various parameters of these methods in an unsupervised manner. In this paper, we propose two unsupervised methods for estimating the optimal parameter settings to train OCSVM and SVDD models, based on analysing the structure of the data. We show that our heuristic is substantially faster than existing parameter estimation approaches, while its accuracy is comparable with supervised parameter learning methods such as grid search with cross-validation on labeled data. In addition, our proposed approaches can be used to prepare a labeled data set for an OCSVM or an SVDD from unlabeled data.
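
The paper's own heuristics are not reproduced here; as a rough sketch of unsupervised parameter selection for an OCSVM, a common rule of thumb (median pairwise distance for the RBF width, a guessed contamination rate for nu) can be wired into scikit-learn on synthetic data:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # nominal data
               rng.uniform(-6, 6, size=(10, 2))])  # a few outliers mixed in

# Unsupervised parameter choices (a generic rule of thumb, not the paper's method):
# gamma from the median pairwise distance, nu as a guess of the outlier fraction.
sigma = np.median(pdist(X))
gamma = 1.0 / (2.0 * sigma ** 2)
nu = 0.05

model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X)
labels = model.predict(X)        # +1 for inliers, -1 for flagged outliers
print((labels == -1).sum(), "points flagged as outliers")
```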

Abstract:

The K-means algorithm is one of the most popular clustering algorithms in current use, as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of its assumptions while remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a posteriori Dirichlet process mixtures), is statistically rigorous, as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example binary, count, or ordinal data. It can also efficiently separate outliers from the data. This additional flexibility does not incur significant computational overhead compared to K-means, with MAP-DP convergence typically achieved on the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross-validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
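
MAP-DP itself is the authors' algorithm and is not shown here; as a loose stand-in for the idea that a Dirichlet process mixture infers the number of clusters from the data, the sketch below contrasts K-means (K fixed a priori) with scikit-learn's variational BayesianGaussianMixture under a Dirichlet process prior, on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic data with 3 true clusters (illustrative only).
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

# K-means needs K fixed in advance.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# A Dirichlet-process mixture starts with a generous upper bound on components
# and lets the data decide how many are actually used.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
labels = dpgmm.predict(X)
print("components effectively used:", len(np.unique(labels)))
```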

Abstract:

We measured the distribution in absolute magnitude - circular velocity space for a well-defined sample of 199 rotating galaxies of the Calar Alto Legacy Integral Field Area Survey (CALIFA) using their stellar kinematics. Our aim in this analysis is to avoid subjective selection criteria and to take volume and large-scale structure factors into account. Using stellar velocity fields instead of gas emission-line kinematics allows us to include rapidly rotating early-type galaxies. Our initial sample contains 277 galaxies with available stellar velocity fields and growth-curve r-band photometry. After rejecting 51 velocity fields that could not be modelled because of a low number of bins, foreground contamination, or significant interaction, we performed Markov chain Monte Carlo modelling of the velocity fields, from which we obtained the rotation curve and kinematic parameters along with realistic uncertainties. We applied an extinction correction and calculated the circular velocity v_circ, accounting for the pressure support of each galaxy. The resulting galaxy distribution in the M_r - v_circ plane was then modelled as a mixture of two distinct populations, allowing robust and reproducible rejection of outliers, a significant fraction of which are slow rotators. The selection effects are understood well enough that we were able to correct for the incompleteness of the sample. The 199 galaxies were weighted by volume and large-scale structure factors, which enabled us to fit a volume-corrected Tully-Fisher relation (TFR). More importantly, we also provide the volume-corrected distribution of galaxies in the M_r - v_circ plane, which can be compared with cosmological simulations. The joint distribution of the luminosity and circular velocity space densities, representative over the range -20 > M_r > -22 mag, can place more stringent constraints on galaxy formation and evolution scenarios than linear TFR fit parameters or the luminosity function alone.
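
Purely as an illustration of the kind of outlier-aware TFR fit described (not the authors' two-population mixture, and not the CALIFA data), here is a sketch with invented numbers using simple iterative sigma-clipping around a linear fit of M_r against log10(v_circ):

```python
import numpy as np

# Synthetic Tully-Fisher-like sample (invented values):
# M_r = a + b * log10(v_circ) with scatter, plus a contaminating population.
rng = np.random.default_rng(1)
logv = rng.uniform(1.9, 2.5, size=200)                   # log10 of circular velocity
M_r = -2.0 - 8.0 * logv + rng.normal(0, 0.3, size=200)   # "normal" rotators
M_r[:20] += rng.normal(3.0, 1.0, size=20)                # crude stand-in for slow rotators

# Iterative sigma-clipping: a much simpler stand-in for the two-population
# mixture modelling used in the paper to reject outliers.
mask = np.ones_like(logv, dtype=bool)
for _ in range(5):
    b, a = np.polyfit(logv[mask], M_r[mask], 1)
    resid = M_r - (a + b * logv)
    mask = np.abs(resid) < 3.0 * resid[mask].std()

print(f"TFR fit: M_r = {a:.2f} + {b:.2f} * log10(v_circ), "
      f"{(~mask).sum()} galaxies rejected as outliers")
```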

Abstract:

What have we learnt from the 2006-2012 crisis, including events such as the subprime crisis, the bankruptcy of Lehman Brothers, or the European sovereign debt crisis, among others? It is usually assumed that, for firms with a quoted CDS, the CDS is the key factor in establishing the credit risk premium for a new financial asset. Thus, the CDS is a key element for any investor exploiting relative value opportunities across a firm's capital structure. In the first chapter we study the most relevant aspects of the microstructure of the CDS market in terms of pricing, in order to have a clear idea of how this market works. We consider such an analysis a necessary step for establishing a solid base for the rest of the chapters and for the different empirical studies we perform. In its document "Basel III: A global regulatory framework for more resilient banks and banking systems", the Basel Committee sets out the requirement of a capital charge for credit valuation adjustment (CVA) risk in the trading book, together with the methodology for computing the capital requirement. This regulatory requirement has added extra pressure for in-depth knowledge of the CDS market, and it motivates the analysis performed in this thesis. The problem arises when estimating the credit risk premium for counterparties without a directly quoted CDS in the market: how can we estimate the credit spread for an issuer without a CDS? In addition, given the high-volatility period in the credit market in the last few years and, in particular, after the default of Lehman Brothers on 15 September 2008, we observe the presence of large outliers in the distribution of credit spreads across the different combinations of rating, industry, and region. After an exhaustive analysis of the results from the different models studied, we have reached the following conclusions. It is clear that hierarchical regression models fit the data much better than non-hierarchical regressions. Furthermore, we generally prefer the median model (50%-quantile regression) to the mean model (standard OLS regression) because of its robustness when assigning a price to a new credit asset without a quoted spread, minimizing the "inversion problem". Finally, an additional fundamental reason to prefer the median model is the typically right-skewed distribution of CDS spreads...
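
As a hedged illustration of the median-versus-mean point (synthetic, right-skewed data, not CDS quotes, and not the thesis's hierarchical models), median (50%-quantile) regression can be compared with OLS using statsmodels:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic, right-skewed "spread" data (invented for illustration):
# a rating-like score drives the spread, with heavy right-tail noise.
rng = np.random.default_rng(42)
score = rng.uniform(1, 10, size=500)                       # e.g. a numeric rating scale
spread = 50 + 25 * score + rng.lognormal(3, 1, size=500)   # lognormal noise => right skew

X = sm.add_constant(score)
ols_fit = sm.OLS(spread, X).fit()               # mean model: pulled up by the right tail
median_fit = sm.QuantReg(spread, X).fit(q=0.5)  # median model: robust to the skew

print("OLS coefficients:              ", ols_fit.params)
print("Median-regression coefficients:", median_fit.params)
```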

Abstract:

Knowledge of soil chemical attributes is highly relevant for the rational use of soil amendments and fertilizers. In this work, environments of the North, Northwest, and Serrana regions of the State of Rio de Janeiro are characterized in order to estimate organic carbon (Corg), cation exchange capacity (CEC), pH in water, exchangeable aluminium (Al+3), nitrogen, base saturation (V%), and phosphorus. The specific objective is an exploratory analysis of the soil fertility data of the three most productive regions of the State of Rio de Janeiro. The project used soil data systematized by Embrapa Solos (Santos et al., 2005). The soils analysed show low pH in water and high Al+3 contents, as well as low concentrations of P, N, and organic C. The CEC and V(%) values were considered good for soil fertility. The exploratory data analysis identified outliers and extreme values through the statistical summary and the box plots of the variables. Removing the extreme values greatly improved the consistency of the remaining data set, which points towards better-quality results in the kriging interpolations to be performed and in the digital mapping of fertility itself, following McBratney et al. (2003). The exploratory analysis proved useful for the next phases of digital soil-landscape mapping and for the fertilization recommendation to be proposed.
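
A minimal sketch of the box-plot (Tukey/IQR) rule used in such an exploratory analysis, on a made-up table rather than the Embrapa Solos data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative soil-fertility table (made-up values), with a couple of
# extreme values appended to each variable.
df = pd.DataFrame({
    "pH_H2O": np.r_[rng.normal(4.8, 0.4, 60), [7.9, 8.2]],
    "P_mg_dm3": np.r_[rng.lognormal(1.2, 0.5, 60), [95.0, 120.0]],
})

# Box-plot (Tukey) rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
def iqr_outliers(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

flags = df.apply(iqr_outliers)
print(flags.sum())             # number of flagged values per variable
print(df[flags.any(axis=1)])   # rows containing at least one flagged value
```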

Abstract:

This work describes the different components of the CFA88 program (Consolidated Frequency Analysis Package, version 1). The program allows the user to: 1) fit five frequency distributions (generalized extreme value, three-parameter lognormal, log Pearson Type III, Wakeby, and Weibull); 2) evaluate, through non-parametric tests, the assumptions associated with frequency analysis, namely independence, homogeneity, randomness, and trend over time; 3) detect low and high outliers; and 4) analyse hydrometeorological series with zeros, historical information, and outlier events.
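
CFA88 itself is a separate package; the sketch below only illustrates the general distribution-fitting step in Python, assuming a synthetic annual-maximum series and using the generalized extreme value distribution from SciPy:

```python
from scipy import stats

# Synthetic annual-maximum series (invented values; CFA88 works on real
# hydrometeorological records, this only illustrates the fitting step).
annual_max = stats.genextreme.rvs(c=-0.1, loc=120.0, scale=30.0, size=60, random_state=7)

# Fit a generalized extreme value distribution and estimate the 100-year event,
# i.e. the quantile with annual non-exceedance probability 1 - 1/100.
shape, loc, scale = stats.genextreme.fit(annual_max)
q100 = stats.genextreme.ppf(1 - 1 / 100, shape, loc=loc, scale=scale)
print(f"GEV fit: shape={shape:.3f}, loc={loc:.1f}, scale={scale:.1f}")
print(f"Estimated 100-year value: {q100:.1f}")
```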