337 results for outliers


Relevance:

10.00% 10.00%

Publisher:

Abstract:

Outlier detection in high dimensional categorical data has been a problem of much interest due to the extensive use of qualitative features for describing the data across various application areas. Though there exist various established methods for dealing with the dimensionality aspect through feature selection on numerical data, the categorical domain is actively being explored. As outlier detection is generally considered as an unsupervised learning problem due to lack of knowledge about the nature of various types of outliers, the related feature selection task also needs to be handled in a similar manner. This motivates the need to develop an unsupervised feature selection algorithm for efficient detection of outliers in categorical data. Addressing this aspect, we propose a novel feature selection algorithm based on the mutual information measure and the entropy computation. The redundancy among the features is characterized using the mutual information measure for identifying a suitable feature subset with less redundancy. The performance of the proposed algorithm in comparison with the information gain based feature selection shows its effectiveness for outlier detection. The efficacy of the proposed algorithm is demonstrated on various high-dimensional benchmark data sets employing two existing outlier detection methods.
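
The two quantities the abstract relies on, entropy and mutual information, can be computed directly from value counts in categorical columns. The following is a minimal sketch of those measures only, not the authors' feature selection algorithm; the column names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two categorical columns."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Hypothetical toy columns, for illustration only.
colour = ["red", "red", "blue", "blue"]
shade = ["dark", "dark", "light", "light"]  # carries the same information as colour
size = ["s", "l", "s", "l"]                 # statistically independent of colour

redundant = mutual_information(colour, shade)
independent = mutual_information(colour, size)
```

A pair of features with high mutual information is largely redundant, so a redundancy-aware selector would tend to keep only one of them.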

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Lipocalins constitute a superfamily of extracellular proteins that are found in all three kingdoms of life. Although very divergent in their sequences and functions, they show remarkable similarity in their 3-D structures. Lipocalins bind and transport small hydrophobic molecules. Earlier sequence-based phylogenetic studies of lipocalins highlighted that they have a long evolutionary history. However, the molecular and structural basis of their functional diversity is not completely understood. The main objective of the present study is to understand the functional diversity of the lipocalins using a structure-based phylogenetic approach. The present study of 39 protein domains from the lipocalin superfamily suggests that the clusters of lipocalins obtained by structure-based phylogeny correspond well with their functional diversity. Detailed analysis of each of the clusters and sub-clusters reveals that the 39 lipocalin domains cluster according to their mode of ligand binding, even though the clustering was performed on the basis of gross domain structure. The outliers in the phylogenetic tree are often from single-member families. The structure-based phylogenetic approach has also provided pointers for assigning putative functions to domains of unknown function in the lipocalin family. The approach employed in the present study can be used in the future for the functional identification of new lipocalin proteins and may be extended to other protein families whose members show poor sequence similarity but high structural similarity.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

We propose a distributed sequential algorithm for quick detection of spectral holes in a Cognitive Radio setup. Two or more local nodes make decisions and inform the fusion centre (FC) over a reporting Multiple Access Channel (MAC); the FC then makes the final decision. The local nodes use energy detection and the FC uses mean detection in the presence of fading, heavy-tailed electromagnetic interference (EMI) and outliers. The statistics of the primary signal, the channel gain and the EMI are not known. Different nonparametric sequential algorithms are compared to choose appropriate algorithms for use at the local nodes and the FC. A modification of a recently developed random walk test is selected for energy detection at the local nodes as well as for mean detection at the fusion centre. We show via simulations and analysis that the resulting nonparametric distributed algorithm performs well in the presence of fading, EMI and outliers. The algorithm is iterative in nature, which keeps the computation and storage requirements minimal.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

In this paper, we propose a vision-based mobile robot localization strategy. Local scale-invariant features are used as natural landmarks in unstructured and unmodified environments. The local characteristics of the features we use prove to be robust to occlusion and outliers. In addition, the invariance of the features to viewpoint change makes them suitable landmarks for mobile robot localization. Scale-invariant features detected in the first exploration are indexed into a location database. Indexing and voting allow efficient recognition for global localization. The localization result is verified by the epipolar geometry between the representative view in the database and the view to be localized, which reduces the probability of false localization. The localization system can recover the pose of the camera mounted on the robot by essential matrix decomposition, after which the position of the robot can be computed easily. Both calibrated and uncalibrated cases are discussed, and relative position estimation based on a calibrated camera turns out to be the better choice. Experimental results show that our approach is effective and reliable under illumination changes, similarity transformations and extraneous features. © 2004 IEEE.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The implementation of the precautionary approach in the mid-1990s required commercial fish stocks to be classified into different categories. These are based on the degree to which stocks have been exploited or are threatened by fishing activities. According to current ICES terminology, stocks are classified as being either “within” or “outside safe biological limits”, or as being “harvested outside safe biological limits”. Between 1996 and 2002, the relative share of stocks in these three categories remained relatively stable (at about 20 %, 30 % and 15 %, respectively). Over the same time span, the number of stocks for which insufficient data are available to quantify, and thus to appropriately classify, the state of the spawning stock biomass (“status unknown”) has increased. Neglecting potential impacts of fishing pressure, the combined average proportion of all stocks with sufficiently high spawning stock biomass is about one third, while only one fifth of the stocks assessed have been managed sustainably. For some important fish stocks in the ICES environment, specifically demersal ones, science recently had to call for rebuilding plans or even a closure of the fishery to allow recovery, in spite of the management’s agreement to manage the resources according to the precautionary approach. This obvious difference between approach and implementation has a number of potential causes: erroneous or imprecise input data (landings, discard and sampling information), insufficient assessment models, problems in the understanding of the scientific advice, and implementation errors. The latter could be either a difference between advised and implemented total allowable catches (TACs), or an excess of legal TACs. During the fifteen years covered by this analysis (1987 to 2002), the average deviation between the implemented TACs for a specific stock and that recommended by ICES for the same stock was more than 30 %. 
The overall average deviation (summed over all stocks) for the entire period was 34 %, excluding, however, four extreme outliers in the data, representing cases in which scientific recommendations were exceeded by as much as 1000 to 2500 %. If these were included, the overall average would be as high as 45 %. The annual deviation has substantially increased in recent years (from roughly 20 % in earlier years of the surveyed period). This recently observed high deviation also matches ICES’s estimate that the fishing mortality in the ICES convention area in the 1990s was well above recommended sustainable levels in the pelagic and demersal fishery. A direct comparison of scientifically proposed and politically implemented TACs is problematic in many case

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Daily sea surface temperatures have been acquired at the Hopkins Marine Station in Pacific Grove, California since January 20, 1919. This time series is one of the longest oceanographic records along the U.S. west coast. Because of its length, it is well suited for studying climate-related and oceanic variability on interannual, decadal, and interdecadal time scales. The record, however, is not homogeneous, has numerous gaps, contains possible outliers, and the observations were not always collected at the same time each day. Because of these problems, we have undertaken the task of reconstructing this long and unique series. We describe the steps that were taken and the methods that were used in this reconstruction. Although the methods employed are basic, we believe that they are consistent with the quality of the data. The reconstructed record has values at every time point, original or estimated, and has been adjusted for time-of-day variations where this information was available. Possible outliers have also been examined and replaced where their credibility could not be established. Many of the studies that have employed the Hopkins time series have not discussed the issue of data quality or how these problems were addressed. Because of growing interest in this record, it is important that a single, well-documented version be adopted, so that the results of future analyses can be directly compared. Although additional work may be done to further improve the quality of this record, it is now available via the internet. [PDF contains 48 pages]

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Twelve samples (6 original samples and 6 diluted samples) were analysed for their pH values by 14 WEFTA laboratories in an inter-laboratory comparison exercise. The majority of participating laboratories determined the pH values very accurately: the values obtained vary only slightly around the calculated mean (by less than 0.1 pH unit). It could also be demonstrated that the participating institutes were able to analyse pH values in both fishery products and aqueous salt solutions. However, a number of outliers and deviating values were also detected in this exercise. It is therefore of utmost importance to calibrate the pH electrodes at regular intervals and to maintain them carefully. Intra-laboratory comparison measurements are recommended to detect weak points.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

In the problem of one-class classification (OCC), one of the classes, the target class, has to be distinguished from all other possible objects, which are considered nontargets. This situation arises in many biomedical problems, for example in diagnosis, image-based tumor recognition or analysis of electrocardiogram data. In this paper, an approach to OCC based on a typicality test is experimentally compared with reference state-of-the-art OCC techniques (Gaussian, mixture of Gaussians, naive Parzen, Parzen, and support vector data description) using biomedical data sets. We evaluate the ability of the procedures using twelve experimental data sets with not necessarily continuous data. As there are few benchmark data sets for one-class classification, all data sets considered in the evaluation have multiple classes. Each class in turn is considered as the target class, and the units in the other classes are considered as new units to be classified. The results of the comparison show the good performance of the typicality approach, which is applicable to high-dimensional data. It is worth mentioning that it can be used for any kind of data (continuous, discrete, or nominal), whereas the application of state-of-the-art approaches is not straightforward when nominal variables are present.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Prostate cancer is the most frequently occurring neoplasm among Brazilian men. Currently, most of these tumors are confined to the prostate at the time of diagnosis. However, many tumors clinically classified as localized are not in fact localized, leading to ineffective curative treatment indications. On the other hand, many patients with clinically insignificant cancer are treated unnecessarily because of the prognostic limitations of clinical staging [...] clinical (pre-treatment) [...] of patients with a histological diagnosis of localized prostate adenocarcinoma (stages I and II), in a hospital cohort of patients treated at the Instituto Nacional de Câncer, Rio de Janeiro, and enrolled between 1990 and 1999. Survival functions were calculated using the Kaplan-Meier estimator, taking the date of histological diagnosis as the starting point and deaths whose underlying cause was prostate cancer as events. To assess clinical prognostic factors, hazard ratios (HR) with 95% confidence intervals were calculated following the Cox proportional hazards model. The following variables were analyzed as independent prognostic factors: age, skin color, educational level, date of first treatment, degree of cellular differentiation of the biopsied primary tumor (Gleason), clinical staging, and pre-treatment total PSA. The proportional hazards assumption was assessed by analysis of the Schoenfeld residuals, and the influence of outliers by the martingale and score residuals. A total of 258 patients were selected according to the study's eligibility criteria, of whom 46 died during follow-up. Overall survival was 88% at 5 years and 71% at 10 years. Age over 80 years, Gleason score above 6, PSA above 40 ng/ml, stage B2, and white skin color were independent markers of worse prognosis. 
Classic prognostic factors from the literature were useful in estimating prognosis in this hospital cohort. The results show that, for patients diagnosed at early stages, the socioeconomic factors analyzed did not influence prognosis. Further studies should be conducted in the country to investigate differences in prognosis in relation to ethnicity.
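
The Kaplan-Meier estimator mentioned in this abstract multiplies, at each observed event time, the running survival probability by the fraction of at-risk patients who did not experience the event. The following is a minimal sketch of the standard product-limit calculation, not the study's own analysis code; the follow-up times and event flags are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time of each subject
    events -- 1 if the event (death) was observed at that time, 0 if censored
    Returns a list of (event_time, survival_probability) pairs.
    """
    n_at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical follow-up data (years); the second subject is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored subjects never lower the curve themselves; they only shrink the at-risk count used at later event times.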

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Growth of a temperate reef-associated fish, the purple wrasse (Notolabrus fucicola), was examined at two sites on the east coast of Tasmania by using age- and length-based models. Models based on the von Bertalanffy growth function, in the standard and a reparameterized form, were constructed by using otolith-derived age estimates. Growth trajectories from tag-recaptures were used to construct length-based growth models derived from the GROTAG model, itself a reparameterization of the Fabens model. Likelihood ratio tests (LRTs) determined the optimal parameterization of the GROTAG model, including estimators of individual growth variability, seasonal growth, measurement error, and outliers for each data set. Growth models and parameter estimates were compared by bootstrap confidence intervals, LRTs, randomization tests, and plots of bootstrap parameter estimates. The relative merit of these methods for comparing models and parameters was evaluated; LRTs combined with bootstrapping and randomization tests provided the most insight into the relationships between parameter estimates. Significant differences in growth of purple wrasse were found between sites in both length- and age-based models. A significant difference in the peak growth season was found between sites, and a large difference in growth rate between sexes was found at one site with the use of length-based models.
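
The age-based and length-based models named in this abstract are two views of the same curve: the standard von Bertalanffy function gives length at age, and the Fabens form gives the expected growth increment of a tagged fish between release and recapture. The sketch below shows only these two textbook formulas, not the GROTAG parameterization used in the study; the parameter values are hypothetical.

```python
from math import exp


def von_bertalanffy(age, l_inf, k, t0):
    """Length at age: L(t) = L_inf * (1 - exp(-k * (t - t0)))."""
    return l_inf * (1.0 - exp(-k * (age - t0)))


def fabens_increment(length_at_release, time_at_liberty, l_inf, k):
    """Expected growth between tagging and recapture:
    dL = (L_inf - L1) * (1 - exp(-k * dt))."""
    return (l_inf - length_at_release) * (1.0 - exp(-k * time_at_liberty))


# Hypothetical parameters (mm, per year, years); not estimates from the study.
L_INF, K, T0 = 350.0, 0.15, -0.5

length_at_3 = von_bertalanffy(3.0, L_INF, K, T0)
length_at_5 = von_bertalanffy(5.0, L_INF, K, T0)
increment = fabens_increment(length_at_3, 2.0, L_INF, K)
```

With identical parameters, the Fabens increment over two years equals the difference between the two lengths at age, which is why tag-recapture data can be fitted without knowing the age of the fish.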

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The granitoids of the Cambuci Domain, in the border region between the states of Rio de Janeiro and Espírito Santo, were divided into four main groups: (1) the Serra da Bolívia Complex (CSB), comprising Heterogeneous Orthogranulites and Orthogneisses, Foliated Gray Orthogneiss, and charnockites of the Monte Verde Region; (2) gneissified leucogranites/leucocharnockites of the São João do Paraíso Suite (SSJP); (3) Foliated Gray Granite; and (4) isotropic Leucogranite. The CSB is characterized by I-type calc-alkaline magmatism originating in a volcanic arc setting (Monte Verde Suite) and from crustal reworking (leucocratic orthogranulites). The fine-grained greenish Orthogranulite is considered in the present study to be basement rock of the Oriental Terrane, crystallized during the Paleoproterozoic (Rhyacian, 2184.3 ± 21 Ma) and recrystallized during the Brasiliano metamorphic event in the Neoproterozoic (Ediacaran, 607.2 ± 1.5 Ma), with a TDM age of 2936 Ma. The medium-grained leucocratic Orthogranulite crystallized in the Neoproterozoic (Ediacaran, between 592 and 609 Ma), has a TDM age of ca. 2100 Ma, and records Paleoproterozoic inheritance. The Monte Verde Suite is characterized by calc-alkaline magmatism and the Córrego Fortaleza Suite by high-K calc-alkaline magmatism, both with a magmatic arc signature. They record two magmatic pulses in the Neoproterozoic (Ediacaran): one at 592 ± 2 Ma, the age of the charnoenderbite, with a TDM age of 1797 Ma, and another at 571.2 ± 1.8 Ma (injection of a charnockitoid). Protomylonitic, mylonitic, and locally ultramylonitic features are recorded in all CSB rocks. Geochemical data indicate that the SSJP granitoids belong to the high-K calc-alkaline series and were generated in the Neoproterozoic (ages ranging from 610.3 ± 4.7 Ma to 592.2 ± 1.3 Ma). The TDM ages show discrepant values for two samples, 1918 Ma and 2415 Ma, suggesting that they were generated from different sources. 
The Foliated Gray Granite belongs to the Shoshonitic Series, is metaluminous and I-type, and has the tectonic setting of intraplate granites. However, it could have formed in a cordilleran arc setting, with contamination from other crustal sources. This is supported by the calculated TDM ages (≈ 1429–1446 Ma). The isotropic Leucogranite occurs as NW-trending dikes, has a massive texture, and is inequigranular. Geochemical data show that these are metaluminous I-type granitoids of the shoshonitic series and, according to their tectonic setting, intraplate granites. The Isotropic Leucogranite represents the post-collisional magmatism that occurred 80 to 90 Myr after the end of the collisional event in the central region of the Ribeira Belt. The Isotropic Leucogranite crystallized in the Cambrian (512.3 ± 3.3 Ma and 508.6 ± 2.2 Ma), with TDM ages of ca. 1900

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The aim of this work was to establish a model, using multivariate regression tools, to predict the methyl ester content and, simultaneously, physicochemical properties of blends of soybean oil and soybean biodiesel. The model was built by correlating the properties of interest with the mid-infrared attenuated total reflectance spectra of the blends. High-performance liquid chromatography (HPLC) was used to determine the methyl ester content, and it may serve as an alternative to the reference methods based on gas chromatography (EN 14103 and EN 14105). The physicochemical properties selected were refractive index, density, and viscosity. For the study, 11 blends with different proportions of soybean biodiesel and soybean oil (0-100 % by mass of soybean biodiesel) were prepared in quintuplicate, totaling 55 samples. The infrared region studied was 3801 to 650 cm⁻¹. The spectra were preprocessed by multiplicative signal correction (MSC) followed by mean centering (MC). The properties of interest were autoscaled. Principal component analysis (PCA) was then applied to reduce the dimensionality of the data and to detect anomalous values. When these were detected, the sample was discarded. The original data were submitted to the Kennard-Stone algorithm, which split them into a calibration set, used to build the model, and a validation set, used to check its reliability. 
The results showed that the model built by PLS2 (Partial Least Squares) fitted the refractive index and density data well. The errors behaved randomly, indicating homoscedasticity of the residuals; in other words, the model was able to predict density and refractive index at the 95% confidence level. The accuracy of the model was also assessed by estimating the regression parameters, slope and intercept, using the Elliptical Joint Confidence Region (EJCR). The results confirmed that the MIR-PLS model developed was able to predict refractive index and density simultaneously. For the methyl ester contents determined by HPLC, a MIR-PLS model was also developed to correlate these values with the MIR spectra, but the quality of the fit was not as good. Nevertheless, it was possible to show that the data can be modeled and correlated with the infrared spectra using multivariate calibration.
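
The Kennard-Stone split used in this abstract picks calibration samples so that they cover the data space as uniformly as possible: it starts from the two most distant samples and then repeatedly adds the sample farthest from everything already chosen. The sketch below is a minimal pure-Python version of that standard procedure, assuming Euclidean distance; the function name and the one-dimensional toy data are hypothetical and unrelated to the study's spectra.

```python
from math import dist


def kennard_stone(samples, n_select):
    """Split sample indices into (calibration, validation) by Kennard-Stone."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(samples)")

    # Seed the calibration set with the two most distant samples.
    best = (-1.0, 0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(samples[i], samples[j])
            if d > best[0]:
                best = (d, i, j)
    selected = [best[1], best[2]]
    remaining = [i for i in range(n) if i not in selected]

    # Repeatedly add the sample whose nearest selected neighbour is farthest away.
    while len(selected) < n_select:
        nxt = max(
            remaining,
            key=lambda i: min(dist(samples[i], samples[s]) for s in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining


# Hypothetical one-dimensional "spectra", for illustration only.
calibration, validation = kennard_stone([[0.0], [1.0], [2.0], [10.0]], 3)
```

Samples left over after the selection form the validation set, so the validation samples always lie inside the region spanned by the calibration samples.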

Relevance:

10.00% 10.00%

Publisher:

Abstract:

In this work, a mathematical method was used to classify potential and current records from corrosion tests carried out with the zero-resistance ammetry (ZRA) technique. A simple multivariate statistical method, Principal Component Analysis (PCA), was applied with the main aim of identifying patterns in these electrochemical noise data. UNS G10200 carbon steel, UNS S31600 austenitic stainless steel, and UNS S32750 superduplex stainless steel were tested in sulfuric acid (5% H2SO4), ferric chloride (0.1 mol/L FeCl3), and sodium hydroxide (0.1% NaOH) media. Each test was repeated eight times to ensure reproducibility and to characterize the statistical aspects involved. The results showed that principal component analysis can be used as a tool for analyzing electrochemical noise signals, identifying clusters of potential-time and current-time behavior and, additionally, identifying outliers in the time records.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Understanding the regulatory mechanisms that are responsible for an organism's response to environmental change is an important issue in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, in both static and time-course experiments, have been proposed. While these tests answer the question of whether a gene is differentially expressed, they do not explicitly address the question of when a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a two-sample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates, and is robust with respect to outliers. We apply our algorithm to study the response of Arabidopsis thaliana genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 observed time points. In classification experiments, our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

To bring out the relative efficiency of various types of fishing gear in the analysis of catch data, a combination of Tukey's test, a consequent transformation, and graphical analysis for outlier elimination has been introduced, which can be advantageously used when applying ANOVA techniques. Application of these procedures to actual sets of data showed that nonadditivity in the data was caused by the presence of outliers, the absence of a suitable transformation, or both. As a corollary, the concurrent model $X_{ij} = \mu + \alpha_i + \beta_j + \lambda \alpha_i \beta_j + E_{ij}$ adequately fits the data.
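
Tukey's one-degree-of-freedom test checks whether the interaction in a two-way table without replication has the product form λ α_i β_j that appears in the concurrent model above. The sketch below computes the standard sums of squares for that test; it is an illustration under the assumption of one observation per cell, not the authors' procedure, and the example tables are hypothetical.

```python
def tukey_nonadditivity(table):
    """Tukey's one-degree-of-freedom test for a two-way table (one value per cell).

    Returns (ss_nonadditivity, ss_remainder, df_remainder).
    Assumes the row effects and column effects are not all zero.
    """
    a, b = len(table), len(table[0])
    grand = sum(map(sum, table)) / (a * b)
    row_eff = [sum(row) / b - grand for row in table]
    col_eff = [sum(table[i][j] for i in range(a)) / a - grand for j in range(b)]

    # Sum of squares explained by the lambda * alpha_i * beta_j term.
    cross = sum(
        table[i][j] * row_eff[i] * col_eff[j] for i in range(a) for j in range(b)
    )
    ss_nonadd = cross ** 2 / (
        sum(e * e for e in row_eff) * sum(e * e for e in col_eff)
    )

    # Residual sum of squares of the purely additive model.
    ss_resid = sum(
        (table[i][j] - grand - row_eff[i] - col_eff[j]) ** 2
        for i in range(a)
        for j in range(b)
    )
    return ss_nonadd, ss_resid - ss_nonadd, (a - 1) * (b - 1) - 1


# Hypothetical tables: one purely additive, one purely multiplicative.
additive = tukey_nonadditivity([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
multiplicative = tukey_nonadditivity([[1, 2, 4], [2, 4, 8], [3, 6, 12]])
```

When the remainder sum of squares is positive, the test statistic is F = ss_nonadditivity / (ss_remainder / df_remainder), referred to an F distribution with 1 and df_remainder degrees of freedom. A large value suggests that a transformation (for example, logarithms for multiplicative data) or outlier removal is needed before ANOVA.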