21 results for Outliers

in Repositório Institucional UNESP - Universidade Estadual Paulista "Julio de Mesquita Filho"


Relevance:

20.00%

Publisher:

Abstract:

Graduate Program in Computer Science - IBILCE

Relevance:

10.00%

Publisher:

Abstract:

Parent, L. E., Natale, W. and Ziadi, N. 2009. Compositional nutrient diagnosis of corn using the Mahalanobis distance as nutrient imbalance index. Can. J. Soil Sci. 89: 383-390. Compositional nutrient diagnosis (CND) provides a plant nutrient imbalance index (CND-r²) with an assumed χ² distribution. The Mahalanobis distance D², which detects outliers in compositional data sets, also has a χ² distribution. The objective of this paper was to compare the D² and CND-r² nutrient imbalance indexes in corn (Zea mays L.). We measured grain yield as well as N, P, K, Ca, Mg, Cu, Fe, Mn, and Zn concentrations in the ear leaf at silk stage for 210 calibration sites in the St. Lawrence Lowlands [2300-2700 corn thermal units (CTU)] as well as 30 phosphorus (2300-2700 CTU; 10 sites) and 10 nitrogen (1900-2100 CTU; one site) replicated fertilizer treatments for validation. We derived CND norms as mean, standard deviation, and the inverse covariance matrix of centred log ratios (clr) for high-yielding specimens (≥ 9.0 Mg grain ha⁻¹ at 150 g H₂O kg⁻¹ moisture content) in the 2300-2700 CTU zone. Using χ² = 17 (P < 0.05) with nine degrees of freedom (i.e., nine nutrients) as a rejection criterion for outliers and a yield threshold of 8.6 Mg ha⁻¹ after Cate-Nelson partitioning between low- and high-yielders in the P validation data set, D² misclassified two specimens compared with nine for CND-r². The D² classification was not significantly different from a χ² classification (P > 0.05), but the CND-r² classification differed significantly from χ² or D² (P < 0.001). A threshold value for nutrient imbalance could thus be derived probabilistically for conducting D² diagnosis, while the CND-r² nutrient imbalance threshold must be calibrated using fertilizer trials. In the proposed CND-D² procedure, D² is first computed to classify the specimen as a possible outlier. Thereafter, nutrient indices are ranked in their order of limitation. The D² norms appeared less effective in the 1900-2100 CTU zone.
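
The D² screening step in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the reference concentrations, the three-nutrient composition (N, P, K) closed with a filling value, and all function names are hypothetical. The cut-off 7.815 is the χ² critical value (P = 0.05) for three degrees of freedom, used here in place of the value of 17 quoted for nine nutrients.

```python
import math

def clr(parts):
    """Centred log ratios: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    centre = sum(logs) / len(logs)
    return [v - centre for v in logs]

def nutrient_clr(nutrients, total=1000.0):
    """clr of the nutrients (g/kg) after closing the composition with a
    filling value. The filling value's own clr is dropped so that the
    covariance matrix stays invertible (an assumption of this sketch)."""
    filling = total - sum(nutrients)
    return clr(list(nutrients) + [filling])[:-1]

def mean_and_cov(rows):
    """Column means and sample covariance matrix of a list of rows."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return mean, cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis_sq(x, mean, cov):
    """D^2 = (x - mean)' cov^{-1} (x - mean), without forming the inverse."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(di * si for di, si in zip(d, solve(cov, d)))

# Hypothetical high-yield reference specimens: N, P, K in g/kg.
reference = [
    [30.0, 3.0, 20.0], [32.0, 3.2, 22.0], [28.0, 2.9, 19.0],
    [31.0, 3.4, 21.0], [29.0, 3.1, 23.0], [33.0, 2.8, 20.0],
]
mean, cov = mean_and_cov([nutrient_clr(s) for s in reference])

CHI2_CRIT_3DF = 7.815  # chi-square critical value, P = 0.05, 3 d.f.

def is_imbalanced(specimen):
    """True when the specimen's D^2 exceeds the chi-square cut-off."""
    return mahalanobis_sq(nutrient_clr(specimen), mean, cov) > CHI2_CRIT_3DF
```

A call such as `is_imbalanced([30.0, 3.0, 20.0])` then classifies a new specimen against the reference norms. With nine nutrients the same code applies once the cut-off is replaced by the nine-degree-of-freedom value.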

Relevance:

10.00%

Publisher:

Abstract:

Background: Sugarcane is an increasingly economically and environmentally important C4 grass, used for the production of sugar and bioethanol, a low-carbon emission fuel. Sugarcane originated from crosses of Saccharum species and is noted for its unique capacity to accumulate high amounts of sucrose in its stems. Environmental stresses enormously limit sugarcane productivity worldwide. To investigate transcriptome changes in response to environmental inputs that alter yield, we used cDNA microarrays to profile expression of 1,545 genes in plants submitted to drought, phosphate starvation, herbivory and N₂-fixing endophytic bacteria. We also investigated the response to phytohormones (abscisic acid and methyl jasmonate). The arrayed elements correspond mostly to genes involved in signal transduction, hormone biosynthesis, transcription factors, novel genes and genes corresponding to unknown proteins. Results: Using an outlier search method, 179 genes with strikingly different expression levels were identified as differentially expressed in at least one of the treatments analysed. Self Organizing Maps were used to cluster the expression profiles of 695 genes that showed a highly correlated expression pattern among replicates. The expression data for 22 genes were evaluated for 36 experimental data points by quantitative RT-PCR, indicating a validation rate of 80.5% using three biological experimental replicates. The SUCAST Database was created to provide public access to the data described in this work, linked to tissue expression profiling and the SUCAST gene category and sequence analysis. The SUCAST database also includes a categorization of the sugarcane kinome based on a phylogenetic grouping that included 182 undefined kinases. Conclusion: An extensive study on the sugarcane transcriptome was performed. Sugarcane genes responsive to phytohormones and to challenges sugarcane commonly deals with in the field were identified. Additionally, the protein kinases were annotated based on a phylogenetic approach. The experimental design and statistical analysis applied proved robust enough to unravel genes associated with a diverse array of conditions, attributing novel functions to previously unknown or undefined genes. The data consolidated in the SUCAST database resource can guide further studies and be useful for the development of improved sugarcane varieties.

Relevance:

10.00%

Publisher:

Abstract:

Statistical analysis of data is crucial in cephalometric investigations. There are certainly excellent examples of good statistical practice in the field, but some articles published worldwide have carried out inappropriate analyses. Objective: The purpose of this study was to show that when the double records of each patient are traced on the same occasion, a control chart for differences between readings needs to be drawn, and limits of agreement and coefficients of repeatability must be calculated. Material and methods: Data from a well-known paper in Orthodontics were used for showing common statistical practices in cephalometric investigations and for proposing a new technique of analysis. Results: A scatter plot of the two radiograph readings and the two model readings with the respective regression lines are shown. Also, a control chart for the mean of the differences between radiograph readings was obtained and a coefficient of repeatability was calculated. Conclusions: A standard error assuming that mean differences are zero, which is referred to in Orthodontics and Facial Orthopedics as the Dahlberg error, can be calculated only for estimating precision if accuracy is already proven. When double readings are collected, limits of agreement and coefficients of repeatability must be calculated. A graph with differences of readings should be presented and outliers discussed.
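
The quantities named in this abstract can be computed from paired readings as in the sketch below. It is an illustration under stated assumptions, not the paper's procedure: the function name and the returned keys are hypothetical, the coefficient of repeatability is taken as 1.96 times the standard deviation of the differences (one common convention), and the Dahlberg error uses the usual formula sqrt(sum(d²) / 2n).

```python
import math
import statistics

def agreement_summary(first, second, z=1.96):
    """Summarise agreement between two sets of readings on the same cases."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    lower, upper = mean_diff - z * sd_diff, mean_diff + z * sd_diff
    return {
        "mean_diff": mean_diff,
        "limits_of_agreement": (lower, upper),
        # Assumed convention: z times the SD of the differences.
        "repeatability": z * sd_diff,
        # Dahlberg error: only a precision estimate if the mean
        # difference is already known to be zero.
        "dahlberg": math.sqrt(sum(d * d for d in diffs) / (2 * n)),
        # Indices of differences falling outside the limits of agreement.
        "outliers": [i for i, d in enumerate(diffs) if not lower <= d <= upper],
    }
```

Plotting the differences against case order with the two limits drawn as horizontal lines gives the control chart the abstract calls for. The `outliers` entry lists the readings that would need discussion.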

Relevance:

10.00%

Publisher:

Abstract:

Linear mixed effects models are frequently used to analyse longitudinal data, due to their flexibility in modelling the covariance structure between and within observations. Further, it is easy to deal with unbalanced data, either with respect to the number of observations per subject or per time period, and with varying time intervals between observations. In most applications of mixed models to biological sciences, a normal distribution is assumed both for the random effects and for the residuals. This, however, makes inferences vulnerable to the presence of outliers. Here, linear mixed models employing thick-tailed distributions for robust inferences in longitudinal data analysis are described. Specific distributions discussed include the Student-t, the slash and the contaminated normal. A Bayesian framework is adopted, and the Gibbs sampler and the Metropolis-Hastings algorithms are used to carry out the posterior analyses. An example with data on orthodontic distance growth in children is discussed to illustrate the methodology. Analyses based on either the Student-t distribution or on the usual Gaussian assumption are contrasted. The thick-tailed distributions provide an appealing robust alternative to the Gaussian process for modelling distributions of the random effects and of residuals in linear mixed models, and the MCMC implementation allows the computations to be performed in a flexible manner.
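
Why a thick-tailed distribution protects inferences from outliers can be seen in a much simpler setting than the mixed model of this abstract. The sketch below swaps in a different technique: instead of the Gibbs and Metropolis-Hastings samplers, it uses an EM iteration to estimate the location and scale of a single Student-t sample with fixed degrees of freedom. The function name, the choice of ν = 4 and the starting values are assumptions made for illustration.

```python
import statistics

def student_t_location(data, nu=4.0, iterations=200):
    """EM estimate of location and scale under a Student-t model
    with known degrees of freedom nu."""
    mu = statistics.median(data)
    sigma2 = statistics.pvariance(data)
    weights = [1.0] * len(data)
    for _ in range(iterations):
        # E-step: observations far from mu receive small weights.
        weights = [(nu + 1.0) / (nu + (y - mu) ** 2 / sigma2) for y in data]
        # M-step: weighted mean, then weighted scale around the new mean.
        mu = sum(w * y for w, y in zip(weights, data)) / sum(weights)
        sigma2 = sum(w * (y - mu) ** 2
                     for w, y in zip(weights, data)) / len(data)
    return mu, sigma2, weights
```

With data such as `[1.0, 2.0, 3.0, 4.0, 100.0]`, the value 100 receives the smallest weight and pulls the estimate far less than it pulls the arithmetic mean. Under a Gaussian model every weight would equal one.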

Relevance:

10.00%

Publisher:

Abstract:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevance:

10.00%

Publisher:

Abstract:

Two Kalman-filter formulations are presented for the estimation of spacecraft sensor misalignments from inflight data. In the first, the sensor misalignments are part of the filter state variable; in the second, which we call HYLIGN, the state vector contains only dynamical variables, but the sensitivities of the filter innovations to the misalignments are calculated within the Kalman filter. This procedure permits the misalignments to be estimated in batch mode and allows a much smaller dimension for the Kalman filter state vector. This results not only in a significantly smaller computational burden but also in a smaller sensitivity of the misalignment estimates to outliers in the data. Numerical simulations of the filter performance are presented.

Relevance:

10.00%

Publisher:

Abstract:

We propose alternative approaches to analyze residuals in binary regression models based on random effect components. Our preferred model does not depend upon any tuning parameter, being completely automatic. Although the focus is mainly on accommodation of outliers, the proposed methodology is also able to detect them. Our approach consists of evaluating the posterior distribution of random effects included in the linear predictor. The evaluation of the posterior distributions of interest involves cumbersome integration, which is easily dealt with through stochastic simulation methods. We also discuss different specifications of prior distributions for the random effects. The potential of these strategies is compared in a real data set. The main finding is that the inclusion of extra variability accommodates the outliers, improving the adjustment of the model substantially, besides correctly indicating the possible outliers.

Relevance:

10.00%

Publisher:

Abstract:

Deals with some common problems in structural analysis when calculating the experimental semi-variogram and fitting a semi-variogram model. Geochemical data were used and the following cases were studied: regular versus irregular sampling grid, presence of outlier values, skewed distributions due to a high variability of the data, and estimation using a kriging procedure. -from English summary
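
The experimental semi-variogram mentioned here is commonly computed as half the mean squared difference between pairs of samples grouped by separation distance. The sketch below assumes that classical estimator, two-dimensional coordinates and fixed-width lag classes. The function name and the binning rule are illustrative choices, not taken from the summarized work.

```python
import math
from collections import defaultdict

def experimental_semivariogram(points, values, lag_width):
    """Return {lag class: (mean distance, semivariance, number of pairs)}.

    Pairs are grouped into classes [k * lag_width, (k + 1) * lag_width),
    so the function works for regular and irregular sampling alike."""
    sq_sums = defaultdict(float)
    dist_sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
            k = int(h // lag_width)
            sq_sums[k] += (values[i] - values[j]) ** 2
            dist_sums[k] += h
            counts[k] += 1
    return {k: (dist_sums[k] / counts[k],
                sq_sums[k] / (2 * counts[k]),
                counts[k])
            for k in sorted(counts)}
```

Because each pair contributes a squared difference, a single outlier value inflates every lag class it takes part in, which is one reason outliers and skewed distributions complicate model fitting.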

Relevance:

10.00%

Publisher:

Abstract:

Two Kalman-filter formulations are presented for the estimation of spacecraft sensor misalignments from inflight data. In the first, the sensor misalignments are part of the filter state variable; in the second, the state vector contains only dynamical variables, but the sensitivities of the filter innovations to the misalignments are calculated within the Kalman filter. This procedure permits the misalignments to be estimated in batch mode and allows a much smaller dimension for the Kalman filter state vector. This results not only in a significantly smaller computational burden but also in a smaller sensitivity of the misalignment estimates to outliers in the data. Numerical simulations of the filter performance are presented.

Relevance:

10.00%

Publisher:

Abstract:

Linear mixed effects models have been widely used in analysis of data where responses are clustered around some random effects, so it is not reasonable to assume independence between observations in the same cluster. In most biological applications, it is assumed that the distributions of the random effects and of the residuals are Gaussian. This makes inferences vulnerable to the presence of outliers. Here, linear mixed effects models with normal/independent residual distributions for robust inferences are described. Specific distributions examined include univariate and multivariate versions of the Student-t, the slash and the contaminated normal. A Bayesian framework is adopted and Markov chain Monte Carlo is used to carry out the posterior analysis. The procedures are illustrated using birth weight data on rats in a toxicological experiment. Results from the Gaussian and robust models are contrasted, and it is shown how the implementation can be used for outlier detection. The thick-tailed distributions provide an appealing robust alternative to the Gaussian process in linear mixed models, and they are easily implemented using data augmentation and MCMC techniques.
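
The normal/independent family named in this abstract is usually written as a scale mixture of normals. The block below gives that standard textbook form for a univariate residual. The notation (e_i for the residual, u_i for the mixing weight, σ², ν, λ, π, and the shape-rate Gamma parameterization) is assumed here rather than taken from the paper.

```latex
\begin{aligned}
e_i \mid u_i &\sim N\!\left(0,\ \sigma^2 / u_i\right), \qquad u_i > 0, \\
\text{Student-}t:\quad & u_i \sim \mathrm{Gamma}\!\left(\nu/2,\ \nu/2\right), \\
\text{slash}:\quad & u_i \sim \mathrm{Beta}(\nu,\ 1), \\
\text{contaminated normal}:\quad & u_i = \lambda \ \text{with probability } \pi, \quad
  u_i = 1 \ \text{with probability } 1 - \pi, \quad 0 < \lambda < 1, \\
\text{Student-}t \text{ update}:\quad & u_i \mid e_i, \sigma^2, \nu \sim
  \mathrm{Gamma}\!\left(\frac{\nu + 1}{2},\ \frac{\nu + e_i^2/\sigma^2}{2}\right).
\end{aligned}
```

The last line is the data augmentation step for the Student-t case: given the weights u_i, the model is Gaussian again, and an observation whose posterior weight is close to zero is a candidate outlier.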

Relevance:

10.00%

Publisher:

Abstract:

Forest roads are frequently pointed to as a source of environmental problems related to erosion, and they also influence harvest cost due to maintenance operations. Poorly designed roads are sources of hydrological problems in catchments, and the current attention to the sustainability of forest exploration projects points to the need for diagnostic tools to guide the redesign of the road system. In this study, runoff hydrological indicators for forest road segments were assessed in order to identify critical points of erosion and water concentration on soils. A road network of a forest production area was divided into 252 road segments that were used as observations of four variables: mean terrain slope, main segment slope, LS factor and topographic index. The data analysis was based on descriptive statistics to identify outliers, principal component analysis to study variability between variables and between observations, and cluster analysis to identify groups of similar segments. The results allowed road segments to be classified into five main road types: road on the ridge, on the valley, on the slopes, on the slopes but along a contour line, and on the steepest slope. The indicators were able to highlight the most critical segments, which differ from the others and are potential sources of erosion and water accumulation problems on forest roads. The principal component analysis showed two main sources of variability, related to terrain topographic characteristics and to road design, showing that the indicators represent those elements well. The methodology seems to be appropriate for identifying critical road segments that need to be redesigned and also for road network planning in new forest exploration projects.

Relevance:

10.00%

Publisher:

Abstract:

In this work, a new approach for supervised pattern recognition is presented which improves the learning algorithm of the Optimum-Path Forest classifier (OPF), centered on detection and elimination of outliers in the training set. Identification of outliers is based on a penalty computed for each sample in the training set from the corresponding number of imputable false positive and false negative classification of samples. This approach enhances the accuracy of OPF while still gaining in classification time, at the expense of a slight increase in training time. © 2010 Springer-Verlag.

Relevance:

10.00%

Publisher:

Abstract:

Identification and classification of overlapping nodes in networks are important topics in data mining. In this paper, a network-based (graph-based) semi-supervised learning method is proposed. It is based on competition and cooperation among walking particles in a network to uncover overlapping nodes by generating continuous-valued outputs (soft labels), corresponding to the levels of membership from the nodes to each of the communities. Moreover, the proposed method can be applied to detect overlapping data items in a data set of general form, such as a vector-based data set, once it is transformed to a network. Usually, label propagation involves risks of error amplification. In order to avoid this problem, the proposed method offers a mechanism to identify outliers among the labeled data items, and consequently prevents error propagation from such outliers. Computer simulations carried out for synthetic and real-world data sets provide a numeric quantification of the performance of the method. © 2012 Springer-Verlag.

Relevance:

10.00%

Publisher:

Abstract:

A meta-analysis was used to evaluate the performance of piglets in the post-weaning period, without imposition of a sanitary challenge, fed diets containing spray-dried blood plasma (SDBP). Piglets face normal challenges in the post-weaning period, such as environmental stress and the replacement of the liquid diet with a solid one. References regarding sanitary challenges were disregarded in this study. Only data regarding normal and expected challenges were considered. Data were obtained from indexed journals, with information extracted from the material, methods and results sections of pre-selected scientific articles. First, the database was analyzed graphically to observe the distribution of data and the presence of outliers. Afterwards, correlation analysis and variance-covariance analyses were carried out. The database contained a total of 23 articles. The average initial weight of the piglets was 8.02 kg (4.00-9.28 kg) and the average initial age was 27 days (14-32 days). The average duration of feeding diets containing SDBP was 9 days (6-28 days). SDBP increased feed conversion by 20.2% (P<0.05) during the initial period. Feed conversion during the total period was 10.2% higher (P<0.05) for animals fed with SDBP. Average daily weight gain and daily feed intake were not affected (P>0.05) during the entire period, but average daily gain was higher (P<0.05) for animals fed with SDBP during the initial period. The initial age of supplementation influenced the average daily weight gain and average daily feed intake of animals fed with SDBP. Better results were obtained than those obtained for animals up to 35 days of age fed diets without added SDBP supplementation. In the early post-weaning period for piglets weaned up to 35 days of age, SDBP supplementation positively influenced the average daily weight gain and feed conversion. © 2013 Elsevier B.V.