311 results for Outliers
Abstract:
Parent, L. E., Natale, W. and Ziadi, N. 2009. Compositional nutrient diagnosis of corn using the Mahalanobis distance as nutrient imbalance index. Can. J. Soil Sci. 89: 383-390. Compositional nutrient diagnosis (CND) provides a plant nutrient imbalance index (CND-r²) with assumed χ² distribution. The Mahalanobis distance D², which detects outliers in compositional data sets, also has a χ² distribution. The objective of this paper was to compare the D² and CND-r² nutrient imbalance indexes in corn (Zea mays L.). We measured grain yield as well as N, P, K, Ca, Mg, Cu, Fe, Mn, and Zn concentrations in the ear leaf at silk stage for 210 calibration sites in the St. Lawrence Lowlands [2300-2700 corn thermal units (CTU)] as well as 30 phosphorus (2300-2700 CTU; 10 sites) and 10 nitrogen (1900-2100 CTU; one site) replicated fertilizer treatments for validation. We derived CND norms as the mean, standard deviation, and inverse covariance matrix of centred log ratios (clr) for high-yielding specimens (≥ 9.0 Mg grain ha⁻¹ at 150 g H₂O kg⁻¹ moisture content) in the 2300-2700 CTU zone. Using χ² = 17 (P < 0.05) with nine degrees of freedom (i.e., nine nutrients) as a rejection criterion for outliers and a yield threshold of 8.6 Mg ha⁻¹ after Cate-Nelson partitioning between low- and high-yielders in the P validation data set, D² misclassified two specimens compared with nine for CND-r². The D² classification was not significantly different from a χ² classification (P > 0.05), but the CND-r² classification differed significantly from χ² or D² (P < 0.001). A threshold value for nutrient imbalance could thus be derived probabilistically for conducting D² diagnosis, while the CND-r² nutrient imbalance threshold must be calibrated using fertilizer trials. In the proposed CND-D² procedure, D² is first computed to classify the specimen as a possible outlier; nutrient indices are then ranked in their order of limitation. The D² norms appeared less effective in the 1900-2100 CTU zone.
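A minimal sketch (ours, not the authors' code) of the D² screening step described above: clr-transform the nutrient compositions, compute each specimen's Mahalanobis distance to the centre of the data set, and flag outliers at the χ² cutoff, which is about 17 for nine nutrients at P < 0.05.

```python
import numpy as np
from scipy import stats

def clr(X):
    """Centred log ratios: log of each part minus the row-wise mean log."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def d2_outliers(X, alpha=0.05):
    Z = clr(X)
    mu = Z.mean(axis=0)
    # clr covariance matrices are singular by construction, so we use the
    # pseudo-inverse here; the paper's exact handling may differ.
    S_inv = np.linalg.pinv(np.cov(Z, rowvar=False))
    diff = Z - mu
    d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # Mahalanobis D^2
    cutoff = stats.chi2.ppf(1 - alpha, df=X.shape[1])  # ~16.9 for df = 9
    return d2, d2 > cutoff

# rows = specimens; columns = N, P, K, Ca, Mg, Cu, Fe, Mn, Zn (synthetic data)
X = np.random.lognormal(size=(210, 9))
d2, is_outlier = d2_outliers(X)
```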
Abstract:
Background: Sugarcane is an increasingly economically and environmentally important C4 grass, used for the production of sugar and bioethanol, a low-carbon-emission fuel. Sugarcane originated from crosses of Saccharum species and is noted for its unique capacity to accumulate high amounts of sucrose in its stems. Environmental stresses severely limit sugarcane productivity worldwide. To investigate transcriptome changes in response to environmental inputs that alter yield, we used cDNA microarrays to profile the expression of 1,545 genes in plants subjected to drought, phosphate starvation, herbivory and N₂-fixing endophytic bacteria. We also investigated the response to phytohormones (abscisic acid and methyl jasmonate). The arrayed elements correspond mostly to genes involved in signal transduction and hormone biosynthesis, transcription factors, novel genes and genes encoding unknown proteins. Results: Adopting an outlier search method, 179 genes with strikingly different expression levels were identified as differentially expressed in at least one of the treatments analysed. Self-Organizing Maps were used to cluster the expression profiles of 695 genes that showed a highly correlated expression pattern among replicates. The expression data for 22 genes were evaluated at 36 experimental data points by quantitative RT-PCR, indicating a validation rate of 80.5% using three biological replicates. The SUCAST database was created to provide public access to the data described in this work, linked to tissue expression profiling and to the SUCAST gene categories and sequence analysis. The SUCAST database also includes a categorization of the sugarcane kinome based on a phylogenetic grouping that included 182 undefined kinases. Conclusion: An extensive study of the sugarcane transcriptome was performed. Sugarcane genes responsive to phytohormones and to challenges sugarcane commonly faces in the field were identified. Additionally, the protein kinases were annotated based on a phylogenetic approach. The experimental design and statistical analysis applied proved robust for unravelling genes associated with a diverse array of conditions, attributing novel functions to previously unknown or undefined genes. The data consolidated in the SUCAST database can guide further studies and be useful for the development of improved sugarcane varieties.
Abstract:
In this work, we propose a two-stage algorithm for real-time fault detection and identification in industrial plants. Our proposal is based on the analysis of selected features using recursive density estimation and a new evolving classifier algorithm. More specifically, the proposed approach for the detection stage is based on the concept of density in the data space, which is not the same as a probability density function but is a very useful measure for abnormality/outlier detection. This density can be expressed by a Cauchy function and can be calculated recursively, which makes it memory- and computation-efficient and, therefore, suitable for on-line applications. The identification/diagnosis stage is based on a self-developing (evolving) fuzzy rule-based classifier system proposed in this work, called AutoClass. An important property of AutoClass is that it can start learning "from scratch". Not only do the fuzzy rules not need to be prespecified, but neither does the number of classes (the number may grow, with new class labels being added by the on-line learning process), in a fully unsupervised manner. If an initial rule base exists, AutoClass can evolve/develop it further based on newly arriving faulty-state data. To validate our proposal, we present experimental results from a didactic level-control process, where control and error signals are used as features for the fault detection and identification systems; the approach is generic, however, and the number of features can be large because the methodology is computationally lean: covariance or more complex calculations, as well as storage of old data, are not required. The results obtained are significantly better than those of the traditional approaches used for comparison.
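A minimal sketch, under our reading of the abstract, of recursive density estimation with a Cauchy-type kernel: only a running mean and a running mean squared norm are stored, so no past samples or covariance matrices are needed.

```python
import numpy as np

class RDE:
    """Recursive density estimator (illustrative sketch)."""
    def __init__(self, dim):
        self.k = 0
        self.mean = np.zeros(dim)   # recursive mean of the samples
        self.msq = 0.0              # recursive mean of the squared norms

    def update(self, x):
        self.k += 1
        w = 1.0 / self.k
        self.mean = (1 - w) * self.mean + w * x
        self.msq = (1 - w) * self.msq + w * float(x @ x)
        # Cauchy-type density: close to 1 near the data cloud, small for
        # abnormal samples, so a drop below a threshold signals a fault.
        return 1.0 / (1.0 + float((x - self.mean) @ (x - self.mean))
                      + self.msq - float(self.mean @ self.mean))

rde = RDE(dim=2)  # e.g., features = (control signal, error signal)
for x in np.random.randn(100, 2):
    density = rde.update(x)
```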
Abstract:
Two-level factorial designs are widely used in industrial experimentation. However, a design with many factors requires a large number of runs, and replicating the treatments many times may not be feasible given limitations of resources and time, making the experiment expensive. In these cases, unreplicated designs are used. But with only one replicate there is no internal estimate of experimental error with which to judge the significance of the observed effects. One possible solution to this problem is to use normal plots or half-normal plots of the effects. Many experimenters use the normal plot, while others prefer the half-normal plot and, often, in both cases, without justification. The controversy about the use of these two graphical techniques motivates this work, since there is no record of a formal procedure or statistical test that indicates which one is best. The choice between the two plots seems to be a subjective issue. The central objective of this master's thesis is, then, to perform an experimental comparative study of the normal plot and half-normal plot in the context of the analysis of unreplicated 2^k factorial experiments. This study involves the construction of simulated scenarios, in which the performance of the plots in detecting significant effects and identifying outliers is evaluated in order to address the following questions: Can one plot be better than the other? In which situations? What information does one plot add to the analysis of the experiment that might complement that provided by the other? What are the restrictions on the use of these plots? In this way, this work intends to confront these two techniques, examining them simultaneously in order to identify similarities, differences or relationships that contribute to the construction of a theoretical reference to justify, or to aid in, the experimenter's decision about which of the two graphical techniques to use and the reason for this use. The simulation results show that the half-normal plot is better for assisting in the judgement of the effects, while the normal plot is recommended for detecting outliers in the data.
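For concreteness, a small illustrative sketch (synthetic data, not the thesis' simulations) of how the effects of an unreplicated 2^3 design are estimated and displayed on a half-normal plot; the normal plot is obtained analogously with stats.norm and the signed effects.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from itertools import combinations

y = np.array([60., 72., 54., 68., 52., 83., 45., 80.])   # 8 runs, standard order
runs = np.array([[1 if (i >> j) & 1 else -1 for j in range(3)]
                 for i in range(8)])                      # -1/+1 design matrix

names, effects = [], []
for r in range(1, 4):
    for combo in combinations(range(3), r):               # A, B, C, AB, ..., ABC
        sign = runs[:, list(combo)].prod(axis=1)
        names.append(''.join('ABC'[j] for j in combo))
        effects.append(y @ sign / (len(y) / 2))           # contrast / 2^(k-1)

abs_eff = np.abs(effects)
order = np.argsort(abs_eff)
q = stats.halfnorm.ppf((np.arange(1, len(order) + 1) - 0.5) / len(order))
plt.scatter(q, abs_eff[order])                            # half-normal plot
for qi, idx in zip(q, order):
    plt.annotate(names[idx], (qi, abs_eff[idx]))          # points off the near-zero line are active
plt.xlabel('half-normal quantiles'); plt.ylabel('|effect|')
plt.show()
```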
Abstract:
Statistical analysis of data is crucial in cephalometric investigations. There are certainly excellent examples of good statistical practice in the field, but some articles published worldwide have carried out inappropriate analyses. Objective: The purpose of this study was to show that when the double records of each patient are traced on the same occasion, a control chart for the differences between readings needs to be drawn, and limits of agreement and coefficients of repeatability must be calculated. Material and methods: Data from a well-known paper in Orthodontics were used to illustrate common statistical practices in cephalometric investigations and to propose a new technique of analysis. Results: A scatter plot of the two radiograph readings and of the two model readings, with the respective regression lines, is shown. A control chart for the mean of the differences between radiograph readings was also obtained, and a coefficient of repeatability was calculated. Conclusions: A standard error assuming that mean differences are zero, which is referred to in Orthodontics and Facial Orthopedics as the Dahlberg error, can be calculated only for estimating precision if accuracy is already proven. When double readings are collected, limits of agreement and coefficients of repeatability must be calculated. A graph with the differences between readings should be presented and outliers discussed.
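A short sketch (with made-up numbers) of the quantities the abstract calls for, given two repeated readings r1 and r2 of the same measure:

```python
import numpy as np

r1 = np.array([23.1, 25.4, 22.8, 27.0, 24.6])   # first readings (illustrative)
r2 = np.array([23.4, 25.1, 23.5, 26.6, 24.9])   # second readings
d = r1 - r2

# Dahlberg error: valid as a precision estimate only if mean(d) is ~0,
# i.e., accuracy has already been established.
dahlberg = np.sqrt(np.sum(d**2) / (2 * len(d)))

# Limits of agreement and one common definition of the coefficient of
# repeatability, both based on the distribution of the differences.
loa = (d.mean() - 1.96 * d.std(ddof=1), d.mean() + 1.96 * d.std(ddof=1))
repeatability = 1.96 * d.std(ddof=1)
```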
Abstract:
Linear mixed effects models are frequently used to analyse longitudinal data, due to their flexibility in modelling the covariance structure between and within observations. Further, it is easy to deal with unbalanced data, either with respect to the number of observations per subject or per time period, and with varying time intervals between observations. In most applications of mixed models to biological sciences, a normal distribution is assumed both for the random effects and for the residuals. This, however, makes inferences vulnerable to the presence of outliers. Here, linear mixed models employing thick-tailed distributions for robust inferences in longitudinal data analysis are described. Specific distributions discussed include the Student-t, the slash and the contaminated normal. A Bayesian framework is adopted, and the Gibbs sampler and the Metropolis-Hastings algorithms are used to carry out the posterior analyses. An example with data on orthodontic distance growth in children is discussed to illustrate the methodology. Analyses based on either the Student-t distribution or on the usual Gaussian assumption are contrasted. The thick-tailed distributions provide an appealing robust alternative to the Gaussian process for modelling distributions of the random effects and of residuals in linear mixed models, and the MCMC implementation allows the computations to be performed in a flexible manner.
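The robustness mechanism can be seen in the usual data-augmentation form of the Student-t model, sketched below (ours, not the paper's full Gibbs sampler): each residual carries a latent Gamma weight, and conditionally on the weights the model is Gaussian, so small weights down-weight outlying observations.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, sigma, n = 4.0, 1.0, 1000          # degrees of freedom, scale, sample size

lam = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)  # latent weights, mean 1
e = rng.normal(0.0, sigma / np.sqrt(lam))            # e | lam ~ N(0, sigma^2 / lam)
# Marginally, e follows a Student-t with nu degrees of freedom; in a Gibbs
# sampler the lam are sampled as an extra step, and observations drawing
# small lam contribute less to the conditional updates of the effects.
```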
Abstract:
Two Kalman-filter formulations are presented for the estimation of spacecraft sensor misalignments from in-flight data. In the first, the sensor misalignments are part of the filter state variable; in the second, which we call HYLIGN, the state vector contains only dynamical variables, but the sensitivities of the filter innovations to the misalignments are calculated within the Kalman filter. This procedure permits the misalignments to be estimated in batch mode and allows a much smaller dimension for the Kalman-filter state vector. The result is not only a significantly smaller computational burden but also a smaller sensitivity of the misalignment estimates to outliers in the data. Numerical simulations of the filter performance are presented.
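As background, a generic linear Kalman predict/update step (not the HYLIGN filter itself) in which the innovation appears; the second formulation additionally propagates the sensitivity of this innovation to the misalignment parameters:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter."""
    x_pred = F @ x                          # state prediction
    P_pred = F @ P @ F.T + Q                # covariance prediction
    innov = z - H @ x_pred                  # filter innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ innov
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, innov
```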
Abstract:
We propose alternative approaches to analyze residuals in binary regression models based on random effect components. Our preferred model does not depend on any tuning parameter, being completely automatic. Although the focus is mainly on the accommodation of outliers, the proposed methodology is also able to detect them. Our approach consists of evaluating the posterior distribution of the random effects included in the linear predictor. The evaluation of the posterior distributions of interest involves cumbersome integration, which is easily dealt with through stochastic simulation methods. We also discuss different specifications of prior distributions for the random effects. The potential of these strategies is compared on a real data set. The main finding is that the inclusion of extra variability accommodates the outliers, improving the fit of the model substantially, besides correctly indicating the possible outliers.
Abstract:
This study deals with some common problems in structural analysis when calculating the experimental semi-variogram and fitting a semi-variogram model. Geochemical data were used and the following cases were studied: regular versus irregular sampling grids, the presence of outlier values, skewed distributions due to high variability of the data, and estimation using a kriging procedure.
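A minimal sketch of the experimental semi-variogram computation discussed above (binning choices are ours; each lag bin is assumed to be populated):

```python
import numpy as np

def experimental_semivariogram(coords, values, lags, tol):
    """gamma(h) = 0.5 * mean[(z_i - z_j)^2] over pairs with |d_ij - h| < tol."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)          # count each pair once
    d, sq = d[iu], sq[iu]
    return np.array([0.5 * sq[np.abs(d - h) < tol].mean() for h in lags])

coords = np.random.rand(100, 2) * 1000.0   # irregular sampling grid
z = np.random.lognormal(size=100)          # skewed, geochemistry-like variable
gamma = experimental_semivariogram(coords, z, lags=np.arange(50, 501, 50), tol=25.0)
```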
Abstract:
Linear mixed effects models have been widely used in the analysis of data where responses are clustered around some random effects, so that it is not reasonable to assume independence between observations in the same cluster. In most biological applications, it is assumed that the distributions of the random effects and of the residuals are Gaussian. This makes inferences vulnerable to the presence of outliers. Here, linear mixed effects models with normal/independent residual distributions for robust inferences are described. Specific distributions examined include univariate and multivariate versions of the Student-t, the slash and the contaminated normal. A Bayesian framework is adopted and Markov chain Monte Carlo is used to carry out the posterior analysis. The procedures are illustrated using birth weight data on rats in a toxicological experiment. Results from the Gaussian and robust models are contrasted, and it is shown how the implementation can be used for outlier detection. The thick-tailed distributions provide an appealing robust alternative to the Gaussian process in linear mixed models, and they are easily implemented using data augmentation and MCMC techniques.
Abstract:
Forest roads are frequently pointed to as a source of environmental problems related to erosion, and they also influence harvest costs through maintenance operations. Poorly designed roads are sources of hydrological problems in catchments, and the current attention to the sustainability of forest exploitation projects points to the need for diagnostic tools to guide the redesign of the road system. In this study, runoff hydrological indicators for forest road segments were assessed in order to identify critical points of erosion and water concentration on soils. The road network of a forest production area was divided into 252 road segments that were used as observations of four variables: mean terrain slope, main segment slope, LS factor and topographic index. The data analysis was based on descriptive statistics for outlier identification, principal component analysis for the study of variability between variables and between observations, and cluster analysis for the identification of groups of similar segments. The results allowed the road segments to be classified into five main road types: on the ridge, in the valley, on the slopes, on the slopes but along a contour line, and on the steepest slope. The indicators were able to highlight the most critical segments, which differ from the others and are potential sources of erosion and water-accumulation problems on forest roads. The principal component analysis showed two main sources of variability, related to terrain topographic characteristics and to road design, showing that the indicators represent these elements well. The methodology appears appropriate for the identification of critical road segments that need to be redesigned and also for road network planning in new forest exploitation projects.
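A sketch of the analysis pipeline described above, assuming the four indicators are the columns of a segments-by-variables matrix; scikit-learn's PCA and k-means stand in for the paper's principal component and cluster analyses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# columns: mean terrain slope, main segment slope, LS factor, topographic index
X = np.random.rand(252, 4)                 # synthetic stand-in for the 252 segments

Z = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(Z)                 # two main variability sources
labels = KMeans(n_clusters=5, n_init=10).fit_predict(scores)  # five road types

# Descriptive outlier screen on the raw indicators (Tukey fences):
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
is_outlier = ((X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)).any(axis=1)
```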
Abstract:
In this work, a new approach for supervised pattern recognition is presented which improves the learning algorithm of the Optimum-Path Forest (OPF) classifier, centered on the detection and elimination of outliers in the training set. Identification of outliers is based on a penalty computed for each sample in the training set from the number of false positive and false negative classifications imputable to it. This approach enhances the accuracy of OPF while still gaining in classification time, at the expense of a slight increase in training time. © 2010 Springer-Verlag.