909 results for Data accuracy
Abstract:
The research considers the problem of spatial data classification using machine learning algorithms: probabilistic neural networks (PNN) and support vector machines (SVM). The simple k-nearest neighbor algorithm is considered as a benchmark model. PNN is a neural network reformulation of the well-known nonparametric principles of probability density modeling using a kernel density estimator and Bayesian optimal or maximum a posteriori decision rules. PNN are well suited to problems where not only predictions but also quantification of accuracy and integration of prior information are necessary. An important property of PNN is that they can easily be used in decision support systems dealing with problems of automatic classification. The support vector machine is an implementation of the principles of statistical learning theory for classification tasks. Recently, SVM have been successfully applied to a variety of environmental topics: classification of soil types and hydro-geological units, optimization of monitoring networks, and susceptibility mapping of natural hazards. In the present paper, both simulated and real data case studies (low and high dimensional) are considered. The main attention is paid to the detection and learning of spatial patterns by the applied algorithms.
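As an illustration of the PNN principle described above, the sketch below implements a kernel-density PNN with a maximum a posteriori decision rule on synthetic two-class data; the bandwidth value and the data are assumptions, not taken from the paper.

```python
# Minimal sketch of a probabilistic neural network (PNN) classifier:
# a Gaussian kernel density estimate per class combined with a
# maximum a posteriori decision rule. Illustrative only; the bandwidth
# `sigma` and the synthetic data are assumptions, not from the paper.
import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma=0.5):
    classes = np.unique(y_train)
    priors = np.array([np.mean(y_train == c) for c in classes])
    posteriors = []
    for c, prior in zip(classes, priors):
        Xc = X_train[y_train == c]
        # squared Euclidean distances between test points and class-c patterns
        d2 = ((X_test[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
        # class-conditional density via an isotropic Gaussian kernel estimate
        likelihood = np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=1)
        posteriors.append(prior * likelihood)
    return classes[np.argmax(np.column_stack(posteriors), axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(pnn_predict(X, y, np.array([[0.0, 0.0], [3.0, 3.0]])))
```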
Abstract:
In many practical applications the state of field soils is monitored by recording the evolution of temperature and soil moisture at discrete depths. We theoretically investigate the systematic errors that arise when mass and energy balances are computed directly from these measurements. We show that, even with no measurement or model errors, large residuals might result when finite difference approximations are used to compute fluxes and the storage term. To calculate the limits set by the use of spatially discrete measurements on the accuracy of balance closure, we derive an analytical solution to estimate the residual on the basis of two key parameters: the penetration depth and the distance between the measurements. When the thickness of the control layer for which the balance is computed is comparable to the penetration depth of the forcing (which depends on the thermal diffusivity and on the forcing period), large residuals arise. The residual is also very sensitive to the distance between the measurements, which requires accurately controlling the position of the sensors in field experiments. We also demonstrate that, for the same experimental setup, mass residuals are considerably larger than energy residuals due to the nonlinearity of the moisture transport equation. Our analysis suggests that a careful assessment of the systematic mass error introduced by the use of spatially discrete data is required before using fluxes and residuals computed directly from field measurements.
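A minimal numerical sketch of the effect discussed above, with assumed illustrative parameter values rather than the paper's analytical solution: the layer-integrated temperature (proportional to heat storage) is computed from spatially discrete "sensors" and compared with the near-exact value for a periodic forcing, showing how the residual grows when the sensor spacing approaches the penetration depth.

```python
# Illustrative sketch (not the paper's derivation): systematic error from
# computing the storage term of a 0-L soil layer with spatially discrete
# temperature sensors. Analytic periodic solution for a semi-infinite medium:
#   T(z, t) = A * exp(-z/d) * sin(w*t - z/d),  d = sqrt(2*kappa/w)
# Parameter values below are assumptions chosen for illustration only.
import numpy as np

A = 10.0                  # forcing amplitude (K)
period = 86400.0          # forcing period: one day (s)
kappa = 5e-7              # thermal diffusivity (m^2/s)
w = 2 * np.pi / period
d = np.sqrt(2 * kappa / w)          # penetration depth (m)
L = 0.5                             # thickness of the control layer (m)

def T(z, t):
    return A * np.exp(-z / d) * np.sin(w * t - z / d)

t = np.linspace(0, period, 2000)
z_fine = np.linspace(0, L, 2000)
# near-exact layer-integrated temperature (proportional to heat storage)
exact = np.trapz(T(z_fine[:, None], t[None, :]), z_fine, axis=0)

for n_sensors in (2, 3, 5, 9):
    z_s = np.linspace(0, L, n_sensors)
    approx = np.trapz(T(z_s[:, None], t[None, :]), z_s, axis=0)
    res = np.max(np.abs(approx - exact)) / np.max(np.abs(exact))
    print(f"{n_sensors} sensors, spacing {L/(n_sensors-1):.2f} m "
          f"(penetration depth {d:.2f} m): relative residual {res:.1%}")
```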
Abstract:
1. Identifying the boundary of a species' niche from observational and environmental data is a common problem in ecology and conservation biology, and a variety of techniques have been developed or applied to model niches and predict distributions. Here, we examine the performance of some pattern-recognition methods as ecological niche models (ENMs). In particular, one-class pattern recognition is a flexible and seldom-used methodology for modelling ecological niches and distributions from presence-only data. The development of one-class methods that perform comparably to two-class methods (for presence/absence data) would remove modelling decisions about sampling pseudo-absences or background data points when absence points are unavailable. 2. We studied nine methods for one-class classification and seven methods for two-class classification (five common to both), all primarily used in pattern recognition and therefore not common in species distribution and ecological niche modelling, across a set of 106 mountain plant species for which presence-absence data were available. We assessed accuracy using standard metrics and compared trade-offs in omission and commission errors between classification groups, as well as effects of prevalence and spatial autocorrelation on accuracy. 3. One-class models fit to presence-only data were comparable to two-class models fit to presence-absence data when performance was evaluated with a measure weighting omission and commission errors equally. One-class models were superior for reducing omission errors (i.e. yielding higher sensitivity), and two-class models were superior for reducing commission errors (i.e. yielding higher specificity). For these methods, spatial autocorrelation was only influential when prevalence was low. 4. These results differ from previous efforts to evaluate alternative modelling approaches to build ENMs and are particularly noteworthy because data are from exhaustively sampled populations, minimizing false absence records. Accurate, transferable models of species' ecological niches and distributions are needed to advance ecological research and are crucial for effective environmental planning and conservation; the pattern-recognition approaches studied here show good potential for future modelling studies. This study also provides an introduction to promising methods for ecological modelling inherited from the pattern-recognition discipline.
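A small sketch of the one-class versus two-class comparison on synthetic presence-absence data; the specific algorithms used here (scikit-learn's OneClassSVM and logistic regression) are stand-ins for the nine one-class and seven two-class methods evaluated in the study.

```python
# Minimal sketch contrasting a one-class model fit to presences only with a
# two-class model fit to presences and absences; metric names follow the
# abstract (sensitivity = 1 - omission, specificity = 1 - commission).
# Synthetic data; the chosen algorithms stand in for the larger method sets.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
presence = rng.normal([2, 2], 0.7, (200, 2))     # environmental conditions at presences
absence = rng.normal([0, 0], 0.7, (200, 2))      # conditions at true absences
X = np.vstack([presence, absence])
y = np.array([1] * 200 + [0] * 200)

# one-class model: trained on presence data only
oc = OneClassSVM(nu=0.1, gamma="scale").fit(presence)
pred_oc = (oc.predict(X) == 1).astype(int)

# two-class model: trained on presence-absence data
tc = LogisticRegression().fit(X, y)
pred_tc = tc.predict(X)

for name, pred in (("one-class", pred_oc), ("two-class", pred_tc)):
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    print(f"{name}: sensitivity={tp/(tp+fn):.2f} specificity={tn/(tn+fp):.2f}")
```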
Abstract:
With the trend in molecular epidemiology towards both genome-wide association studies and complex modelling, the need for large sample sizes to detect small effects and to allow for the estimation of many parameters within a model continues to increase. Unfortunately, most methods of association analysis have been restricted to either a family-based or a case-control design, resulting in the lack of synthesis of data from multiple studies. Transmission disequilibrium-type methods for detecting linkage disequilibrium from family data were developed as an effective way of preventing the detection of association due to population stratification. Because these methods condition on parental genotype, however, they have precluded the joint analysis of family and case-control data, although methods for case-control data may not protect against population stratification and do not allow for familial correlations. We present here an extension of a family-based association analysis method for continuous traits that will simultaneously test for, and if necessary control for, population stratification. We further extend this method to analyse binary traits (and therefore family and case-control data together) and to accurately estimate genetic effects in the population, even when using an ascertained family sample. Finally, we present the power of this binary extension for both family-only and joint family and case-control data, and demonstrate the accuracy of the association parameter and variance components in an ascertained family sample.
Abstract:
The uncertainties inherent to experimental differential scanning calorimetric data are evaluated. A new procedure is developed to perform the kinetic analysis of continuous-heating calorimetric data when the heat capacity of the sample changes during crystallization. The accuracy of isothermal calorimetric data is analyzed in terms of the peak-to-peak noise of the calorimetric signal and the baseline drift typical of differential scanning calorimetry equipment. Their influence on the evaluation of the kinetic parameters is discussed. An empirical construction of the time-temperature-transformation and temperature-heating-rate-transformation diagrams, based on the kinetic parameters, is presented. The method is applied to the kinetic study of the primary crystallization of Te in an amorphous alloy of nominal composition Ga20Te80, obtained by rapid solidification.
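As a hedged illustration of how a transformation diagram can be constructed from kinetic parameters, the sketch below assumes isothermal JMAK (Avrami) kinetics with invented parameter values; it is not the paper's procedure or its fitted parameters.

```python
# Hedged sketch of constructing a time-temperature-transformation (TTT)
# curve from kinetic parameters, assuming isothermal JMAK (Avrami) kinetics
#   x(t) = 1 - exp(-(K(T)*t)^n),  K(T) = K0 * exp(-E / (R*T)).
# K0, E and n are illustrative values, not those determined in the paper.
import numpy as np

R = 8.314          # J mol^-1 K^-1
K0 = 1.0e15        # pre-exponential factor (s^-1), assumed
E = 2.0e5          # apparent activation energy (J mol^-1), assumed
n = 3.0            # Avrami exponent, assumed

def time_to_fraction(T, x):
    """Isothermal time needed to reach transformed fraction x at temperature T."""
    K = K0 * np.exp(-E / (R * T))
    return (-np.log(1.0 - x)) ** (1.0 / n) / K

for T in np.arange(500.0, 651.0, 25.0):       # temperatures in K
    t_onset = time_to_fraction(T, 0.01)       # 1 % transformed: onset of crystallization
    t_end = time_to_fraction(T, 0.99)         # 99 % transformed
    print(f"T = {T:5.0f} K  onset = {t_onset:9.3g} s  end = {t_end:9.3g} s")
```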
Abstract:
Digital information generates the possibility of a high degree of redundancy in the data available for fitting predictive models used for Digital Soil Mapping (DSM). Among these models, the Decision Tree (DT) technique has been increasingly applied due to its capacity to deal with large datasets. The purpose of this study was to evaluate the impact of the data volume used to generate the DT models on the quality of soil maps. An area of 889.33 km² was chosen in the northern region of the State of Rio Grande do Sul. The soil-landscape relationship was obtained from field reconnaissance (reambulation) of the study area and the delineation of the map units on the 1:50,000-scale topographic mapping. Six predictive covariates linked to the soil formation factors relief and organisms, together with data sets of 1, 3, 5, 10, 15, 20 and 25 % of the total data volume, were used to generate the predictive DT models in the data mining program Waikato Environment for Knowledge Analysis (WEKA). In this study, sample densities below 5 % resulted in models with a lower capacity to capture the complexity of the spatial distribution of the soils in the study area. The best trade-off between the data volume to be handled and the predictive capacity of the models was obtained for samples between 5 and 15 %. For the models based on these sample densities, the collected field data indicated a predictive mapping accuracy close to 70 %.
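An illustrative sketch of the sampling-density experiment, using scikit-learn's decision tree as a stand-in for the WEKA models and synthetic covariates in place of the real terrain data.

```python
# Illustrative sketch (scikit-learn stands in for WEKA) of the experiment
# design: train decision-tree models on increasing fractions of the available
# data and track how map accuracy responds. The synthetic covariates and
# labels are placeholders for the six covariates used in the study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=20000, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for frac in (0.01, 0.03, 0.05, 0.10, 0.15, 0.20, 0.25):
    n = int(frac * len(X_pool))
    model = DecisionTreeClassifier(random_state=0).fit(X_pool[:n], y_pool[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"sample density {frac:4.0%}: {n:5d} training cells, accuracy {acc:.2f}")
```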
Abstract:
Field-based soil moisture measurements are cumbersome. Remote sensing techniques are therefore needed, because they allow field- and landscape-scale mapping of soil moisture, depth-averaged through the root zone of the existing vegetation. The objective of the study was to evaluate the accuracy of an empirical relationship to calculate soil moisture from remote sensing data for irrigated soils of the Apodi Plateau, in the Brazilian semiarid region. The empirical relationship had previously been tested for irrigated soils in Mexico, Egypt, and Pakistan, with promising results. In this study, the relationship was evaluated with experimental data collected from a cotton field. The experiment was carried out in an area of 5 ha with irrigated cotton. The energy balance and evaporative fraction (Λ) were measured by the Bowen ratio method. Soil moisture (θ) data were collected using a PR2 Profile Probe (Delta-T Devices Ltd). The empirical relationship was tested using the experimentally collected Λ and θ values and was then applied using Λ values obtained from the Surface Energy Balance Algorithm for Land (SEBAL) and three Landsat 5 TM images. There was a close correlation between measured and estimated θ values (p<0.05, R² = 0.84) and there were no significant differences according to Student's t-test (p<0.01). The statistical analyses showed that the empirical relationship can be applied to estimate the root-zone soil moisture of irrigated soils, i.e. when the evaporative fraction is greater than 0.45.
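The sketch below shows, under stated assumptions, how such an evaluation can be reproduced: soil moisture is estimated from the evaporative fraction with an exponential relation of the general type used in the SEBAL literature (the coefficients are invented, not the paper's fitted values), and agreement with measured values is checked with a regression R² and a paired t-test.

```python
# Hedged sketch of how such an evaluation can be run: estimate root-zone soil
# moisture from the evaporative fraction with an exponential relation of the
# general type used in the SEBAL literature (the functional form and the
# coefficients below are assumptions, not the paper's fitted values), then
# compare estimated and measured values with R^2 and a paired t-test.
import numpy as np
from scipy import stats

theta_sat = 0.42          # assumed saturated water content (cm3/cm3)
a = 0.42                  # assumed shape coefficient of the Lambda-theta relation

def theta_from_lambda(evap_fraction):
    return theta_sat * np.exp((evap_fraction - 1.0) / a)

rng = np.random.default_rng(2)
lam_measured = rng.uniform(0.5, 0.95, 40)                                    # Bowen-ratio evaporative fractions
theta_measured = theta_from_lambda(lam_measured) + rng.normal(0, 0.02, 40)   # synthetic probe readings
theta_estimated = theta_from_lambda(lam_measured)

slope, intercept, r, p_reg, _ = stats.linregress(theta_measured, theta_estimated)
t_stat, p_ttest = stats.ttest_rel(theta_measured, theta_estimated)
print(f"R^2 = {r**2:.2f} (regression p = {p_reg:.3g}), paired t-test p = {p_ttest:.3g}")
```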
Abstract:
Microstructure imaging from diffusion magnetic resonance (MR) data represents an invaluable tool to study non-invasively the morphology of tissues and to provide biological insight into their microstructural organization. In recent years, a variety of biophysical models have been proposed to associate particular patterns observed in the measured signal with specific microstructural properties of the neuronal tissue, such as axon diameter and fiber density. Despite very appealing results showing that the estimated microstructure indices agree very well with histological examinations, existing techniques require computationally very expensive non-linear procedures to fit the models to the data, which, in practice, demand the use of powerful computer clusters for large-scale applications. In this work, we present a general framework for Accelerated Microstructure Imaging via Convex Optimization (AMICO) and show how to re-formulate this class of techniques as convenient linear systems which can then be efficiently solved using very fast algorithms. We demonstrate this linearization of the fitting problem for two specific models, i.e. ActiveAx and NODDI, providing a very attractive alternative for parameter estimation in those techniques; however, the AMICO framework is general and flexible enough to work also for the wider space of microstructure imaging methods. Results demonstrate that AMICO represents an effective means to accelerate the fit of existing techniques drastically (up to four orders of magnitude faster) while preserving accuracy and precision in the estimated model parameters (correlation above 0.9). We believe that the availability of such ultrafast algorithms will help to accelerate the spread of microstructure imaging to larger cohorts of patients and to study a wider spectrum of neurological disorders.
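A minimal sketch of the kind of linearization AMICO performs: the per-voxel fit becomes a non-negative least-squares problem over a dictionary of precomputed kernel responses. The dictionary below is synthetic, not the ActiveAx or NODDI response functions.

```python
# Minimal sketch of the linearized microstructure fit: the measured signal is
# modelled as a non-negative combination of pre-computed response kernels
# (a dictionary), so each voxel reduces to a fast non-negative least-squares
# problem. The dictionary and signal are synthetic placeholders.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)
n_meas, n_atoms = 90, 30
Phi = np.abs(rng.normal(size=(n_meas, n_atoms)))    # dictionary of kernel responses
x_true = np.zeros(n_atoms)
x_true[[4, 17]] = [0.7, 0.3]                        # two active compartments
y = Phi @ x_true + rng.normal(0, 0.01, n_meas)      # noisy "voxel" signal

x_hat, resid = nnls(Phi, y)                         # fast convex (linear) fit
print("recovered weights:", np.round(x_hat[[4, 17]], 2), "residual:", round(resid, 3))
```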
Abstract:
The temporal dynamics of species diversity are shaped by variations in the rates of speciation and extinction, and there is a long history of inferring these rates using first and last appearances of taxa in the fossil record. Understanding diversity dynamics critically depends on unbiased estimates of the unobserved times of speciation and extinction for all lineages, but the inference of these parameters is challenging due to the complex nature of the available data. Here, we present a new probabilistic framework to jointly estimate species-specific times of speciation and extinction and the rates of the underlying birth-death process based on the fossil record. The rates are allowed to vary through time independently of each other, and the probability of preservation and sampling is explicitly incorporated in the model to estimate the true lifespan of each lineage. We implement a Bayesian algorithm to assess the presence of rate shifts by exploring alternative diversification models. Tests on a range of simulated data sets reveal the accuracy and robustness of our approach against violations of the underlying assumptions and various degrees of data incompleteness. Finally, we demonstrate the application of our method with the diversification of the mammal family Rhinocerotidae and reveal a complex history of repeated and independent temporal shifts of both speciation and extinction rates, leading to the expansion and subsequent decline of the group. The estimated parameters of the birth-death process implemented here are directly comparable with those obtained from dated molecular phylogenies. Thus, our model represents a step towards integrating phylogenetic and fossil information to infer macroevolutionary processes.
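One ingredient of such a framework, sketched under simplifying assumptions: a homogeneous Poisson preservation process links an (unobserved) lifespan to the number of fossil occurrences, conditioned on the lineage being sampled at all. The rates and counts below are illustrative, not estimates from the Rhinocerotidae data.

```python
# Simplified sketch of a preservation model: under a homogeneous Poisson
# preservation process with rate q, the number of fossil occurrences of a
# lineage with speciation time ts and extinction time te is Poisson with mean
# q*(ts - te), and the probability of leaving any fossils at all is
# 1 - exp(-q*(ts - te)). All values below are assumptions for illustration.
import numpy as np

def preservation_loglik(n_occurrences, ts, te, q):
    """Log-likelihood of the occurrence count, conditioned on being sampled."""
    rate = q * (ts - te)
    # Poisson log-pmf (dropping the factorial constant), conditioned on >= 1 find
    return n_occurrences * np.log(rate) - rate - np.log1p(-np.exp(-rate))

ts, te = 23.0, 11.0          # hypothetical lifespan in Ma
for q in (0.1, 0.5, 1.0, 2.0):
    print(f"q = {q:.1f} per Myr: log-likelihood of 8 occurrences = "
          f"{preservation_loglik(8, ts, te, q):.2f}")
```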
Abstract:
Aim: Climatic niche modelling of species and community distributions implicitly assumes strong and constant climatic determinism across geographic space. This assumption has, however, never been tested. We tested it by assessing how stacked species distribution models (S-SDMs) perform for predicting plant species assemblages along elevation. Location: Western Swiss Alps. Methods: Using robust presence-absence data, we first assessed the ability of topo-climatic S-SDMs to predict plant assemblages in a study area encompassing a 2800 m wide elevation gradient. We then assessed the relationships among several evaluation metrics and trait-based tests of community assembly rules. Results: The standard errors of individual SDMs decreased significantly towards higher elevations. Overall, the S-SDMs overpredicted richness far more than they underpredicted it and could not reproduce the humpback curve along elevation. Overprediction was greater at low and mid-range elevations in absolute values but greater at high elevations when standardised by the actual richness. Regarding species composition, the evaluation metrics accounting for both the presence and absence of species (overall prediction success and kappa) or focusing on correctly predicted absences (specificity) increased with increasing elevation, while the metrics focusing on correctly predicted presences (Jaccard index and sensitivity) decreased. The best overall evaluation - as driven by specificity - occurred at high elevation, where species assemblages were shown to be under significant environmental filtering of small plants. In contrast, the decreased overall accuracy in the lowlands was associated with functional patterns representing any type of assembly rule (environmental filtering, limiting similarity or null assembly). Main Conclusions: Our study reveals interesting patterns of change in S-SDM errors with changes in assembly rules along elevation. Yet, significant levels of assemblage prediction error occurred throughout the gradient, calling for further improvement of SDMs, e.g. by adding key environmental filters that act at fine scales and developing approaches to account for variations in the influence of predictors along environmental gradients.
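A minimal sketch of stacking binary SDM predictions into assemblage predictions and scoring them with the metrics named above (sensitivity, specificity, Jaccard index); the presence-absence matrices are random placeholders.

```python
# Minimal sketch of stacking binary species distribution models (S-SDM) and
# evaluating the assemblage prediction at each site: predicted richness is the
# sum of the per-species binary predictions, and composition is scored with
# sensitivity, specificity and the Jaccard index. Data are random placeholders.
import numpy as np

rng = np.random.default_rng(4)
n_sites, n_species = 100, 50
observed = rng.random((n_sites, n_species)) < 0.3          # presence-absence matrix
predicted = rng.random((n_sites, n_species)) < 0.35        # stacked binary SDM output

richness_obs = observed.sum(axis=1)
richness_pred = predicted.sum(axis=1)
print("mean richness overprediction:", (richness_pred - richness_obs).mean())

tp = (observed & predicted).sum(axis=1)
fn = (observed & ~predicted).sum(axis=1)
fp = (~observed & predicted).sum(axis=1)
tn = (~observed & ~predicted).sum(axis=1)
sensitivity = tp / (tp + fn)            # correctly predicted presences
specificity = tn / (tn + fp)            # correctly predicted absences
jaccard = tp / (tp + fp + fn)           # composition overlap
print(f"sensitivity {sensitivity.mean():.2f}  "
      f"specificity {specificity.mean():.2f}  jaccard {jaccard.mean():.2f}")
```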
Abstract:
Predictive groundwater modeling requires accurate information about aquifer characteristics. Geophysical imaging is a powerful tool for delineating aquifer properties at an appropriate scale and resolution, but it suffers from problems of ambiguity. One way to overcome such limitations is to adopt a simultaneous multitechnique inversion strategy. We have developed a methodology for aquifer characterization based on structural joint inversion of multiple geophysical data sets, followed by clustering to form zones and subsequent inversion for zonal parameters. Joint inversions based on cross-gradient structural constraints require less restrictive assumptions than, say, applying predefined petrophysical relationships, and generally yield superior results. This approach has, for the first time, been applied to three geophysical data types in three dimensions. A classification scheme using maximum likelihood estimation is used to determine the parameters of a Gaussian mixture model that defines zonal geometries from joint-inversion tomograms. The resulting zones are used to estimate representative geophysical parameters of each zone, which are then used for field-scale petrophysical analysis. A synthetic study demonstrated how joint inversion of seismic and radar traveltimes and electrical resistance tomography (ERT) data greatly reduces the misclassification of zones (down from 21.3% to 3.7%) and improves the accuracy of the retrieved zonal parameters (error reduced from 1.8% to 0.3%) compared to individual inversions. We applied our scheme to a data set collected in northeastern Switzerland to delineate lithologic subunits within a gravel aquifer. The inversion models resolve three principal subhorizontal units along with some important 3D heterogeneity. Petrophysical analysis of the zonal parameters indicated approximately 30% variation in porosity within the gravel aquifer and an increasing fraction of finer sediments with depth.
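A small sketch of the zonation step described above: co-located parameters from several tomograms are clustered with a maximum-likelihood Gaussian mixture model and zonal parameters are extracted; the three-attribute feature matrix is synthetic.

```python
# Illustrative sketch of the zonation step: cluster co-located parameters from
# several (joint-inversion) tomograms with a Gaussian mixture model fitted by
# maximum likelihood, then assign each cell to a zone. The feature columns
# (e.g. seismic slowness, radar slowness, log resistivity) are synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
zone_means = np.array([[1.0, 0.8, 2.0], [1.4, 1.1, 1.5], [0.7, 0.6, 2.6]])
cells = np.vstack([rng.normal(m, 0.05, (500, 3)) for m in zone_means])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(cells)
labels = gmm.predict(cells)                    # zone index per model cell
zonal_params = np.array([cells[labels == k].mean(axis=0) for k in range(3)])
print("estimated zonal parameters:\n", np.round(zonal_params, 2))
```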
Abstract:
The paper presents a novel method for monitoring network optimisation, based on a recent machine learning technique known as the support vector machine. It is problem-oriented in the sense that it directly answers the question of whether a suggested spatial location is important for the classification model. The method can be used to increase the accuracy of classification models by taking a small number of additional measurements. Traditionally, network optimisation is performed by analysing kriging variances. A comparison of the method with the traditional approach is presented for a real case study with climate data.
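A hedged sketch of the underlying idea: after fitting an SVM to the existing monitoring data, candidate locations whose decision-function values are closest to zero lie near the class boundary and are proposed for additional measurements. The data, kernel and number of proposed stations are assumptions, not the paper's setup.

```python
# Hedged sketch: fit an SVM classifier to existing monitoring data, then rank
# candidate locations by how close they lie to the decision boundary; the most
# ambiguous locations are proposed as additional measurement sites.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X_obs = rng.uniform(0, 10, (150, 2))                 # existing station coordinates
y_obs = (X_obs[:, 0] + X_obs[:, 1] + rng.normal(0, 1, 150) > 10).astype(int)

svm = SVC(kernel="rbf", gamma="scale").fit(X_obs, y_obs)

candidates = rng.uniform(0, 10, (1000, 2))           # possible new station locations
margin = np.abs(svm.decision_function(candidates))
proposed = candidates[np.argsort(margin)[:5]]        # 5 most ambiguous locations
print("proposed additional measurement locations:\n", np.round(proposed, 2))
```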
Abstract:
Purpose: To assess the global cardiovascular (CV) risk of an individual, several scores have been developed. However, their accuracy and comparability need to be evaluated in populations other than those from which they were derived. The aim of this study was to compare the predictive accuracy of 4 CV risk scores using data from a large population-based cohort. Methods: Prospective cohort study including 4980 participants (2698 women, mean age ± SD: 52.7 ± 10.8 years) in Lausanne, Switzerland, followed for an average of 5.5 years (range 0.2-8.5). Two end points were assessed: 1) coronary heart disease (CHD), and 2) CV disease (CVD). Four risk scores were compared: the original and recalibrated Framingham coronary heart disease scores (1998 and 2001); the original PROCAM score (2002) and its recalibrated version for Switzerland (IAS-AGLA); and the Reynolds risk score. Discrimination was assessed using Harrell's C statistic, model fit using Akaike's information criterion (AIC) and calibration using a pseudo Hosmer-Lemeshow test. The sensitivity, specificity and corresponding 95% confidence intervals were assessed for each risk score using the highest risk category (>=20% risk at 10 years) as the "positive" test. Results: The recalibrated and original 1998 and original 2001 Framingham scores show better discrimination (>0.720) and model fit (lower AIC) for CHD and CVD. All 4 scores are correctly calibrated (chi-square < 20). The recalibrated Framingham 1998 score has the best sensitivities, 37.8% and 40.4%, for CHD and CVD, respectively. All scores present specificities >90%. The Framingham 1998, PROCAM and IAS-AGLA scores include the largest numbers of subjects (>200) in the high-risk category, whereas the recalibrated Framingham 2001 and Reynolds scores include <=44 subjects. Conclusion: In this cohort, accuracy varies between risk scores, with the original Framingham 2001 score demonstrating the best compromise between accuracy and a limited selection of subjects in the highest risk category. We advocate that national guidelines, based on independently validated data, take into account calibrated CV risk scores for their respective countries.
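A minimal sketch of the threshold-based part of this evaluation: predicted 10-year risks of 20% or more are treated as a "positive" test and sensitivity and specificity are computed against observed events; all numbers are synthetic.

```python
# Minimal sketch of the threshold-based evaluation: treat a predicted 10-year
# risk of 20 % or more as a "positive" test and compute sensitivity and
# specificity against observed events. Numbers are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(7)
risk_pred = rng.beta(2, 8, 5000)                 # predicted 10-year risk per participant
event = rng.random(5000) < risk_pred             # observed CHD/CVD event during follow-up

positive = risk_pred >= 0.20
sensitivity = (positive & event).sum() / event.sum()
specificity = (~positive & ~event).sum() / (~event).sum()
print(f"sensitivity {sensitivity:.1%}  specificity {specificity:.1%}")
```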
Abstract:
Surveys of economic climate collect the opinions of managers about the short-term future evolution of their business. Interviews are carried out on a regular basis, and responses measure optimistic, neutral or pessimistic views about the economic perspectives. We propose a method to evaluate the sampling error of the average opinion derived from a particular type of survey data. Our variance estimate is useful to interpret historical trends and to decide whether changes in the index from one period to another are due to a structural change or whether ups and downs can be attributed to sampling randomness. An illustration using real data from a survey of business managers' opinions is discussed.
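A hedged sketch of a sampling-error calculation of this kind, assuming the index is the balance of opinions (share optimistic minus share pessimistic) under multinomial sampling; the responses and the exact form of the index are assumptions, not the paper's estimator.

```python
# Hedged sketch of a sampling-error calculation for an opinion balance index:
# under multinomial sampling, the balance (share of optimistic minus share of
# pessimistic answers) has variance [p+ + p- - (p+ - p-)^2] / n, which gives a
# confidence band for period-to-period changes. Responses are synthetic and the
# balance statistic is an assumption about how the index is built.
import numpy as np

rng = np.random.default_rng(8)
answers = rng.choice(["optimistic", "neutral", "pessimistic"], size=600,
                     p=[0.35, 0.45, 0.20])
n = len(answers)
p_plus = np.mean(answers == "optimistic")
p_minus = np.mean(answers == "pessimistic")
balance = p_plus - p_minus
var_balance = (p_plus + p_minus - balance ** 2) / n
half_width = 1.96 * np.sqrt(var_balance)
print(f"balance = {balance:.3f} +/- {half_width:.3f} (95% CI)")
```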
Abstract:
Although cross-sectional diffusion tensor imaging (DTI) studies have revealed significant white matter changes in mild cognitive impairment (MCI), the utility of this technique in predicting further cognitive decline is debated. Thirty-five healthy controls (HC) and 67 MCI subjects with DTI baseline data were neuropsychologically assessed at one year. Among them, 40 were stable (sMCI; 9 single-domain amnestic, 7 single-domain frontal, 24 multiple-domain) and 27 were progressive (pMCI; 7 single-domain amnestic, 4 single-domain frontal, 16 multiple-domain). Fractional anisotropy (FA) and longitudinal, radial, and mean diffusivity were measured using Tract-Based Spatial Statistics. Statistical analyses included group comparisons and individual classification of MCI cases using support vector machines (SVM). FA was significantly higher in HC compared to MCI in a distributed network including the ventral part of the corpus callosum and right temporal and frontal pathways. There were no significant group-level differences between sMCI and pMCI or between MCI subtypes after correction for multiple comparisons. However, SVM analysis allowed for individual classification with accuracies up to 91.4% (HC versus MCI) and 98.4% (sMCI versus pMCI). When considering the MCI subgroups separately, the minimum SVM classification accuracy for stable versus progressive cognitive decline was 97.5%, in the multiple-domain MCI group. SVM analysis of DTI data provided highly accurate individual classification of stable versus progressive MCI regardless of MCI subtype, indicating that this method may become an easily applicable tool for early individual detection of MCI subjects evolving to dementia.
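A minimal sketch of the classification step: a linear SVM with cross-validation applied to FA features; the feature matrix is a random placeholder for the TBSS-derived data and the injected group difference is artificial.

```python
# Minimal sketch of the individual classification step: a support vector
# machine applied to voxel-wise FA features with cross-validated accuracy.
# Features and labels are random placeholders for the TBSS-derived DTI data.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
n_subjects, n_voxels = 67, 500
X = rng.normal(size=(n_subjects, n_voxels))           # FA values on the white-matter skeleton
y = np.array([0] * 40 + [1] * 27)                     # 0 = stable MCI, 1 = progressive MCI
X[y == 1] += 0.3                                      # inject an artificial group difference

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.1%} +/- {scores.std():.1%}")
```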