19 results for inference by Kriging
in CentAUR: Central Archive at the University of Reading - UK
Abstract:
The rate at which a given site in a gene sequence alignment evolves over time may vary. This phenomenon-known as heterotachy-can bias or distort phylogenetic trees inferred from models of sequence evolution that assume rates of evolution are constant. Here, we describe a phylogenetic mixture model designed to accommodate heterotachy. The method sums the likelihood of the data at each site over more than one set of branch lengths on the same tree topology. A branch-length set that is best for one site may differ from the branch-length set that is best for some other site, thereby allowing different sites to have different rates of change throughout the tree. Because rate variation may not be present in all branches, we use a reversible-jump Markov chain Monte Carlo algorithm to identify those branches in which reliable amounts of heterotachy occur. We implement the method in combination with our 'pattern-heterogeneity' mixture model, applying it to simulated data and five published datasets. We find that complex evolutionary signals of heterotachy are routinely present over and above variation in the rate or pattern of evolution across sites, that the reversible-jump method requires far fewer parameters than conventional mixture models to describe it, and serves to identify the regions of the tree in which heterotachy is most pronounced. The reversible-jump procedure also removes the need for a posteriori tests of 'significance' such as the Akaike or Bayesian information criterion tests, or Bayes factors. Heterotachy has important consequences for the correct reconstruction of phylogenies as well as for tests of hypotheses that rely on accurate branch-length information. These include molecular clocks, analyses of tempo and mode of evolution, comparative studies and ancestral state reconstruction. The model is available from the authors' website, and can be used for the analysis of both nucleotide and morphological data.
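The central computation is that each site's likelihood is summed over more than one branch-length set on the same topology. The sketch below illustrates only that idea on a deliberately tiny case (two taxa, Jukes-Cantor substitution, two hypothetical branch-length sets and weights); it is not the authors' implementation, which handles arbitrary trees and a reversible-jump move over branches.

```python
import math

def jc69_transition(same, t):
    """JC69 probability of observing the same base, or one specific different
    base, after a total path length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    if same:
        return 0.25 + 0.75 * e
    return 0.25 - 0.25 * e

def site_likelihood(x, y, branch_lengths):
    """Likelihood of a two-taxon site pattern (x, y) under JC69: the stationary
    probability of x times the transition probability along the whole path."""
    t = sum(branch_lengths)
    return 0.25 * jc69_transition(x == y, t)

def mixture_site_likelihood(x, y, branch_length_sets, weights):
    """Heterotachy-style mixture: sum the site likelihood over several
    branch-length sets on the same (here trivial) topology."""
    return sum(w * site_likelihood(x, y, b)
               for b, w in zip(branch_length_sets, weights))

# Hypothetical 'slow' and 'fast' branch-length sets with hypothetical weights.
sets = [(0.05, 0.05), (0.5, 0.5)]
weights = [0.7, 0.3]
alignment = [("A", "A"), ("A", "G"), ("C", "C")]
log_lik = sum(math.log(mixture_site_likelihood(x, y, sets, weights))
              for x, y in alignment)
print(f"mixture log-likelihood = {log_lik:.4f}")
```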
Abstract:
The variogram is essential for local estimation and mapping of any variable by kriging. The variogram itself must usually be estimated from sample data. The sampling density is a compromise between precision and cost, but it must be sufficiently dense to encompass the principal spatial sources of variance. A nested, multi-stage sampling scheme with separating distances increasing in geometric progression from stage to stage will do that. The data may then be analyzed by a hierarchical analysis of variance to estimate the components of variance for every stage, and hence lag. By accumulating the components starting from the shortest lag one obtains a rough variogram for modest effort. For balanced designs the analysis of variance is optimal; for unbalanced ones, however, these estimators are not necessarily the best, and the analysis by residual maximum likelihood (REML) will usually be preferable. The paper summarizes the underlying theory and illustrates its application with data from three surveys, one in which the design had four stages and was balanced and two implemented with unbalanced designs to economize when there were more stages. A Fortran program is available for the analysis of variance, and code for the REML analysis is listed in the paper. (c) 2005 Elsevier Ltd. All rights reserved.
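The accumulation step can be written very compactly: the rough semivariance at the lag associated with a given stage is the sum of the estimated variance components for that stage and all finer stages. A minimal sketch with hypothetical components and lags (not the survey data analysed in the paper):

```python
# Hypothetical variance components from a nested (multi-stage) ANOVA,
# listed from the coarsest stage (largest separating distance) to the finest.
lags_m = [600.0, 190.0, 60.0, 19.0, 6.0]      # separating distance per stage (m)
components = [120.0, 80.0, 45.0, 30.0, 15.0]  # estimated variance component per stage

# Rough variogram: accumulate components starting from the shortest lag, so
# gamma(lag of stage j) = sum of the components for stage j and all finer stages.
semivariances = [sum(components[j:]) for j in range(len(lags_m))]

for lag, gamma in zip(lags_m, semivariances):
    print(f"lag {lag:7.1f} m   gamma ~ {gamma:6.1f}")
```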
Abstract:
Weeds tend to aggregate in patches within fields and there is evidence that this is partly owing to variation in soil properties. Because the processes driving soil heterogeneity operate at different scales, the strength of the relationships between soil properties and weed density would also be expected to be scale-dependent. Quantifying these effects of scale on weed patch dynamics is essential to guide the design of discrete sampling protocols for mapping weed distribution. We have developed a general method that uses novel within-field nested sampling and residual maximum likelihood (REML) estimation to explore scale-dependent relationships between weeds and soil properties. We have validated the method using a case study of Alopecurus myosuroides in winter wheat. Using REML, we partitioned the variance and covariance into scale-specific components and estimated the correlations between the weed counts and soil properties at each scale. We used variograms to quantify the spatial structure in the data and to map variables by kriging. Our methodology successfully captured the effect of scale on a number of edaphic drivers of weed patchiness. The overall Pearson correlations between A. myosuroides and soil organic matter and clay content were weak and masked the stronger correlations at >50 m. Knowing how the variance was partitioned across the spatial scales we optimized the sampling design to focus sampling effort at those scales that contributed most to the total variance. The methods have the potential to guide patch spraying of weeds by identifying areas of the field that are vulnerable to weed establishment.
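Once the variance and covariance have been partitioned into scale-specific components, the correlation at each scale follows directly from those components. A minimal sketch with hypothetical numbers (not the case-study estimates for A. myosuroides):

```python
# Hypothetical scale-specific variance components for weed counts (var_w),
# a soil property (var_s), and their covariance (cov_ws).
scales_m = [5, 50, 200]
var_w  = [0.80, 0.35, 0.20]
var_s  = [1.10, 0.60, 0.25]
cov_ws = [0.05, 0.30, 0.18]

for d, vw, vs, c in zip(scales_m, var_w, var_s, cov_ws):
    r = c / (vw * vs) ** 0.5   # scale-specific correlation
    print(f"scale {d:4d} m: r = {r:+.2f}")
```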
Abstract:
Systems Engineering often involves computer modelling the behaviour of proposed systems and their components. Where a component is human, fallibility must be modelled by a stochastic agent. The identification of a model of decision-making over quantifiable options is investigated using the game-domain of Chess. Bayesian methods are used to infer the distribution of players’ skill levels from the moves they play rather than from their competitive results. The approach is used on large sets of games by players across a broad FIDE Elo range, and is in principle applicable to any scenario where high-value decisions are being made under pressure.
Abstract:
Bayesian inference has been used to determine rigorous estimates of hydroxyl radical concentrations ([OH]) and air mass dilution rates (K) averaged following air masses between linked observations of nonmethane hydrocarbons (NMHCs) spanning the North Atlantic during the Intercontinental Transport and Chemical Transformation (ITCT)-Lagrangian-2K4 experiment. The Bayesian technique obtains a refined (posterior) distribution of a parameter given data related to the parameter through a model and prior beliefs about the parameter distribution. Here, the model describes hydrocarbon loss through OH reaction and mixing with a background concentration at rate K. The Lagrangian experiment provides direct observations of hydrocarbons at two time points, removing assumptions regarding composition or sources upstream of a single observation. The estimates are sharpened by using many hydrocarbons with different reactivities and accounting for their variability and measurement uncertainty. A novel technique is used to construct prior background distributions of many species, described by variation of a single parameter. This exploits the high correlation of species, related by the first principal component of many NMHC samples. The Bayesian method obtains posterior estimates of [OH], K and the background parameter following each air mass. Median [OH] values are typically between 0.5 and 2.0 × 10⁶ molecules cm⁻³, but are elevated to between 2.5 and 3.5 × 10⁶ molecules cm⁻³ in low-level pollution. A comparison of [OH] estimates from absolute NMHC concentrations and NMHC ratios assuming zero background (the "photochemical clock" method) shows similar distributions but reveals systematic high bias in the estimates from ratios. Estimates of K are ∼0.1 day⁻¹ but show more sensitivity to the prior distribution assumed.
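The loss model is first-order decay by OH reaction plus relaxation towards a background concentration at rate K, and a posterior over [OH] and K can be explored on a simple grid. The sketch below uses hypothetical rate constants, concentrations, uncertainties and flat priors; it only illustrates the structure of such an inference, not the ITCT analysis itself.

```python
import numpy as np

# Hypothetical setup: three NMHCs with different OH rate constants (illustrative).
k_oh = np.array([2.4e-13, 2.3e-12, 6.3e-12])   # cm^3 molecule^-1 s^-1
c0   = np.array([900.0, 250.0, 120.0])         # upstream concentrations (pptv)
c_bg = np.array([450.0, 60.0, 10.0])           # assumed background concentrations (pptv)
c1_obs = np.array([700.0, 130.0, 40.0])        # downstream observations (pptv)
sigma  = 0.1 * c1_obs                          # assumed variability + measurement uncertainty
dt = 2.0 * 86400.0                             # time between linked observations (s)

def forward(oh, K):
    """Concentration after dt under first-order OH loss plus mixing towards a
    constant background at rate K (s^-1)."""
    a = k_oh * oh + K
    c_eq = K * c_bg / a
    return c_eq + (c0 - c_eq) * np.exp(-a * dt)

# Grid over [OH] (molecules cm^-3) and K (s^-1); flat priors for simplicity.
oh_grid = np.linspace(0.1e6, 5e6, 200)
K_grid = np.linspace(0.0, 0.3 / 86400.0, 150)
log_post = np.empty((oh_grid.size, K_grid.size))
for i, oh in enumerate(oh_grid):
    for j, K in enumerate(K_grid):
        resid = (c1_obs - forward(oh, K)) / sigma
        log_post[i, j] = -0.5 * np.sum(resid ** 2)

post = np.exp(log_post - log_post.max())
post /= post.sum()
oh_mean = np.sum(oh_grid * post.sum(axis=1))
print(f"posterior mean [OH] ~ {oh_mean:.2e} molecules cm^-3")
```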
Abstract:
Data such as digitized aerial photographs, electrical conductivity and yield are intensive and relatively inexpensive to obtain compared with collecting soil data by sampling. If such ancillary data are co-regionalized with the soil data, they should be suitable for co-kriging. The latter requires that information for both variables is co-located at several locations; this is rarely so for soil and ancillary data. To solve this problem, we have derived values for the ancillary variable at the soil sampling locations by averaging the values within a radius of 15 m, taking the nearest-neighbour value, kriging over 5 m blocks, and punctual kriging. The cross-variograms from these data with clay content and also the pseudo cross-variogram were used to co-krige to validation points and the root mean squared errors (RMSEs) were calculated. In general, the data averaged within 15 m and the punctually kriged values resulted in more accurate predictions.
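Two of the co-location schemes compared above, averaging ancillary values within a 15 m radius of each soil sampling point and taking the nearest-neighbour value, are straightforward to express. A minimal sketch with synthetic coordinates (the block-kriged and punctually kriged versions would additionally need a variogram model and are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
ancillary_xy = rng.uniform(0, 100, size=(2000, 2))   # dense ancillary locations (m)
ancillary_z = np.sin(ancillary_xy[:, 0] / 10) + rng.normal(0, 0.1, 2000)
soil_xy = rng.uniform(0, 100, size=(30, 2))          # sparse soil sampling locations

def colocate(soil_pts, anc_xy, anc_z, radius=15.0):
    """For each soil point, return the mean ancillary value within `radius`
    and the nearest-neighbour ancillary value."""
    means, nearest = [], []
    for p in soil_pts:
        d = np.hypot(anc_xy[:, 0] - p[0], anc_xy[:, 1] - p[1])
        inside = d <= radius
        means.append(anc_z[inside].mean() if inside.any() else np.nan)
        nearest.append(anc_z[d.argmin()])
    return np.array(means), np.array(nearest)

mean15, nn = colocate(soil_xy, ancillary_xy, ancillary_z)
print(np.round(mean15[:5], 3), np.round(nn[:5], 3))
```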
Abstract:
It has been generally accepted that the method of moments (MoM) variogram, which has been widely applied in soil science, requires about 100 sites at an appropriate interval apart to describe the variation adequately. This sample size is often larger than can be afforded for soil surveys of agricultural fields or contaminated sites. Furthermore, it might be a much larger sample size than is needed where the scale of variation is large. A possible alternative in such situations is the residual maximum likelihood (REML) variogram because fewer data appear to be required. The REML method is parametric and is considered reliable where there is trend in the data because it is based on generalized increments that filter trend out and only the covariance parameters are estimated. Previous research has suggested that fewer data are needed to compute a reliable variogram using a maximum likelihood approach such as REML; however, the results can vary according to the nature of the spatial variation. There remain issues to examine: how many fewer data can be used, how should the sampling sites be distributed over the site of interest, and how do different degrees of spatial variation affect the data requirements? The soil of four field sites of different size, physiography, parent material and soil type was sampled intensively, and MoM and REML variograms were calculated for clay content. The data were then sub-sampled to give different sample sizes and distributions of sites and the variograms were computed again. The model parameters for the sets of variograms for each site were used for cross-validation. Predictions based on REML variograms were generally more accurate than those from MoM variograms with fewer than 100 sampling sites. A sample size of around 50 sites at an appropriate distance apart, possibly determined from variograms of ancillary data, appears adequate to compute REML variograms for kriging soil properties for precision agriculture and contaminated sites. (C) 2007 Elsevier B.V. All rights reserved.
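For reference, the method-of-moments (Matheron) estimator against which the REML variograms are compared averages half the squared differences of all point pairs within each lag bin. A minimal sketch on synthetic data with about 50 sites, as discussed above:

```python
import numpy as np

def mom_variogram(coords, values, lag_width, n_lags):
    """Method-of-moments (Matheron) semivariance:
    gamma(h) = 1 / (2 N(h)) * sum over pairs in lag bin h of (z_i - z_j)^2."""
    gamma = np.zeros(n_lags)
    counts = np.zeros(n_lags, dtype=int)
    n = len(values)
    for i in range(n - 1):
        for j in range(i + 1, n):
            h = np.linalg.norm(coords[i] - coords[j])
            b = int(h // lag_width)
            if b < n_lags:
                gamma[b] += 0.5 * (values[i] - values[j]) ** 2
                counts[b] += 1
    result = np.full(n_lags, np.nan)
    np.divide(gamma, counts.astype(float), out=result, where=counts > 0)
    return result, counts

rng = np.random.default_rng(1)
coords = rng.uniform(0, 500, size=(50, 2))                 # ~50 sampling sites
clay = 20 + 0.01 * coords[:, 0] + rng.normal(0, 2, 50)     # synthetic clay content (%)
gamma, counts = mom_variogram(coords, clay, lag_width=50.0, n_lags=8)
print(np.round(gamma, 2))
```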
Abstract:
This paper describes the user modeling component of EPIAIM, a consultation system for data analysis in epidemiology. The component is aimed at representing knowledge of concepts in the domain, so that their explanations can be adapted to user needs. The first part of the paper describes two studies aimed at analysing user requirements. The first one is a questionnaire study which examines the respondents' familiarity with concepts. The second one is an analysis of concept descriptions in textbooks and from expert epidemiologists, which examines how discourse strategies are tailored to the level of experience of the expected audience. The second part of the paper describes how the results of these studies have been used to design the user modeling component of EPIAIM. This module works in two steps. In the first step, a few trigger questions allow the activation of a stereotype that includes a "body" and an "inference component". The body is the representation of the body of knowledge that a class of users is expected to know, along with the probability that the knowledge is known. In the inference component, the learning process of concepts is represented as a belief network. Hence, in the second step the belief network is used to refine the initial default information in the stereotype's body. This is done by asking a few questions on those concepts where it is uncertain whether or not they are known to the user, and propagating this new evidence to revise the whole situation. The system has been implemented on a workstation under UNIX. An example of its operation is presented, and the advantages and limitations of the approach are discussed.
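The two-step idea, starting from stereotype default probabilities that a class of users knows each concept and then revising the most uncertain ones from a few targeted answers, can be illustrated with a single Bayes update per questioned concept. The sketch below is a deliberately simplified stand-in for the belief-network propagation described above (a real network would also propagate the evidence to related concepts); the concepts, probabilities and response model are hypothetical.

```python
# Step 1: stereotype "body": default probability that this class of user
# knows each concept (hypothetical values).
stereotype = {"odds ratio": 0.9, "confounding": 0.7, "logistic regression": 0.5}

# Assumed response model for a trigger question about a concept.
P_CORRECT_GIVEN_KNOWN = 0.9
P_CORRECT_GIVEN_UNKNOWN = 0.2

def update(prior, answered_correctly):
    """Bayes update of the probability that a concept is known,
    given the answer to one trigger question."""
    like_known = P_CORRECT_GIVEN_KNOWN if answered_correctly else 1 - P_CORRECT_GIVEN_KNOWN
    like_unknown = P_CORRECT_GIVEN_UNKNOWN if answered_correctly else 1 - P_CORRECT_GIVEN_UNKNOWN
    num = like_known * prior
    return num / (num + like_unknown * (1 - prior))

# Step 2: question only the concepts whose default probability is most uncertain
# (near 0.5) and revise the stereotype defaults with the answers.
answers = {"logistic regression": False}
user_model = {c: (update(p, answers[c]) if c in answers else p)
              for c, p in stereotype.items()}
print(user_model)
```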
Modeling of atmospheric effects on InSAR measurements by incorporating terrain elevation information
Abstract:
We propose an elevation-dependent calibration method to correct for the water vapour-induced delays over Mt. Etna that affect the interferometric synthetic aperture radar (InSAR) results. Water vapour delay fields are modelled from individual zenith delay estimates on a network of continuous GPS receivers. These are interpolated using simple kriging with varying local means over two domains, above and below 2 km in altitude. Test results with data from a meteorological station and 14 continuous GPS stations over Mt. Etna show that a reduction of the mean phase delay field of about 27% is achieved after the model is applied to a 35-day interferogram. (C) 2006 Elsevier Ltd. All rights reserved.
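Simple kriging with varying local means amounts to (i) predicting a local mean at each target from a deterministic function of elevation, (ii) simple kriging the station-minus-mean residuals with a covariance model, and (iii) adding the two. The sketch below is a generic illustration with synthetic station data, an assumed linear delay-elevation trend and an assumed exponential covariance over a single domain; it is not the Etna processing chain, which interpolates the two elevation domains separately.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "stations": positions (km), elevations (km) and zenith wet delays (cm).
xy = rng.uniform(0, 30, size=(14, 2))
elev = rng.uniform(0.1, 3.0, 14)
delay = 15.0 - 3.5 * elev + rng.normal(0, 0.4, 14)

# (i) Varying local mean from an assumed linear delay-elevation regression.
A = np.column_stack([np.ones_like(elev), elev])
coef, *_ = np.linalg.lstsq(A, delay, rcond=None)
resid = delay - A @ coef

# (ii) Simple kriging of the residuals with an assumed exponential covariance.
sill, corr_range = resid.var(), 10.0            # hypothetical covariance parameters

def cov(h):
    return sill * np.exp(-h / corr_range)

d_stations = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
C = cov(d_stations) + 1e-9 * np.eye(len(xy))    # tiny nugget for numerical stability

def predict(target_xy, target_elev):
    c0 = cov(np.linalg.norm(xy - target_xy, axis=1))
    weights = np.linalg.solve(C, c0)
    # (iii) local mean at the target plus the kriged residual
    return coef[0] + coef[1] * target_elev + weights @ resid

print(f"predicted zenith delay: {predict(np.array([12.0, 18.0]), 2.4):.2f} cm")
```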
Abstract:
Stephens and Donnelly have introduced a simple yet powerful importance sampling (IS) scheme for computing the likelihood in population genetic models. Fundamental to the method is an approximation to the conditional probability of the allelic type of an additional gene, given those currently in the sample. As noted by Li and Stephens, the product of these conditional probabilities for a sequence of draws that gives the frequency of allelic types in a sample is an approximation to the likelihood, and can be used directly in inference. The aim of this note is to demonstrate the high level of accuracy of "product of approximate conditionals" (PAC) likelihood when used with microsatellite data. Results obtained on simulated microsatellite data show that this strategy leads to a negligible bias over a wide range of the scaled mutation parameter theta. Furthermore, the sampling variance of likelihood estimates as well as the computation time are lower than those obtained with importance sampling on the whole range of theta. It follows that this approach represents an efficient substitute for IS algorithms in computer-intensive (e.g. MCMC) inference methods in population genetics. (c) 2006 Elsevier Inc. All rights reserved.
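The PAC construction orders the sampled genes and multiplies an approximate conditional probability of each gene's allelic type given the types already seen. The sketch below uses the simple Ewens/Hoppe-urn conditional (a previously unseen type with probability theta/(n + theta), otherwise proportional to the current counts) purely as an illustrative stand-in; the approximate conditional actually studied for microsatellites accounts for stepwise mutation and is more elaborate.

```python
import math
from collections import Counter

def pac_log_likelihood(sample, theta):
    """Product-of-approximate-conditionals log likelihood for an ordered sample
    of allelic types, using the Ewens/Hoppe-urn conditional as a simple
    illustrative approximation."""
    counts = Counter()
    log_lik = 0.0
    for n, allele in enumerate(sample):        # n genes already drawn
        if counts[allele] > 0:
            p = counts[allele] / (n + theta)   # copy an existing type
        else:
            p = theta / (n + theta)            # draw a previously unseen type
        log_lik += math.log(p)
        counts[allele] += 1
    return log_lik

# Hypothetical microsatellite sample (repeat counts) and a small grid of theta.
sample = [12, 12, 13, 12, 14, 13, 12, 15]
for theta in (0.5, 1.0, 2.0, 5.0):
    print(theta, round(pac_log_likelihood(sample, theta), 3))
```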
Abstract:
Event-related functional magnetic resonance imaging (efMRI) has emerged as a powerful technique for detecting the brain's responses to presented stimuli. A primary goal in efMRI data analysis is to estimate the Hemodynamic Response Function (HRF) and to locate activated regions in the human brain when specific tasks are performed. This paper develops new methodologies that are important improvements not only to parametric but also to nonparametric estimation and hypothesis testing of the HRF. First, an effective and computationally fast scheme for estimating the error covariance matrix for efMRI is proposed. Second, methodologies for estimation and hypothesis testing of the HRF are developed. Simulations support the effectiveness of our proposed methods. When applied to an efMRI dataset from an emotional control study, our method reveals more meaningful findings than the popular methods offered by AFNI and FSL. (C) 2008 Elsevier B.V. All rights reserved.
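As background to the kind of problem addressed here, a common nonparametric route to an HRF estimate is a finite impulse response (FIR) design matrix fitted by ordinary least squares. The sketch below shows that generic approach on synthetic data; it is not the estimators or the error-covariance scheme proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
TR, n_scans, hrf_len = 2.0, 200, 12              # seconds per volume, volumes, FIR bins

# Synthetic event onsets (in scans) and a "true" HRF used only to generate data.
onsets = np.sort(rng.choice(np.arange(5, n_scans - hrf_len), size=25, replace=False))
t = np.arange(hrf_len) * TR
true_hrf = t ** 2 * np.exp(-t / 1.5)
true_hrf /= true_hrf.max()

stimulus = np.zeros(n_scans)
stimulus[onsets] = 1.0

# FIR design matrix: one shifted copy of the stimulus per post-stimulus bin.
X = np.column_stack([np.roll(stimulus, k) for k in range(hrf_len)])
for k in range(hrf_len):                         # remove wrap-around from np.roll
    X[:k, k] = 0.0

y = X @ true_hrf + rng.normal(0, 0.3, n_scans)   # synthetic BOLD time series

hrf_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least-squares FIR estimate
print(np.round(hrf_hat, 2))
```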
Abstract:
The utility of an "ecologically rational" recognition-based decision rule in multichoice decision problems is analyzed, varying the type of judgment required (greater or lesser). The maximum size and range of a counterintuitive advantage associated with recognition-based judgment (the "less-is-more effect") are identified for a range of cue validity values. Greater ranges of the less-is-more effect occur when participants are asked which is the greatest of m choices (m > 2) than when asked which is the least. Less-is-more effects also have greater range for larger values of m. This implies that the classic two-alternative forced choice task, as studied by Goldstein and Gigerenzer (2002), may not be the most appropriate test case for less-is-more effects.
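For the classic two-alternative case, the expected accuracy of a recognition-based decision maker who recognizes n of N objects can be written in closed form, and evaluating it across n exposes the less-is-more effect whenever recognition validity alpha exceeds knowledge validity beta. A minimal sketch with hypothetical validities (the paper's analysis extends this to m > 2 alternatives):

```python
def accuracy(n, N, alpha, beta):
    """Expected proportion correct in a two-alternative comparison task when
    n of N objects are recognized, recognition validity is alpha and
    knowledge validity is beta (Goldstein & Gigerenzer's formulation)."""
    p_one_recognized = 2 * (n / N) * ((N - n) / (N - 1))     # use recognition
    p_both_recognized = (n / N) * ((n - 1) / (N - 1))        # use knowledge
    p_none_recognized = ((N - n) / N) * ((N - n - 1) / (N - 1))  # guess
    return p_one_recognized * alpha + p_both_recognized * beta + p_none_recognized * 0.5

N, alpha, beta = 100, 0.8, 0.6    # hypothetical validities with alpha > beta
for n in (0, 25, 50, 75, 100):
    print(f"n = {n:3d}: accuracy = {accuracy(n, N, alpha, beta):.3f}")
```

With these values, accuracy peaks at an intermediate n and is lower when all N objects are recognized, which is the less-is-more effect discussed above.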
Abstract:
The usefulness of any simulation of atmospheric tracers using low-resolution winds relies on both the dominance of large spatial scales in the strain and time dependence that results in a cascade in tracer scales. Here, a quantitative study on the accuracy of such tracer studies is made using the contour advection technique. It is shown that, although contour stretching rates are very insensitive to the spatial truncation of the wind field, the displacement errors in filament position are sensitive. A knowledge of displacement characteristics is essential if Lagrangian simulations are to be used for the inference of airmass origin. A quantitative lower estimate is obtained for the tracer scale factor (TSF): the ratio of the smallest resolved scale in the advecting wind field to the smallest "trustworthy" scale in the tracer field. For a baroclinic wave life cycle the TSF = 6.1 ± 0.3, while for the Northern Hemisphere wintertime lower stratosphere the TSF = 5.5 ± 0.5, when using the most stringent definition of the trustworthy scale. The similarity in the TSF for the two flows is striking, and an explanation is discussed in terms of the activity of potential vorticity (PV) filaments. Uncertainty in contour initialization is investigated for the stratospheric case. The effect of smoothing initial contours is to introduce a spin-up time (2–3 days), after which wind field truncation errors take over from initialization errors. It is also shown that false detail from the proliferation of fine-scale filaments limits the useful lifetime of such contour advection simulations to 3σ⁻¹ days, where σ is the filament thinning rate, unless filaments narrower than the trustworthy scale are removed by contour surgery. In addition, PV analysis error and diabatic effects are so strong that only PV filaments wider than 50 km are at all believable, even for very high-resolution winds. The minimum wind field resolution required to accurately simulate filaments down to the erosion scale in the stratosphere (given an initial contour) is estimated and the implications for the modeling of atmospheric chemistry are briefly discussed.