929 results for: Asymptotic behaviour, Bayesian methods, Mixture models, Overfitting, Posterior concentration
Abstract:
BACKGROUND: The estimation of demographic parameters from genetic data often requires the computation of likelihoods. However, the likelihood function is computationally intractable for many realistic evolutionary models, and the use of Bayesian inference has therefore been limited to very simple models. The situation changed recently with the advent of Approximate Bayesian Computation (ABC) algorithms, which allow one to obtain parameter posterior distributions from simulations without requiring likelihood computations. RESULTS: Here we present ABCtoolbox, a series of open-source programs to perform Approximate Bayesian Computation (ABC). It implements various ABC algorithms, including rejection sampling, MCMC without likelihood, a particle-based sampler, and ABC-GLM. ABCtoolbox is bundled with, but not limited to, a program that allows parameter inference in a population-genetics context and the simultaneous use of different types of markers with different ploidy levels. In addition, ABCtoolbox can interact with most simulation and summary-statistics computation programs. The usability of ABCtoolbox is demonstrated by inferring the evolutionary history of two evolutionary lineages of Microtus arvalis. Using nuclear microsatellites and mitochondrial sequence data in the same estimation procedure enabled us to infer sex-specific population sizes and migration rates, and to find that males show smaller population sizes but much higher levels of migration than females. CONCLUSION: ABCtoolbox allows a user to perform all the steps of a full ABC analysis, from sampling parameters from prior distributions, through data simulation, computation of summary statistics, estimation of posterior distributions, model choice, and validation of the estimation procedure, to visualization of the results.
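Rejection sampling, the simplest of the ABC algorithms listed above, can be sketched in a few lines. This is a toy Gaussian-mean problem; the function names and settings are illustrative and unrelated to ABCtoolbox's actual interface:

```python
import random
import statistics

def abc_rejection(observed_summary, prior_sample, simulate, summary,
                  epsilon, n_draws):
    """Rejection-sampling ABC: keep prior draws whose simulated
    summary statistic lands within epsilon of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()            # 1. draw from the prior
        data = simulate(theta)            # 2. simulate a data set
        if abs(summary(data) - observed_summary) < epsilon:
            accepted.append(theta)        # 3. keep: approximate posterior draw
    return accepted

# Toy problem: infer the mean of a Gaussian with known sd = 1
# from an observed sample mean of 2.0.
random.seed(1)
post = abc_rejection(
    observed_summary=2.0,
    prior_sample=lambda: random.uniform(-5, 5),
    simulate=lambda mu: [random.gauss(mu, 1) for _ in range(50)],
    summary=statistics.mean,
    epsilon=0.2,
    n_draws=5000,
)
print(len(post), statistics.mean(post))  # accepted draws concentrate near 2.0
```

Shrinking epsilon tightens the approximation to the true posterior at the cost of a lower acceptance rate, which is the trade-off the more elaborate samplers (MCMC without likelihood, particle-based) are designed to ease.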
Abstract:
Aim To assess the geographical transferability of niche-based species distribution models fitted with two modelling techniques. Location Two distinct geographical study areas in Switzerland and Austria, in the subalpine and alpine belts. Methods Generalized linear and generalized additive models (GLM and GAM) with a binomial probability distribution and a logit link were fitted for 54 plant species, based on topoclimatic predictor variables. These models were then evaluated quantitatively and used for spatially explicit predictions within (internal evaluation and prediction) and between (external evaluation and prediction) the two regions. Comparisons of evaluations and spatial predictions between regions and models were conducted to test whether species and methods meet the criteria of full transferability. By full transferability, we mean that: (1) the internal evaluation of models fitted in regions A and B must be similar; (2) a model fitted in region A must at least retain a comparable external evaluation when projected into region B, and vice versa; and (3) internal and external spatial predictions have to match within both regions. Results The measures of model fit are, on average, 24% higher for GAMs than for GLMs in both regions. However, the differences between internal and external evaluations (AUC coefficient) are also higher for GAMs than for GLMs (a difference of 30% for models fitted in Switzerland and 54% for models fitted in Austria). Transferability, as measured with the AUC evaluation, fails for 68% of the species in Switzerland and 55% in Austria for GLMs (respectively for 67% and 53% of the species for GAMs). For both GAMs and GLMs, the agreement between internal and external predictions is rather weak on average (Kulczynski's coefficient in the range 0.3-0.4), but varies widely among individual species.
The dominant pattern is an asymmetrical transferability between the two study regions (a mean decrease of 20% for the AUC coefficient when the models are transferred from Switzerland and 13% when they are transferred from Austria). Main conclusions The large inter-specific variability observed among the 54 study species underlines the need to consider more than a few species to test properly the transferability of species distribution models. The pronounced asymmetry in transferability between the two study regions may be due to peculiarities of these regions, such as differences in the ranges of environmental predictors or the varied impact of land-use history, or to species-specific reasons like differential phenotypic plasticity, existence of ecotypes or varied dependence on biotic interactions that are not properly incorporated into niche-based models. The lower variation between internal and external evaluation of GLMs compared to GAMs further suggests that overfitting may reduce transferability. Overall, a limited geographical transferability calls for caution when projecting niche-based models for assessing the fate of species in future environments.
Abstract:
Risk maps summarizing landscape suitability of novel areas for invading species can be valuable tools for preventing species' invasions or controlling their spread, but methods employed for development of such maps remain variable and unstandardized. We discuss several considerations in development of such models, including types of distributional information that should be used, the nature of explanatory variables that should be incorporated, and caveats regarding model testing and evaluation. We highlight that, in the case of invasive species, such distributional predictions should aim to derive the best hypothesis of the potential distribution of the species by using (1) all distributional information available, including information from both the native range and other invaded regions; (2) predictors linked as directly as is feasible to the physiological requirements of the species; and (3) modelling procedures that carefully avoid overfitting to the training data. Finally, model testing and evaluation should focus on well-predicted presences, and less on efficient prediction of absences; a k-fold regional cross-validation test is discussed.
Abstract:
This paper presents a validation study on statistical nonsupervised brain tissue classification techniques in magnetic resonance (MR) images. Several image models assuming different hypotheses regarding the intensity distribution model, the spatial model and the number of classes are assessed. The methods are tested on simulated data for which the classification ground truth is known. Different noise and intensity nonuniformities are added to simulate real imaging conditions. No enhancement of the image quality is considered either before or during the classification process. This way, the accuracy of the methods and their robustness against image artifacts are tested. Classification is also performed on real data where a quantitative validation compares the methods' results with an estimated ground truth from manual segmentations by experts. Validity of the various classification methods in the labeling of the image as well as in the tissue volume is estimated with different local and global measures. Results demonstrate that methods relying on both intensity and spatial information are more robust to noise and field inhomogeneities. We also demonstrate that partial volume is not perfectly modeled, even though methods that account for mixture classes outperform methods that only consider pure Gaussian classes. Finally, we show that simulated data results can also be extended to real data.
Abstract:
One of the most central tasks in the statistical analysis of mathematical models is the estimation of the models' unknown parameters. This Master's thesis is concerned with the distributions of unknown parameters and with numerical methods suitable for constructing them, especially in cases where the model is nonlinear with respect to its parameters. Among the various numerical methods, the main emphasis is on Markov chain Monte Carlo (MCMC) methods. These computationally intensive methods have recently grown in popularity, largely because of increased computing power. The theory of both Markov chains and Monte Carlo simulation is presented to the extent needed to justify why the methods work. Among recently developed methods, adaptive MCMC methods are examined in particular. The approach of the thesis is practical, and various issues related to the implementation of MCMC methods are emphasized. In the empirical part of the thesis, the distributions of the unknown parameters of five example models are examined using the methods presented in the theoretical part. The models describe chemical reactions and are expressed as systems of ordinary differential equations. The models were collected from chemists at Lappeenranta University of Technology and Åbo Akademi University in Turku.
Abstract:
This study offers a statistical analysis of the persistence of annual profits across a sample of firms from different European Union (EU) countries. To this end, a Bayesian dynamic model has been used which enables the annual behaviour of those profits to be broken down into a permanent structural component on the one hand and a transitory component on the other, while also distinguishing between general effects affecting the industry as a whole to which each firm belongs and specific effects affecting each firm in particular. This breakdown enables the relative importance of those fundamental components to be evaluated. The data analysed come from a sample of 23,293 firms in EU countries selected from the AMADEUS database. The period analysed ran from 1999 to 2007, and 21 sectors were analysed, chosen in such a way that there was a sufficiently large number of firms in each country-sector combination for the industry effects to be estimated accurately enough for meaningful comparisons to be made by sector and country. The analysis has been conducted by sector and by country from a Bayesian perspective, thus making the study more flexible and realistic, since the estimates obtained do not depend on asymptotic results. In general terms, the study finds that, although the industry effects are significant, the specific effects are more important. Their importance varies depending on the sector or the country in which the firm carries out its activity. The influence of firm effects accounts for more than 90% of total variation and displays a significantly lower degree of persistence, with adjustment speeds oscillating around 51.1%. However, this pattern is not homogeneous but depends on the sector and country analysed. Industry effects have a more marginal importance and are significantly more persistent, with adjustment speeds oscillating around 10%, this degree of persistence being more homogeneous at both country and sector levels.
Abstract:
Probabilistic inversion methods based on Markov chain Monte Carlo (MCMC) simulation are well suited to quantify parameter and model uncertainty of nonlinear inverse problems. Yet, application of such methods to CPU-intensive forward models can be a daunting task, particularly if the parameter space is high dimensional. Here, we present a 2-D pixel-based MCMC inversion of plane-wave electromagnetic (EM) data. Using synthetic data, we investigate how model parameter uncertainty depends on model structure constraints using different norms of the likelihood function and the model constraints, and study the added benefits of joint inversion of EM and electrical resistivity tomography (ERT) data. Our results demonstrate that model structure constraints are necessary to stabilize the MCMC inversion results of a highly discretized model. These constraints decrease model parameter uncertainty and facilitate model interpretation. A drawback is that these constraints may lead to posterior distributions that do not fully include the true underlying model, because some of its features exhibit a low sensitivity to the EM data, and hence are difficult to resolve. This problem can be partly mitigated if the plane-wave EM data is augmented with ERT observations. The hierarchical Bayesian inverse formulation introduced and used herein is able to successfully recover the probabilistic properties of the measurement data errors and a model regularization weight. Application of the proposed inversion methodology to field data from an aquifer demonstrates that the posterior mean model realization is very similar to that derived from a deterministic inversion with similar model constraints.
Abstract:
Our consumption of groundwater, in particular as drinking water and for irrigation, has considerably increased over the years, and groundwater is becoming an increasingly scarce and endangered resource. Nowadays, we are facing many problems ranging from water prospection to sustainable management and remediation of polluted aquifers. Independently of the hydrogeological problem, the main challenge remains dealing with the incomplete knowledge of the underground properties. Stochastic approaches have been developed to represent this uncertainty by considering multiple geological scenarios and generating a large number of realizations. The main limitation of this approach is the computational cost associated with performing complex flow simulations in each realization. In the first part of the thesis, we explore this issue in the context of uncertainty propagation, where an ensemble of geostatistical realizations is identified as representative of the subsurface uncertainty. To propagate this lack of knowledge to the quantity of interest (e.g., the concentration of pollutant in extracted water), it is necessary to evaluate the flow response of each realization. Due to computational constraints, state-of-the-art methods make use of approximate flow simulations to identify a subset of realizations that represents the variability of the ensemble. The complex and computationally heavy flow model is then run for this subset, based on which inference is made. Our objective is to increase the performance of this approach by using all of the available information and not solely the subset of exact responses. Two error models are proposed to correct the approximate responses following a machine-learning approach. For the subset identified by a classical approach (here the distance kernel method), both the approximate and the exact responses are known. This information is used to construct an error model and correct the ensemble of approximate responses to predict the "expected" responses of the exact model. The proposed methodology makes use of all the available information without perceptible additional computational costs and leads to an increase in accuracy and robustness of the uncertainty propagation. The strategy explored in the first chapter consists in learning from a subset of realizations the relationship between proxy and exact curves. In the second part of this thesis, the strategy is formalized in a rigorous mathematical framework by defining a regression model between functions. As this problem is ill-posed, it is necessary to reduce its dimensionality. The novelty of the work comes from the use of functional principal component analysis (FPCA), which not only performs the dimensionality reduction while maximizing the retained information, but also allows a diagnostic of the quality of the error model in the functional space. The proposed methodology is applied to a pollution problem involving a non-aqueous phase liquid. The error model allows a strong reduction of the computational cost while providing a good estimate of the uncertainty. The individual correction of the proxy response by the error model leads to an excellent prediction of the exact response, opening the door to many applications. The concept of a functional error model is useful not only in the context of uncertainty propagation, but also, and maybe even more so, to perform Bayesian inference. Markov chain Monte Carlo (MCMC) algorithms are the most common choice to ensure that the generated realizations are sampled in accordance with the observations. However, this approach suffers from a low acceptance rate in high-dimensional problems, resulting in a large number of wasted flow simulations. This led to the introduction of two-stage MCMC, where the computational cost is decreased by avoiding unnecessary simulations of the exact flow model thanks to a preliminary evaluation of the proposal. In the third part of the thesis, a proxy is coupled to an error model to provide an approximate response for the two-stage MCMC set-up. We demonstrate an increase in acceptance rate by a factor of 1.5 to 3 with respect to one-stage MCMC. An open question remains: how do we choose the size of the learning set and identify the realizations that optimize the construction of the error model? This requires devising an iterative strategy to construct the error model, such that, as new flow simulations are performed, the error model is iteratively improved by incorporating the new information. This is discussed in the fourth part of the thesis, in which we apply this methodology to a problem of saline intrusion in a coastal aquifer.
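The two-stage MCMC scheme described above, in which each proposal is first screened with the cheap proxy and the exact model is run only for the survivors, can be sketched as follows. Toy one-dimensional densities stand in for the proxy and the exact flow model; all names and settings are illustrative:

```python
import math
import random

def two_stage_mcmc(log_exact, log_proxy, proposal, x0, n_iter):
    """Two-stage (delayed-acceptance) Metropolis: a proposal must first
    pass an accept test under the cheap proxy density; only survivors
    trigger an expensive exact-model evaluation, and the second-stage
    ratio corrects for the proxy so the chain still targets the exact
    posterior."""
    x, chain, exact_calls = x0, [], 0
    for _ in range(n_iter):
        y = proposal(x)
        # Stage 1: preliminary screening with the proxy.
        if math.log(random.random()) < log_proxy(y) - log_proxy(x):
            # Stage 2: evaluate the exact model only for survivors.
            exact_calls += 1
            a = (log_exact(y) - log_exact(x)
                 + log_proxy(x) - log_proxy(y))
            if math.log(random.random()) < a:
                x = y
        chain.append(x)
    return chain, exact_calls

# Toy set-up: the exact target is N(0, 1); the proxy is a slightly
# biased N(0.1, 1), standing in for "proxy model + error model".
random.seed(0)
chain, calls = two_stage_mcmc(
    log_exact=lambda x: -0.5 * x * x,
    log_proxy=lambda x: -0.5 * (x - 0.1) ** 2,
    proposal=lambda x: x + random.gauss(0, 1),
    x0=0.0,
    n_iter=20000,
)
print(calls, "exact-model calls for", len(chain), "iterations")
```

Every proposal rejected at stage 1 saves one exact simulation, which is exactly where the computational gain of the approach comes from; the better the proxy plus error model approximates the exact response, the fewer exact calls are wasted on proposals that will be rejected anyway.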
Abstract:
1. Species distribution models (SDMs) have become a standard tool in ecology and applied conservation biology. Modelling rare and threatened species is particularly important for conservation purposes. However, modelling rare species is difficult because the combination of few occurrences and many predictor variables easily leads to model overfitting. A new strategy using ensembles of small models was recently developed in an attempt to overcome this limitation of rare species modelling and has so far been tested successfully for only a single species. Here, we aim to test the approach more comprehensively on a large number of species, including a transferability assessment. 2. For each species, numerous small (here bivariate) models were calibrated, evaluated and averaged to an ensemble weighted by AUC scores. These 'ensembles of small models' (ESMs) were compared to standard SDMs using three commonly used modelling techniques (GLM, GBM, Maxent) and their ensemble prediction. We tested 107 rare and under-sampled plant species of conservation concern in Switzerland. 3. We show that ESMs performed significantly better than standard SDMs. The rarer the species, the more pronounced the effects were. ESMs were also superior to standard SDMs and their ensemble when they were independently evaluated using a transferability assessment. 4. By averaging simple small models to an ensemble, ESMs avoid overfitting, without losing explanatory power, by reducing the number of predictor variables. They further improve the reliability of species distribution models, especially for rare species, and thus help to overcome limitations of modelling rare species.
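The ESM recipe in point 2, calibrating one small bivariate model per predictor pair and averaging them with AUC-based weights, can be sketched as follows. This is a toy logistic-GLM implementation on simulated presence/absence data; rescaling AUC by subtracting 0.5 is one common weighting choice, not necessarily the authors' exact scheme:

```python
import itertools
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain stochastic-gradient logistic regression (bias + weights)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            g = 1 / (1 + math.exp(-z)) - yi   # gradient of the log-loss
            w[0] -= lr * g
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * g * xj
    return w

def predict(w, xi):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 / (1 + math.exp(-z))

def auc(scores, labels):
    """Rank-based AUC: probability that a random presence outscores
    a random absence (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def esm_predict(X, y, X_new):
    """Ensemble of small models: one bivariate GLM per predictor pair,
    averaged with AUC-based weights (AUC rescaled so that a no-skill
    model, AUC = 0.5, gets zero weight)."""
    n_pred = len(X[0])
    models = []
    for i, j in itertools.combinations(range(n_pred), 2):
        Xij = [[row[i], row[j]] for row in X]
        w = fit_logistic(Xij, y)
        a = auc([predict(w, xi) for xi in Xij], y)
        models.append(((i, j), w, max(a - 0.5, 0.0)))
    total = sum(wt for _, _, wt in models) or 1.0
    return [sum(wt * predict(w, [row[i], row[j]])
                for (i, j), w, wt in models) / total
            for row in X_new]

# Toy data: presence/absence driven by the first two of four predictors.
random.seed(2)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [int(row[0] + 0.5 * row[1] + random.gauss(0, 0.5) > 0) for row in X]
preds = esm_predict(X, y, X)
print(round(auc(preds, y), 2))  # in-sample AUC of the weighted ensemble
```

Because every member model sees only two predictors, no single model can overfit a large predictor set, while the AUC weighting down-weights uninformative pairs; this is the mechanism behind point 4.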
Abstract:
Over the past few decades, age estimation of living persons has represented a challenging task for many forensic services worldwide. In general, the process for age estimation includes the observation of the degree of maturity reached by some physical attributes, such as dentition or several ossification centers. The estimated chronological age, or the probability that an individual belongs to a meaningful class of ages, is then obtained from the observed degree of maturity by means of various statistical methods. Among these methods, those developed in a Bayesian framework offer users the possibility of coherently dealing with the uncertainty associated with age estimation and of assessing, in a transparent and logical way, the probability that an examined individual is younger or older than a given age threshold. Recently, a Bayesian network for age estimation has been presented in the scientific literature; this kind of probabilistic graphical tool may facilitate the use of the probabilistic approach. Probabilities of interest in the network are assigned by means of transition analysis, a statistical parametric model which links the chronological age and the degree of maturity through specific regression models, such as logit or probit models. Since different regression models can be employed in transition analysis, the aim of this paper is to study the influence of the model on the classification of individuals. The analysis was performed using a dataset related to the ossification status of the medial clavicular epiphysis, and the results support the conclusion that the classification of individuals does not depend on the choice of the regression model.
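A minimal sketch of the Bayesian classification step described above: a logit transition model links age to the probability that a maturity stage has been reached, and Bayes' rule converts the observed stage into the probability of exceeding an age threshold. The parameters, prior range and grid below are illustrative assumptions, not values from the paper:

```python
import math

def p_stage_given_age(age, mu=20.0, s=1.5):
    """Logit transition model: probability that a maturity stage
    (e.g. complete fusion of the medial clavicular epiphysis) has been
    reached at a given age. mu and s are purely illustrative, not
    published transition-analysis estimates."""
    return 1 / (1 + math.exp(-(age - mu) / s))

def p_age_over_threshold(stage_reached, threshold=18.0):
    """Bayes' rule on a discretized uniform age prior over [10, 30]:
    P(age >= threshold | observed maturity stage)."""
    ages = [10 + 0.1 * k for k in range(201)]
    like = [p_stage_given_age(a) if stage_reached
            else 1 - p_stage_given_age(a) for a in ages]
    z = sum(like)  # normalizing constant
    return sum(l for a, l in zip(ages, like) if a >= threshold) / z

print(p_age_over_threshold(True))   # high: complete fusion favours age >= 18
print(p_age_over_threshold(False))  # low: incomplete fusion favours age < 18
```

Swapping the logit for a probit curve changes only `p_stage_given_age`, which is exactly the kind of substitution whose effect on classification the paper investigates.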
Abstract:
OBJECTIVE: We evaluated whether regional differences in physical activity (PA) and sedentary behaviour (SB) existed along language boundaries within Switzerland and whether potential differences would be explained by socio-demographics or environmental characteristics. METHODS: We combined data on 611 children aged 4 to 7 years from four regional studies. PA and SB were assessed by accelerometers. Information about the socio-demographic background was obtained by questionnaires. Objective neighbourhood attributes could be linked to home addresses. Multivariate regression models were used to test associations between PA and SB and socio-demographic characteristics and neighbourhood attributes. RESULTS: Children from the German-speaking compared to the French-speaking region were more physically active and less sedentary (by 10-15 %, p < 0.01). Although German-speaking children lived in a more favourable environment and a higher socioeconomic neighbourhood (differences p < 0.001), these characteristics did not explain the differences in PA behaviour between French- and German-speaking children. CONCLUSIONS: Factors related to the language region, which might be culturally rooted, were among the strongest correlates of PA and SB among Swiss children, independent of individual, social and environmental factors.
Abstract:
Construction of multiple sequence alignments is a fundamental task in bioinformatics. Multiple sequence alignments are used as a prerequisite in many bioinformatics methods, and consequently the quality of such methods can be critically dependent on the quality of the alignment. However, automatic construction of a multiple sequence alignment for a set of remotely related sequences does not always provide biologically relevant alignments. Therefore, there is a need for an objective approach for evaluating the quality of automatically aligned sequences. The profile hidden Markov model is a powerful approach in comparative genomics. In the profile hidden Markov model, the symbol probabilities are estimated at each conserved alignment position. This can increase the dimension of the parameter space and cause an overfitting problem. These two research problems are both related to conservation. We have developed statistical measures for quantifying the conservation of multiple sequence alignments. Two types of methods are considered: those identifying conserved residues in an alignment position, and those calculating positional conservation scores. The positional conservation score was exploited in a statistical prediction model for assessing the quality of multiple sequence alignments. The residue conservation score was used as part of the emission probability estimation method proposed for profile hidden Markov models. The predicted alignment quality scores correlated highly with the correct alignment quality scores, indicating that our method is reliable for assessing the quality of any multiple sequence alignment. The comparison of the emission probability estimation method with the maximum likelihood method showed that the number of estimated parameters in the model was dramatically decreased, while the same level of accuracy was maintained.
To conclude, we have shown that conservation can be successfully used in the statistical model for alignment quality assessment and in the estimation of emission probabilities in the profile hidden Markov models.
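As an illustration of a positional conservation score of the kind discussed above, here is a generic Shannon-entropy-based column score. The thesis's own measures are not specified in the abstract, so this is a common stand-in rather than the proposed method:

```python
import math
from collections import Counter

def positional_conservation(column, alphabet_size=20):
    """Shannon-entropy-based conservation score for one alignment
    column, rescaled to [0, 1] (1 = fully conserved). A generic score
    of this family, assuming a 20-letter amino-acid alphabet."""
    counts = Counter(c for c in column if c != '-')   # ignore gap characters
    n = sum(counts.values())
    if n == 0:
        return 0.0                                    # all-gap column
    h = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return 1 - h / math.log2(alphabet_size)

# One string per alignment column (residues read down the column).
print(positional_conservation("AAAAA"))   # fully conserved column -> 1.0
print(positional_conservation("ARNDC"))   # highly variable column -> low score
```

Averaging such per-column scores over an alignment gives a single quality indicator, which is the role the positional conservation score plays in the quality-prediction model described above.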
Abstract:
In mathematical modeling, the estimation of the model parameters is one of the most common problems. The goal is to seek parameters that fit the measurements as well as possible. There is always error in the measurements, which implies uncertainty in the model estimates. In Bayesian statistics, all the unknown quantities are presented as probability distributions. If there is knowledge about the parameters beforehand, it can be formulated as a prior distribution. Bayes' rule combines the prior and the measurements into the posterior distribution. Mathematical models are typically nonlinear, so producing statistics for them requires efficient sampling algorithms. In this thesis, the Metropolis-Hastings (MH) and Adaptive Metropolis (AM) algorithms, as well as Gibbs sampling, are introduced. Different ways to specify prior distributions are also presented. The main issue is the estimation of the measurement error and how to obtain prior knowledge of its variance or covariance. Variance and covariance sampling is combined with the algorithms above. As examples, the hyperprior models are applied to the estimation of model parameters and measurement error in a case involving outliers.
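The Metropolis-Hastings algorithm introduced in the thesis can be sketched in its random-walk form. The target here is a toy standard normal so the sampler can be checked against the known mean and variance; the step size and settings are illustrative:

```python
import math
import random

def metropolis_hastings(log_target, x0, step, n_iter):
    """Random-walk Metropolis-Hastings: propose a Gaussian step and
    accept with probability min(1, target(y) / target(x)); on
    rejection, the current state is repeated in the chain."""
    x, chain, accepted = x0, [], 0
    for _ in range(n_iter):
        y = x + random.gauss(0, step)
        if math.log(random.random()) < log_target(y) - log_target(x):
            x, accepted = y, accepted + 1
        chain.append(x)
    return chain, accepted / n_iter

# Toy posterior: a standard normal (log-density up to a constant).
random.seed(3)
chain, rate = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 2.4, 20000)
print(rate)  # roughly 0.4 to 0.5 for this step size
```

Adaptive Metropolis differs only in that the proposal step is tuned on the fly from the covariance of the chain so far, removing the need to hand-pick `step`.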
Abstract:
In this paper, we obtain sharp asymptotic formulas with error estimates for the Mellin convolution of functions defined on (0, ∞), and use these formulas to characterize the asymptotic behavior of marginal distribution densities of stock price processes in mixed stochastic models. Special examples of mixed models are jump-diffusion models and stochastic volatility models with jumps. We apply our general results to the Heston model with double exponential jumps, and make a detailed analysis of the asymptotic behavior of the stock price density, the call option pricing function, and the implied volatility in this model. We also obtain similar results for the Heston model with jumps distributed according to the NIG law.
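For reference, the Mellin convolution in question is the multiplicative analogue of the ordinary convolution: for functions f and g on (0, ∞),

```latex
(f \star g)(x) \;=\; \int_0^{\infty} f\!\left(\frac{x}{y}\right) g(y)\,\frac{dy}{y}, \qquad x > 0 .
```

It arises naturally here because the density of a product of independent positive random factors, such as a continuous diffusion part multiplied by a jump part in a mixed model, is exactly the Mellin convolution of the factors' densities.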
Abstract:
Stratospheric ozone can be measured accurately using a limb scatter remote sensing technique in the UV-visible spectral region of solar light. The advantages of this technique include a good vertical resolution and a good daytime coverage of the measurements. In addition to ozone, UV-visible limb scatter measurements contain information about NO2, NO3, OClO, BrO and aerosols. There are currently several satellite instruments continuously scanning the atmosphere and measuring the UV-visible region of the spectrum, e.g., the Optical Spectrograph and Infrared Imager System (OSIRIS) launched on the Odin satellite in February 2001, and the Scanning Imaging Absorption SpectroMeter for Atmospheric CartograpHY (SCIAMACHY) launched on Envisat in March 2002. Envisat also carries the Global Ozone Monitoring by Occultation of Stars (GOMOS) instrument, which also measures limb-scattered sunlight under bright limb occultation conditions. These conditions occur during daytime occultation measurements. The global coverage of the satellite measurements is far better than that of any other ozone measurement technique, but the measurements are still sparse in the spatial domain. Measurements are also repeated relatively rarely over a given area, and the composition of the Earth's atmosphere changes dynamically. Assimilation methods are therefore needed in order to combine the information of the measurements with the atmospheric model. In recent years, the focus of assimilation algorithm research has turned towards filtering methods. The traditional extended Kalman filter (EKF) method takes into account not only the uncertainty of the measurements, but also the uncertainty of the evolution model of the system. However, the computational cost of a full-blown EKF increases rapidly as the number of model parameters increases. Therefore, the EKF method cannot be applied directly to the stratospheric ozone assimilation problem.
The work in this thesis is devoted to the development of inversion methods for satellite instruments and the development of assimilation methods used with atmospheric models.
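The measurement-update step of the standard Kalman filter underlying the EKF mentioned above shows where the cost comes from. With forecast state \(x_k^-\), forecast covariance \(P_k^-\), linearized observation operator \(H_k\) and measurement-error covariance \(R_k\),

```latex
K_k = P_k^- H_k^{\top}\left(H_k P_k^- H_k^{\top} + R_k\right)^{-1}, \qquad
x_k = x_k^- + K_k\left(y_k - H_k x_k^-\right), \qquad
P_k = \left(I - K_k H_k\right) P_k^- .
```

Each update manipulates covariance matrices whose dimension equals the number of model parameters, so the cost grows rapidly with the state size, which is why, as noted above, a full-blown EKF cannot be applied directly to the stratospheric ozone assimilation problem.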