940 resultados para MCMC, Metropolis Hastings, Gibbs, Bayesian, OBMC, slice sampler, Python


Relevância:

100.00% 100.00%

Publicador:

Resumo:

This work investigates mathematical details and computational aspects of Metropolis-Hastings reptation quantum Monte Carlo and its variants, in addition to the Bounce method and its variants. The issues that concern us include the sensitivity of these algorithms' target densities to the position of the trial electron density along the reptile, time-reversal symmetry of the propagators, and the length of the reptile. We calculate the ground-state energy and one-electron properties of LiH at its equilibrium geometry for all these algorithms. The importance sampling is performed with a single-determinant large Slater-type orbitals (STO) basis set. The computer codes were written to exploit the efficiencies engineered into modern, high-performance computing software. Using the Bounce method in the calculation of non-energy-related properties, those represented by operators that do not commute with the Hamiltonian, is a novel work. We found that the unmodified Bounce gives good ground state energy and very good one-electron properties. We attribute this to its favourable time-reversal symmetry in its target density's Green's functions. Breaking this symmetry gives poorer results. Use of a short reptile in the Bounce method does not alter the quality of the results. This suggests that in future applications one can use a shorter reptile to cut down the computational time dramatically.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Cette thèse porte sur les phénomènes critiques survenant dans les modèles bidimensionnels sur réseau. Les résultats sont l'objet de deux articles : le premier porte sur la mesure d'exposants critiques décrivant des objets géométriques du réseau et, le second, sur la construction d'idempotents projetant sur des modules indécomposables de l'algèbre de Temperley-Lieb pour la chaîne de spins XXZ. Le premier article présente des expériences numériques Monte Carlo effectuées pour une famille de modèles de boucles en phase diluée. Baptisés "dilute loop models (DLM)", ceux-ci sont inspirés du modèle O(n) introduit par Nienhuis (1990). La famille est étiquetée par les entiers relativement premiers p et p' ainsi que par un paramètre d'anisotropie. Dans la limite thermodynamique, il est pressenti que le modèle DLM(p,p') soit décrit par une théorie logarithmique des champs conformes de charge centrale c(\kappa)=13-6(\kappa+1/\kappa), où \kappa=p/p' est lié à la fugacité du gaz de boucles \beta=-2\cos\pi/\kappa, pour toute valeur du paramètre d'anisotropie. Les mesures portent sur les exposants critiques représentant la loi d'échelle des objets géométriques suivants : l'interface, le périmètre externe et les liens rouges. L'algorithme Metropolis-Hastings employé, pour lequel nous avons introduit de nombreuses améliorations spécifiques aux modèles dilués, est détaillé. Un traitement statistique rigoureux des données permet des extrapolations coïncidant avec les prédictions théoriques à trois ou quatre chiffres significatifs, malgré des courbes d'extrapolation aux pentes abruptes. Le deuxième article porte sur la décomposition de l'espace de Hilbert \otimes^nC^2 sur lequel la chaîne XXZ de n spins 1/2 agit. La version étudiée ici (Pasquier et Saleur (1990)) est décrite par un hamiltonien H_{XXZ}(q) dépendant d'un paramètre q\in C^\times et s'exprimant comme une somme d'éléments de l'algèbre de Temperley-Lieb TL_n(q). Comme pour les modèles dilués, le spectre de la limite continue de H_{XXZ}(q) semble relié aux théories des champs conformes, le paramètre q déterminant la charge centrale. Les idempotents primitifs de End_{TL_n}\otimes^nC^2 sont obtenus, pour tout q, en termes d'éléments de l'algèbre quantique U_qsl_2 (ou d'une extension) par la dualité de Schur-Weyl quantique. Ces idempotents permettent de construire explicitement les TL_n-modules indécomposables de \otimes^nC^2. Ceux-ci sont tous irréductibles, sauf si q est une racine de l'unité. Cette exception est traitée séparément du cas où q est générique. Les problèmes résolus par ces articles nécessitent une grande variété de résultats et d'outils. Pour cette raison, la thèse comporte plusieurs chapitres préparatoires. Sa structure est la suivante. Le premier chapitre introduit certains concepts communs aux deux articles, notamment une description des phénomènes critiques et de la théorie des champs conformes. Le deuxième chapitre aborde brièvement la question des champs logarithmiques, l'évolution de Schramm-Loewner ainsi que l'algorithme de Metropolis-Hastings. Ces sujets sont nécessaires à la lecture de l'article "Geometric Exponents of Dilute Loop Models" au chapitre 3. Le quatrième chapitre présente les outils algébriques utilisés dans le deuxième article, "The idempotents of the TL_n-module \otimes^nC^2 in terms of elements of U_qsl_2", constituant le chapitre 5. La thèse conclut par un résumé des résultats importants et la proposition d'avenues de recherche qui en découlent.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A multivariate frailty hazard model is developed for joint-modeling of three correlated time-to-event outcomes: (1) local recurrence, (2) distant recurrence, and (3) overall survival. The term frailty is introduced to model population heterogeneity. The dependence is modeled by conditioning on a shared frailty that is included in the three hazard functions. Independent variables can be included in the model as covariates. The Markov chain Monte Carlo methods are used to estimate the posterior distributions of model parameters. The algorithm used in present application is the hybrid Metropolis-Hastings algorithm, which simultaneously updates all parameters with evaluations of gradient of log posterior density. The performance of this approach is examined based on simulation studies using Exponential and Weibull distributions. We apply the proposed methods to a study of patients with soft tissue sarcoma, which motivated this research. Our results indicate that patients with chemotherapy had better overall survival with hazard ratio of 0.242 (95% CI: 0.094 - 0.564) and lower risk of distant recurrence with hazard ratio of 0.636 (95% CI: 0.487 - 0.860), but not significantly better in local recurrence with hazard ratio of 0.799 (95% CI: 0.575 - 1.054). The advantages and limitations of the proposed models, and future research directions are discussed. ^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Thèse numérisée par la Direction des bibliothèques de l'Université de Montréal.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Thèse numérisée par la Direction des bibliothèques de l'Université de Montréal.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The recent advent of new technologies has led to huge amounts of genomic data. With these data come new opportunities to understand biological cellular processes underlying hidden regulation mechanisms and to identify disease related biomarkers for informative diagnostics. However, extracting biological insights from the immense amounts of genomic data is a challenging task. Therefore, effective and efficient computational techniques are needed to analyze and interpret genomic data. In this thesis, novel computational methods are proposed to address such challenges: a Bayesian mixture model, an extended Bayesian mixture model, and an Eigen-brain approach. The Bayesian mixture framework involves integration of the Bayesian network and the Gaussian mixture model. Based on the proposed framework and its conjunction with K-means clustering and principal component analysis (PCA), biological insights are derived such as context specific/dependent relationships and nested structures within microarray where biological replicates are encapsulated. The Bayesian mixture framework is then extended to explore posterior distributions of network space by incorporating a Markov chain Monte Carlo (MCMC) model. The extended Bayesian mixture model summarizes the sampled network structures by extracting biologically meaningful features. Finally, an Eigen-brain approach is proposed to analyze in situ hybridization data for the identification of the cell-type specific genes, which can be useful for informative blood diagnostics. Computational results with region-based clustering reveals the critical evidence for the consistency with brain anatomical structure.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The generalized Gibbs sampler (GGS) is a recently developed Markov chain Monte Carlo (MCMC) technique that enables Gibbs-like sampling of state spaces that lack a convenient representation in terms of a fixed coordinate system. This paper describes a new sampler, called the tree sampler, which uses the GGS to sample from a state space consisting of phylogenetic trees. The tree sampler is useful for a wide range of phylogenetic applications, including Bayesian, maximum likelihood, and maximum parsimony methods. A fast new algorithm to search for a maximum parsimony phylogeny is presented, using the tree sampler in the context of simulated annealing. The mathematics underlying the algorithm is explained and its time complexity is analyzed. The method is tested on two large data sets consisting of 123 sequences and 500 sequences, respectively. The new algorithm is shown to compare very favorably in terms of speed and accuracy to the program DNAPARS from the PHYLIP package.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

The multivariate skew-t distribution (J Multivar Anal 79:93-113, 2001; J R Stat Soc, Ser B 65:367-389, 2003; Statistics 37:359-363, 2003) includes the Student t, skew-Cauchy and Cauchy distributions as special cases and the normal and skew-normal ones as limiting cases. In this paper, we explore the use of Markov Chain Monte Carlo (MCMC) methods to develop a Bayesian analysis of repeated measures, pretest/post-test data, under multivariate null intercept measurement error model (J Biopharm Stat 13(4):763-771, 2003) where the random errors and the unobserved value of the covariate (latent variable) follows a Student t and skew-t distribution, respectively. The results and methods are numerically illustrated with an example in the field of dentistry.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Eukaryotic genomes display segmental patterns of variation in various properties, including GC content and degree of evolutionary conservation. DNA segmentation algorithms are aimed at identifying statistically significant boundaries between such segments. Such algorithms may provide a means of discovering new classes of functional elements in eukaryotic genomes. This paper presents a model and an algorithm for Bayesian DNA segmentation and considers the feasibility of using it to segment whole eukaryotic genomes. The algorithm is tested on a range of simulated and real DNA sequences, and the following conclusions are drawn. Firstly, the algorithm correctly identifies non-segmented sequence, and can thus be used to reject the null hypothesis of uniformity in the property of interest. Secondly, estimates of the number and locations of change-points produced by the algorithm are robust to variations in algorithm parameters and initial starting conditions and correspond to real features in the data. Thirdly, the algorithm is successfully used to segment human chromosome 1 according to GC content, thus demonstrating the feasibility of Bayesian segmentation of eukaryotic genomes. The software described in this paper is available from the author's website (www.uq.edu.au/similar to uqjkeith/) or upon request to the author.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

We develop, implement and study a new Bayesian spatial mixture model (BSMM). The proposed BSMM allows for spatial structure in the binary activation indicators through a latent thresholded Gaussian Markov random field. We develop a Gibbs (MCMC) sampler to perform posterior inference on the model parameters, which then allows us to assess the posterior probabilities of activation for each voxel. One purpose of this article is to compare the HJ model and the BSMM in terms of receiver operating characteristics (ROC) curves. Also we consider the accuracy of the spatial mixture model and the BSMM for estimation of the size of the activation region in terms of bias, variance and mean squared error. We perform a simulation study to examine the aforementioned characteristics under a variety of configurations of spatial mixture model and BSMM both as the size of the region changes and as the magnitude of activation changes.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size, despite huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.

Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.

One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.

Latent class models for the joint distribution of multivariate categorical, such as the PARAFAC decomposition, data play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.

In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.

Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.

The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo. The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov Chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.

Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Understanding how virus strains offer protection against closely related emerging strains is vital for creating effective vaccines. For many viruses, including Foot-and-Mouth Disease Virus (FMDV) and the Influenza virus where multiple serotypes often co-circulate, in vitro testing of large numbers of vaccines can be infeasible. Therefore the development of an in silico predictor of cross-protection between strains is important to help optimise vaccine choice. Vaccines will offer cross-protection against closely related strains, but not against those that are antigenically distinct. To be able to predict cross-protection we must understand the antigenic variability within a virus serotype, distinct lineages of a virus, and identify the antigenic residues and evolutionary changes that cause the variability. In this thesis we present a family of sparse hierarchical Bayesian models for detecting relevant antigenic sites in virus evolution (SABRE), as well as an extended version of the method, the extended SABRE (eSABRE) method, which better takes into account the data collection process. The SABRE methods are a family of sparse Bayesian hierarchical models that use spike and slab priors to identify sites in the viral protein which are important for the neutralisation of the virus. In this thesis we demonstrate how the SABRE methods can be used to identify antigenic residues within different serotypes and show how the SABRE method outperforms established methods, mixed-effects models based on forward variable selection or l1 regularisation, on both synthetic and viral datasets. In addition we also test a number of different versions of the SABRE method, compare conjugate and semi-conjugate prior specifications and an alternative to the spike and slab prior; the binary mask model. We also propose novel proposal mechanisms for the Markov chain Monte Carlo (MCMC) simulations, which improve mixing and convergence over that of the established component-wise Gibbs sampler. The SABRE method is then applied to datasets from FMDV and the Influenza virus in order to identify a number of known antigenic residue and to provide hypotheses of other potentially antigenic residues. We also demonstrate how the SABRE methods can be used to create accurate predictions of the important evolutionary changes of the FMDV serotypes. In this thesis we provide an extended version of the SABRE method, the eSABRE method, based on a latent variable model. The eSABRE method takes further into account the structure of the datasets for FMDV and the Influenza virus through the latent variable model and gives an improvement in the modelling of the error. We show how the eSABRE method outperforms the SABRE methods in simulation studies and propose a new information criterion for selecting the random effects factors that should be included in the eSABRE method; block integrated Widely Applicable Information Criterion (biWAIC). We demonstrate how biWAIC performs equally to two other methods for selecting the random effects factors and combine it with the eSABRE method to apply it to two large Influenza datasets. Inference in these large datasets is computationally infeasible with the SABRE methods, but as a result of the improved structure of the likelihood, we are able to show how the eSABRE method offers a computational improvement, leading it to be used on these datasets. The results of the eSABRE method show that we can use the method in a fully automatic manner to identify a large number of antigenic residues on a variety of the antigenic sites of two Influenza serotypes, as well as making predictions of a number of nearby sites that may also be antigenic and are worthy of further experiment investigation.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

BACKGROUND: The estimation of demographic parameters from genetic data often requires the computation of likelihoods. However, the likelihood function is computationally intractable for many realistic evolutionary models, and the use of Bayesian inference has therefore been limited to very simple models. The situation changed recently with the advent of Approximate Bayesian Computation (ABC) algorithms allowing one to obtain parameter posterior distributions based on simulations not requiring likelihood computations. RESULTS: Here we present ABCtoolbox, a series of open source programs to perform Approximate Bayesian Computations (ABC). It implements various ABC algorithms including rejection sampling, MCMC without likelihood, a Particle-based sampler and ABC-GLM. ABCtoolbox is bundled with, but not limited to, a program that allows parameter inference in a population genetics context and the simultaneous use of different types of markers with different ploidy levels. In addition, ABCtoolbox can also interact with most simulation and summary statistics computation programs. The usability of the ABCtoolbox is demonstrated by inferring the evolutionary history of two evolutionary lineages of Microtus arvalis. Using nuclear microsatellites and mitochondrial sequence data in the same estimation procedure enabled us to infer sex-specific population sizes and migration rates and to find that males show smaller population sizes but much higher levels of migration than females. CONCLUSION: ABCtoolbox allows a user to perform all the necessary steps of a full ABC analysis, from parameter sampling from prior distributions, data simulations, computation of summary statistics, estimation of posterior distributions, model choice, validation of the estimation procedure, and visualization of the results.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Geophysical methods have the potential to provide valuable information on hydrological properties in the unsaturated zone. In particular, time-lapse geophysical data, when coupled with a hydrological model and inverted stochastically, may allow for the effective estimation of subsurface hydraulic parameters and their corresponding uncertainties. In this study, we use a Bayesian Markov-chain-Monte-Carlo (MCMC) inversion approach to investigate how much information regarding vadose zone hydraulic properties can be retrieved from time-lapse crosshole GPR data collected at the Arrenaes field site in Denmark during a forced infiltration experiment.