259 resultados para MCMC
Resumo:
Monte Carlo (MC) methods are widely used in signal processing, machine learning and stochastic optimization. A well-known class of MC methods are Markov Chain Monte Carlo (MCMC) algorithms. In this work, we introduce a novel parallel interacting MCMC scheme, where the parallel chains share information using another MCMC technique working on the entire population of current states. These parallel ?vertical? chains are led by random-walk proposals, whereas the ?horizontal? MCMC uses a independent proposal, which can be easily adapted by making use of all the generated samples. Numerical results show the advantages of the proposed sampling scheme in terms of mean absolute error, as well as robustness w.r.t. to initial values and parameter choice.
Resumo:
In this paper we investigate a Bayesian procedure for the estimation of a flexible generalised distribution, notably the MacGillivray adaptation of the g-and-κ distribution. This distribution, described through its inverse cdf or quantile function, generalises the standard normal through extra parameters which together describe skewness and kurtosis. The standard quantile-based methods for estimating the parameters of generalised distributions are often arbitrary and do not rely on computation of the likelihood. MCMC, however, provides a simulation-based alternative for obtaining the maximum likelihood estimates of parameters of these distributions or for deriving posterior estimates of the parameters through a Bayesian framework. In this paper we adopt the latter approach, The proposed methodology is illustrated through an application in which the parameter of interest is slightly skewed.
Resumo:
This research explores Bayesian updating as a tool for estimating parameters probabilistically by dynamic analysis of data sequences. Two distinct Bayesian updating methodologies are assessed. The first approach focuses on Bayesian updating of failure rates for primary events in fault trees. A Poisson Exponentially Moving Average (PEWMA) model is implemnented to carry out Bayesian updating of failure rates for individual primary events in the fault tree. To provide a basis for testing of the PEWMA model, a fault tree is developed based on the Texas City Refinery incident which occurred in 2005. A qualitative fault tree analysis is then carried out to obtain a logical expression for the top event. A dynamic Fault Tree analysis is carried out by evaluating the top event probability at each Bayesian updating step by Monte Carlo sampling from posterior failure rate distributions. It is demonstrated that PEWMA modeling is advantageous over conventional conjugate Poisson-Gamma updating techniques when failure data is collected over long time spans. The second approach focuses on Bayesian updating of parameters in non-linear forward models. Specifically, the technique is applied to the hydrocarbon material balance equation. In order to test the accuracy of the implemented Bayesian updating models, a synthetic data set is developed using the Eclipse reservoir simulator. Both structured grid and MCMC sampling based solution techniques are implemented and are shown to model the synthetic data set with good accuracy. Furthermore, a graphical analysis shows that the implemented MCMC model displays good convergence properties. A case study demonstrates that Likelihood variance affects the rate at which the posterior assimilates information from the measured data sequence. Error in the measured data significantly affects the accuracy of the posterior parameter distributions. Increasing the likelihood variance mitigates random measurement errors, but casuses the overall variance of the posterior to increase. Bayesian updating is shown to be advantageous over deterministic regression techniques as it allows for incorporation of prior belief and full modeling uncertainty over the parameter ranges. As such, the Bayesian approach to estimation of parameters in the material balance equation shows utility for incorporation into reservoir engineering workflows.
Resumo:
Tässä tutkielmassa rintasyövän ja ruoansulatuselinten syöpien perheittäistä kertymistä ja perimäosuutta estimoitiin lapsena tai nuorena syövän sairastaneiden suomalaisten perheaineistoissa. Perheet poimittiin siten, että jokaisessa perheessä on vähintään yksi alle 40-vuotiaana diagnosoitu syöpätapaus vuosina 1970-2012. Rintasyöpäaineisto koostui 4921 perheestä, joissa oli kaikkiaan 26 259 henkilöä. Ruoansulatuselinten aineisto puolestaan koostui 3328 perheestä ja 22 441 henkilöstä. Syövän perimäosuuden suhteellista ilmaantuvuutta mallinnettiin hierarkkisella bayesiläisellä Poisson-regressio sekamallilla, jossa sairastumisalttiuden vaihtelu jaettiin ympäristön, perimän ja ylihajonnan komponentteihin. Parametrien yhteisposteriorijakaumaa arvioitiin MCMC-otannan avulla JAGS-ohjelmalla. Lisäksi syöpien kertymistä tarkasteltiin estimoimalla sukulaisuussuhteiden mukaan ositettuja suhteellisia syöpäilmaantuvuuksia. Simulaatiotutkimuksella arvioitiin tilastollisen mallin satunnaiskomponenttien estimoituminen ja tarkasteltiin harhan korjauksen vaikutusta tutkimusasetelmaan. Rintasyöpäaineistossa nuorten syöpäpotilaiden perheenjäsenillä havaittiin 739 syöpää ja perheenjäsenten keskimääräinen syöpäriski oli 81% (95%:n todennäköisyysväli 68-94%) suurempi kuin vastaavalla väestöllä. Rintasyövän perimäosuus oli 26% (0-57%). Ruoansulatuselinten syöpiä havaittiin perheenjäsenillä 574 ja perheenjäsenten syöpäriski oli 60% (48-73%) suurempi kuin väestöllä ja sen perimäosuudeksi estimoitiin 63% (37-88%). Tutkielman tulosten mukaan ympäristötekijöiden merkitys rintasyöpäaineistossa on suuri. Vastaavasti ruoansulatuselinten syövissä ympäristötekijöiden merkitys on pienempi ja perimän osuus selvästi suurempi.
Resumo:
The main objective of this PhD was to further develop Bayesian spatio-temporal models (specifically the Conditional Autoregressive (CAR) class of models), for the analysis of sparse disease outcomes such as birth defects. The motivation for the thesis arose from problems encountered when analyzing a large birth defect registry in New South Wales. The specific components and related research objectives of the thesis were developed from gaps in the literature on current formulations of the CAR model, and health service planning requirements. Data from a large probabilistically-linked database from 1990 to 2004, consisting of fields from two separate registries: the Birth Defect Registry (BDR) and Midwives Data Collection (MDC) were used in the analyses in this thesis. The main objective was split into smaller goals. The first goal was to determine how the specification of the neighbourhood weight matrix will affect the smoothing properties of the CAR model, and this is the focus of chapter 6. Secondly, I hoped to evaluate the usefulness of incorporating a zero-inflated Poisson (ZIP) component as well as a shared-component model in terms of modeling a sparse outcome, and this is carried out in chapter 7. The third goal was to identify optimal sampling and sample size schemes designed to select individual level data for a hybrid ecological spatial model, and this is done in chapter 8. Finally, I wanted to put together the earlier improvements to the CAR model, and along with demographic projections, provide forecasts for birth defects at the SLA level. Chapter 9 describes how this is done. For the first objective, I examined a series of neighbourhood weight matrices, and showed how smoothing the relative risk estimates according to similarity by an important covariate (i.e. maternal age) helped improve the model’s ability to recover the underlying risk, as compared to the traditional adjacency (specifically the Queen) method of applying weights. Next, to address the sparseness and excess zeros commonly encountered in the analysis of rare outcomes such as birth defects, I compared a few models, including an extension of the usual Poisson model to encompass excess zeros in the data. This was achieved via a mixture model, which also encompassed the shared component model to improve on the estimation of sparse counts through borrowing strength across a shared component (e.g. latent risk factor/s) with the referent outcome (caesarean section was used in this example). Using the Deviance Information Criteria (DIC), I showed how the proposed model performed better than the usual models, but only when both outcomes shared a strong spatial correlation. The next objective involved identifying the optimal sampling and sample size strategy for incorporating individual-level data with areal covariates in a hybrid study design. I performed extensive simulation studies, evaluating thirteen different sampling schemes along with variations in sample size. This was done in the context of an ecological regression model that incorporated spatial correlation in the outcomes, as well as accommodating both individual and areal measures of covariates. Using the Average Mean Squared Error (AMSE), I showed how a simple random sample of 20% of the SLAs, followed by selecting all cases in the SLAs chosen, along with an equal number of controls, provided the lowest AMSE. The final objective involved combining the improved spatio-temporal CAR model with population (i.e. women) forecasts, to provide 30-year annual estimates of birth defects at the Statistical Local Area (SLA) level in New South Wales, Australia. The projections were illustrated using sixteen different SLAs, representing the various areal measures of socio-economic status and remoteness. A sensitivity analysis of the assumptions used in the projection was also undertaken. By the end of the thesis, I will show how challenges in the spatial analysis of rare diseases such as birth defects can be addressed, by specifically formulating the neighbourhood weight matrix to smooth according to a key covariate (i.e. maternal age), incorporating a ZIP component to model excess zeros in outcomes and borrowing strength from a referent outcome (i.e. caesarean counts). An efficient strategy to sample individual-level data and sample size considerations for rare disease will also be presented. Finally, projections in birth defect categories at the SLA level will be made.
Resumo:
This thesis addresses computational challenges arising from Bayesian analysis of complex real-world problems. Many of the models and algorithms designed for such analysis are ‘hybrid’ in nature, in that they are a composition of components for which their individual properties may be easily described but the performance of the model or algorithm as a whole is less well understood. The aim of this research project is to after a better understanding of the performance of hybrid models and algorithms. The goal of this thesis is to analyse the computational aspects of hybrid models and hybrid algorithms in the Bayesian context. The first objective of the research focuses on computational aspects of hybrid models, notably a continuous finite mixture of t-distributions. In the mixture model, an inference of interest is the number of components, as this may relate to both the quality of model fit to data and the computational workload. The analysis of t-mixtures using Markov chain Monte Carlo (MCMC) is described and the model is compared to the Normal case based on the goodness of fit. Through simulation studies, it is demonstrated that the t-mixture model can be more flexible and more parsimonious in terms of number of components, particularly for skewed and heavytailed data. The study also reveals important computational issues associated with the use of t-mixtures, which have not been adequately considered in the literature. The second objective of the research focuses on computational aspects of hybrid algorithms for Bayesian analysis. Two approaches will be considered: a formal comparison of the performance of a range of hybrid algorithms and a theoretical investigation of the performance of one of these algorithms in high dimensions. For the first approach, the delayed rejection algorithm, the pinball sampler, the Metropolis adjusted Langevin algorithm, and the hybrid version of the population Monte Carlo (PMC) algorithm are selected as a set of examples of hybrid algorithms. Statistical literature shows how statistical efficiency is often the only criteria for an efficient algorithm. In this thesis the algorithms are also considered and compared from a more practical perspective. This extends to the study of how individual algorithms contribute to the overall efficiency of hybrid algorithms, and highlights weaknesses that may be introduced by the combination process of these components in a single algorithm. The second approach to considering computational aspects of hybrid algorithms involves an investigation of the performance of the PMC in high dimensions. It is well known that as a model becomes more complex, computation may become increasingly difficult in real time. In particular the importance sampling based algorithms, including the PMC, are known to be unstable in high dimensions. This thesis examines the PMC algorithm in a simplified setting, a single step of the general sampling, and explores a fundamental problem that occurs in applying importance sampling to a high-dimensional problem. The precision of the computed estimate from the simplified setting is measured by the asymptotic variance of the estimate under conditions on the importance function. Additionally, the exponential growth of the asymptotic variance with the dimension is demonstrated and we illustrates that the optimal covariance matrix for the importance function can be estimated in a special case.
Resumo:
In this thesis, the issue of incorporating uncertainty for environmental modelling informed by imagery is explored by considering uncertainty in deterministic modelling, measurement uncertainty and uncertainty in image composition. Incorporating uncertainty in deterministic modelling is extended for use with imagery using the Bayesian melding approach. In the application presented, slope steepness is shown to be the main contributor to total uncertainty in the Revised Universal Soil Loss Equation. A spatial sampling procedure is also proposed to assist in implementing Bayesian melding given the increased data size with models informed by imagery. Measurement error models are another approach to incorporating uncertainty when data is informed by imagery. These models for measurement uncertainty, considered in a Bayesian conditional independence framework, are applied to ecological data generated from imagery. The models are shown to be appropriate and useful in certain situations. Measurement uncertainty is also considered in the context of change detection when two images are not co-registered. An approach for detecting change in two successive images is proposed that is not affected by registration. The procedure uses the Kolmogorov-Smirnov test on homogeneous segments of an image to detect change, with the homogeneous segments determined using a Bayesian mixture model of pixel values. Using the mixture model to segment an image also allows for uncertainty in the composition of an image. This thesis concludes by comparing several different Bayesian image segmentation approaches that allow for uncertainty regarding the allocation of pixels to different ground components. Each segmentation approach is applied to a data set of chlorophyll values and shown to have different benefits and drawbacks depending on the aims of the analysis.
Resumo:
This dissertation is primarily an applied statistical modelling investigation, motivated by a case study comprising real data and real questions. Theoretical questions on modelling and computation of normalization constants arose from pursuit of these data analytic questions. The essence of the thesis can be described as follows. Consider binary data observed on a two-dimensional lattice. A common problem with such data is the ambiguity of zeroes recorded. These may represent zero response given some threshold (presence) or that the threshold has not been triggered (absence). Suppose that the researcher wishes to estimate the effects of covariates on the binary responses, whilst taking into account underlying spatial variation, which is itself of some interest. This situation arises in many contexts and the dingo, cypress and toad case studies described in the motivation chapter are examples of this. Two main approaches to modelling and inference are investigated in this thesis. The first is frequentist and based on generalized linear models, with spatial variation modelled by using a block structure or by smoothing the residuals spatially. The EM algorithm can be used to obtain point estimates, coupled with bootstrapping or asymptotic MLE estimates for standard errors. The second approach is Bayesian and based on a three- or four-tier hierarchical model, comprising a logistic regression with covariates for the data layer, a binary Markov Random field (MRF) for the underlying spatial process, and suitable priors for parameters in these main models. The three-parameter autologistic model is a particular MRF of interest. Markov chain Monte Carlo (MCMC) methods comprising hybrid Metropolis/Gibbs samplers is suitable for computation in this situation. Model performance can be gauged by MCMC diagnostics. Model choice can be assessed by incorporating another tier in the modelling hierarchy. This requires evaluation of a normalization constant, a notoriously difficult problem. Difficulty with estimating the normalization constant for the MRF can be overcome by using a path integral approach, although this is a highly computationally intensive method. Different methods of estimating ratios of normalization constants (N Cs) are investigated, including importance sampling Monte Carlo (ISMC), dependent Monte Carlo based on MCMC simulations (MCMC), and reverse logistic regression (RLR). I develop an idea present though not fully developed in the literature, and propose the Integrated mean canonical statistic (IMCS) method for estimating log NC ratios for binary MRFs. The IMCS method falls within the framework of the newly identified path sampling methods of Gelman & Meng (1998) and outperforms ISMC, MCMC and RLR. It also does not rely on simplifying assumptions, such as ignoring spatio-temporal dependence in the process. A thorough investigation is made of the application of IMCS to the three-parameter Autologistic model. This work introduces background computations required for the full implementation of the four-tier model in Chapter 7. Two different extensions of the three-tier model to a four-tier version are investigated. The first extension incorporates temporal dependence in the underlying spatio-temporal process. The second extensions allows the successes and failures in the data layer to depend on time. The MCMC computational method is extended to incorporate the extra layer. A major contribution of the thesis is the development of a fully Bayesian approach to inference for these hierarchical models for the first time. Note: The author of this thesis has agreed to make it open access but invites people downloading the thesis to send her an email via the 'Contact Author' function.
Resumo:
A statistical modeling method to accurately determine combustion chamber resonance is proposed and demonstrated. This method utilises Markov-chain Monte Carlo (MCMC) through the use of the Metropolis-Hastings (MH) algorithm to yield a probability density function for the combustion chamber frequency and find the best estimate of the resonant frequency, along with uncertainty. The accurate determination of combustion chamber resonance is then used to investigate various engine phenomena, with appropriate uncertainty, for a range of engine cycles. It is shown that, when operating on various ethanol/diesel fuel combinations, a 20% substitution yields the least amount of inter-cycle variability, in relation to combustion chamber resonance.
Resumo:
Statistical modeling of traffic crashes has been of interest to researchers for decades. Over the most recent decade many crash models have accounted for extra-variation in crash counts—variation over and above that accounted for by the Poisson density. The extra-variation – or dispersion – is theorized to capture unaccounted for variation in crashes across sites. The majority of studies have assumed fixed dispersion parameters in over-dispersed crash models—tantamount to assuming that unaccounted for variation is proportional to the expected crash count. Miaou and Lord [Miaou, S.P., Lord, D., 2003. Modeling traffic crash-flow relationships for intersections: dispersion parameter, functional form, and Bayes versus empirical Bayes methods. Transport. Res. Rec. 1840, 31–40] challenged the fixed dispersion parameter assumption, and examined various dispersion parameter relationships when modeling urban signalized intersection accidents in Toronto. They suggested that further work is needed to determine the appropriateness of the findings for rural as well as other intersection types, to corroborate their findings, and to explore alternative dispersion functions. This study builds upon the work of Miaou and Lord, with exploration of additional dispersion functions, the use of an independent data set, and presents an opportunity to corroborate their findings. Data from Georgia are used in this study. A Bayesian modeling approach with non-informative priors is adopted, using sampling-based estimation via Markov Chain Monte Carlo (MCMC) and the Gibbs sampler. A total of eight model specifications were developed; four of them employed traffic flows as explanatory factors in mean structure while the remainder of them included geometric factors in addition to major and minor road traffic flows. The models were compared and contrasted using the significance of coefficients, standard deviance, chi-square goodness-of-fit, and deviance information criteria (DIC) statistics. The findings indicate that the modeling of the dispersion parameter, which essentially explains the extra-variance structure, depends greatly on how the mean structure is modeled. In the presence of a well-defined mean function, the extra-variance structure generally becomes insignificant, i.e. the variance structure is a simple function of the mean. It appears that extra-variation is a function of covariates when the mean structure (expected crash count) is poorly specified and suffers from omitted variables. In contrast, when sufficient explanatory variables are used to model the mean (expected crash count), extra-Poisson variation is not significantly related to these variables. If these results are generalizable, they suggest that model specification may be improved by testing extra-variation functions for significance. They also suggest that known influences of expected crash counts are likely to be different than factors that might help to explain unaccounted for variation in crashes across sites
Resumo:
Modern statistical models and computational methods can now incorporate uncertainty of the parameters used in Quantitative Microbial Risk Assessments (QMRA). Many QMRAs use Monte Carlo methods, but work from fixed estimates for means, variances and other parameters. We illustrate the ease of estimating all parameters contemporaneously with the risk assessment, incorporating all the parameter uncertainty arising from the experiments from which these parameters are estimated. A Bayesian approach is adopted, using Markov Chain Monte Carlo Gibbs sampling (MCMC) via the freely available software, WinBUGS. The method and its ease of implementation are illustrated by a case study that involves incorporating three disparate datasets into an MCMC framework. The probabilities of infection when the uncertainty associated with parameter estimation is incorporated into a QMRA are shown to be considerably more variable over various dose ranges than the analogous probabilities obtained when constants from the literature are simply ‘plugged’ in as is done in most QMRAs. Neglecting these sources of uncertainty may lead to erroneous decisions for public health and risk management.
Resumo:
This article explores the use of probabilistic classification, namely finite mixture modelling, for identification of complex disease phenotypes, given cross-sectional data. In particular, if focuses on posterior probabilities of subgroup membership, a standard output of finite mixture modelling, and how the quantification of uncertainty in these probabilities can lead to more detailed analyses. Using a Bayesian approach, we describe two practical uses of this uncertainty: (i) as a means of describing a person’s membership to a single or multiple latent subgroups and (ii) as a means of describing identified subgroups by patient-centred covariates not included in model estimation. These proposed uses are demonstrated on a case study in Parkinson’s disease (PD), where latent subgroups are identified using multiple symptoms from the Unified Parkinson’s Disease Rating Scale (UPDRS).