901 results for Asymptotic behaviour, Bayesian methods, Mixture models, Overfitting, Posterior concentration
Abstract:
This thesis is concerned with approximate inference in dynamical systems, from a variational Bayesian perspective. When modelling real world dynamical systems, stochastic differential equations appear as a natural choice, mainly because of their ability to model the noise of the system by adding a variant of some stochastic process to the deterministic dynamics. Hence, inference in such processes has drawn much attention. Here two new extended frameworks are derived and presented that are based on basis function expansions and local polynomial approximations of a recently proposed variational Bayesian algorithm. It is shown that the new extensions converge to the original variational algorithm and can be used for state estimation (smoothing). However, the main focus is on estimating the (hyper-) parameters of these systems (i.e. drift parameters and diffusion coefficients). The new methods are numerically validated on a range of different systems which vary in dimensionality and non-linearity. These are the Ornstein-Uhlenbeck process, for which the exact likelihood can be computed analytically, the univariate and highly non-linear stochastic double well, and the multivariate chaotic stochastic Lorenz '63 (3-dimensional model). The algorithms are also applied to the 40-dimensional stochastic Lorenz '96 system. In this investigation these new approaches are compared with a variety of other well known methods such as the ensemble Kalman filter / smoother, a hybrid Monte Carlo sampler, the dual unscented Kalman filter (for jointly estimating the system's states and model parameters) and full weak-constraint 4D-Var. An empirical analysis of their asymptotic behaviour as the observation density or the length of the time window increases is provided.
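The remark that the Ornstein-Uhlenbeck likelihood is available in closed form can be illustrated with a minimal sketch. It assumes the zero-mean parametrisation dX = -theta X dt + sigma dW and equally spaced, noise-free observations; the abstract does not state the parametrisation used in the thesis, so the function name and arguments are illustrative only.

```python
import math

def ou_log_likelihood(xs, dt, theta, sigma):
    """Exact log-likelihood of xs[1:] given xs[0] for equally spaced observations
    of dX = -theta * X dt + sigma dW (assumed form; theta > 0, sigma > 0)."""
    decay = math.exp(-theta * dt)                          # mean reversion over one step
    var = sigma ** 2 * (1.0 - decay ** 2) / (2.0 * theta)  # exact transition variance
    total = 0.0
    for prev, curr in zip(xs, xs[1:]):
        resid = curr - prev * decay                        # Gaussian transition residual
        total += -0.5 * (math.log(2.0 * math.pi * var) + resid ** 2 / var)
    return total
```

Because each transition density is Gaussian, drift and diffusion parameters can be estimated by maximising this function directly, which is what makes the process a convenient benchmark for approximate methods.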
Abstract:
This work is concerned with approximate inference in dynamical systems, from a variational Bayesian perspective. When modelling real world dynamical systems, stochastic differential equations appear as a natural choice, mainly because of their ability to model the noise of the system by adding a variation of some stochastic process to the deterministic dynamics. Hence, inference in such processes has drawn much attention. Here a new extended framework is derived that is based on a local polynomial approximation of a recently proposed variational Bayesian algorithm. The paper begins by showing that the new extension of this variational algorithm can be used for state estimation (smoothing) and converges to the original algorithm. However, the main focus is on estimating the (hyper-) parameters of these systems (i.e. drift parameters and diffusion coefficients). The new approach is validated on a range of different systems which vary in dimensionality and non-linearity. These are the Ornstein–Uhlenbeck process, the exact likelihood of which can be computed analytically, the univariate and highly non-linear stochastic double well, and the multivariate chaotic stochastic Lorenz ’63 (3D model). As a special case the algorithm is also applied to the 40-dimensional stochastic Lorenz ’96 system. In our investigation we compare this new approach with a variety of other well known methods, such as the hybrid Monte Carlo, dual unscented Kalman filter and full weak-constraint 4D-Var algorithms, and analyse empirically their asymptotic behaviour as the observation density or the length of the time window increases. In particular we show that we are able to estimate parameters in both the drift (deterministic) and the diffusion (stochastic) part of the model evolution equations using our new methods.
Abstract:
Surveys can collect important data that inform policy decisions and drive social science research. Large government surveys collect information from the U.S. population on a wide range of topics, including demographics, education, employment, and lifestyle. Analysis of survey data presents unique challenges. In particular, one needs to account for missing data, for complex sampling designs, and for measurement error. Conceptually, a survey organization could spend lots of resources getting high-quality responses from a simple random sample, resulting in survey data that are easy to analyze. However, this scenario often is not realistic. To address these practical issues, survey organizations can leverage the information available from other sources of data. For example, in longitudinal studies that suffer from attrition, they can use the information from refreshment samples to correct for potential attrition bias. They can use information from known marginal distributions or survey design to improve inferences. They can use information from gold standard sources to correct for measurement error.
This thesis presents novel approaches to combining information from multiple sources that address the three problems described above.
The first method addresses nonignorable unit nonresponse and attrition in a panel survey with a refreshment sample. Panel surveys typically suffer from attrition, which can lead to biased inference when basing analysis only on cases that complete all waves of the panel. Unfortunately, the panel data alone cannot inform the extent of the bias due to attrition, so analysts must make strong and untestable assumptions about the missing data mechanism. Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during some later wave of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by nonignorable attrition while reducing reliance on strong assumptions about the attrition process. To date, these bias correction methods have not dealt with two key practical issues in panel studies: unit nonresponse in the initial wave of the panel and in the refreshment sample itself. As we illustrate, nonignorable unit nonresponse can significantly compromise the analyst's ability to use the refreshment samples for attrition bias correction. Thus, it is crucial for analysts to assess how sensitive their inferences (corrected for panel attrition) are to different assumptions about the nature of the unit nonresponse. We present an approach that facilitates such sensitivity analyses, both for suspected nonignorable unit nonresponse in the initial wave and in the refreshment sample. We illustrate the approach using simulation studies and an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study.
The second method incorporates informative prior beliefs about marginal probabilities into Bayesian latent class models for categorical data. The basic idea is to append synthetic observations to the original data such that (i) the empirical distributions of the desired margins match those of the prior beliefs, and (ii) the values of the remaining variables are left missing. The degree of prior uncertainty is controlled by the number of augmented records. Posterior inferences can be obtained via typical MCMC algorithms for latent class models, tailored to deal efficiently with the missing values in the concatenated data. We illustrate the approach using a variety of simulations based on data from the American Community Survey, including an example of how augmented records can be used to fit latent class models to data from stratified samples.
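The record-augmentation idea in the second method can be sketched as follows. This is only an illustration of steps (i) and (ii) for a single margin; the function name, the dictionary-based record layout and the use of `None` for missing values are assumptions, not details taken from the thesis.

```python
def augment_with_margin(records, variable, prior_probs, n_aug):
    """Append n_aug synthetic records whose `variable` margin matches prior_probs.
    All other fields are left missing (None). Larger n_aug means a stronger prior."""
    fields = list(records[0].keys())
    counts = {level: round(p * n_aug) for level, p in prior_probs.items()}
    if sum(counts.values()) != n_aug:
        raise ValueError("n_aug and prior_probs do not give whole-number counts")
    synthetic = []
    for level, count in counts.items():
        for _ in range(count):
            row = {f: None for f in fields}   # (ii) remaining variables are missing
            row[variable] = level             # (i) margin matches the prior belief
            synthetic.append(row)
    return records + synthetic
```

For example, `augment_with_margin(data, "sex", {"F": 0.6, "M": 0.4}, 10)` would append six records with `sex="F"` and four with `sex="M"`; the concatenated data would then be passed to whatever latent class sampler is in use.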
The third method leverages the information from a gold standard survey to model reporting error. Survey data are subject to reporting error when respondents misunderstand the question or accidentally select the wrong response. Sometimes survey respondents knowingly select the wrong response, for example, by reporting a higher level of education than they actually have attained. We present an approach that allows an analyst to model reporting error by incorporating information from a gold standard survey. The analyst can specify various reporting error models and assess how sensitive their conclusions are to different assumptions about the reporting error process. We illustrate the approach using simulations based on data from the 1993 National Survey of College Graduates. We use the method to impute error-corrected educational attainments in the 2010 American Community Survey using the 2010 National Survey of College Graduates as the gold standard survey.
Abstract:
This thesis presents quantitative studies of T cell and dendritic cell (DC) behaviour in mouse lymph nodes (LNs) in the naive state and following immunisation. These processes are of importance and interest in basic immunology, and better understanding could improve both diagnostic capacity and therapeutic manipulations, potentially helping in producing more effective vaccines or developing treatments for autoimmune diseases. The problem is also interesting conceptually as it is relevant to other fields where 3D movement of objects is tracked with a discrete scanning interval. A general immunology introduction is presented in chapter 1. In chapter 2, I apply quantitative methods to multi-photon imaging data to measure how T cells and DCs are spatially arranged in LNs. This has been previously studied to describe differences between the naive and immunised state and as an indicator of the magnitude of the immune response in LNs, but previous analyses have been generally descriptive. The quantitative analysis shows that some of the previous conclusions may have been premature. In chapter 3, I use Bayesian state-space models to test some hypotheses about the mode of T cell search for DCs. A two-state mode of movement where T cells can be classified as either interacting with a DC or freely migrating is supported over a model where T cells would home in on DCs at a distance through, for example, the action of chemokines. In chapter 4, I study whether T cell migration is linked to the geometric structure of the fibroblast reticular cell (FRC) network. I find support for the hypothesis that the movement is constrained to the FRC network over an alternative 'random walk with persistence time' model where cells would move randomly, with a short-term persistence driven by a hypothetical T cell intrinsic 'clock'. I also present unexpected results on the FRC network geometry.
Finally, a quantitative method is presented for addressing some measurement biases inherent to multi-photon imaging. In all three chapters, novel findings are made, and the methods developed have the potential for further use to address important problems in the field. In chapter 5, I present a summary and synthesis of results from chapters 3-4 and a more speculative discussion of these results and potential future directions.
Abstract:
Objective: We aimed to predict sub-national spatial variation in numbers of people infected with Schistosoma haematobium, and associated uncertainties, in Burkina Faso, Mali and Niger, prior to implementation of national control programmes. Methods: We used national field survey datasets covering a contiguous area of 2,750 × 850 km, from 26,790 school-aged children (5–14 years) in 418 schools. Bayesian geostatistical models were used to predict prevalence of high and low intensity infections and associated 95% credible intervals (CrI). Numbers infected were determined by multiplying predicted prevalence by numbers of school-aged children in 1 km² pixels covering the study area. Findings: Numbers of school-aged children with low-intensity infections were: 433,268 in Burkina Faso, 872,328 in Mali and 580,286 in Niger. Numbers with high-intensity infections were: 416,009 in Burkina Faso, 511,845 in Mali and 254,150 in Niger. 95% CrIs (indicative of uncertainty) were wide; e.g. the mean number of boys aged 10–14 years infected in Mali was 140,200 (95% CrI 6200, 512,100). Conclusion: National aggregate estimates for numbers infected mask important local variation, e.g. most S. haematobium infections in Niger occur in the Niger River valley. Prevalence of high-intensity infections was strongly clustered in foci in western and central Mali, north-eastern and north-western Burkina Faso and the Niger River valley in Niger. Populations in these foci are likely to carry the bulk of the urinary schistosomiasis burden and should receive priority for schistosomiasis control. Uncertainties in predicted prevalence and numbers infected should be acknowledged and taken into consideration by control programme planners.
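The step from predicted prevalence to numbers infected is simple arithmetic, and a small sketch shows how uncertainty can be carried through it. The sketch assumes that posterior draws of pixel-level prevalence are available as plain lists; the function name, data layout and nearest-rank quantile rule are illustrative choices, not the study's implementation.

```python
import math

def numbers_infected(prevalence_draws, population):
    """prevalence_draws[d][i]: posterior draw d of prevalence in pixel i.
    population[i]: number of school-aged children in pixel i.
    Returns (posterior mean, 2.5% quantile, 97.5% quantile) of the total infected."""
    totals = sorted(
        sum(p * n for p, n in zip(draw, population)) for draw in prevalence_draws
    )

    def quantile(q):
        # Nearest-rank quantile on the sorted totals.
        idx = min(len(totals) - 1, max(0, math.ceil(q * len(totals)) - 1))
        return totals[idx]

    mean = sum(totals) / len(totals)
    return mean, quantile(0.025), quantile(0.975)
```

Summing within each draw before taking quantiles keeps the spatial correlation between pixels in the interval for the total, which is one reason aggregate credible intervals can be as wide as those reported.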
Abstract:
The main objective of this PhD was to further develop Bayesian spatio-temporal models (specifically the Conditional Autoregressive (CAR) class of models), for the analysis of sparse disease outcomes such as birth defects. The motivation for the thesis arose from problems encountered when analyzing a large birth defect registry in New South Wales. The specific components and related research objectives of the thesis were developed from gaps in the literature on current formulations of the CAR model, and health service planning requirements. Data from a large probabilistically-linked database from 1990 to 2004, consisting of fields from two separate registries: the Birth Defect Registry (BDR) and Midwives Data Collection (MDC) were used in the analyses in this thesis. The main objective was split into smaller goals. The first goal was to determine how the specification of the neighbourhood weight matrix will affect the smoothing properties of the CAR model, and this is the focus of chapter 6. Secondly, I hoped to evaluate the usefulness of incorporating a zero-inflated Poisson (ZIP) component as well as a shared-component model in terms of modeling a sparse outcome, and this is carried out in chapter 7. The third goal was to identify optimal sampling and sample size schemes designed to select individual level data for a hybrid ecological spatial model, and this is done in chapter 8. Finally, I wanted to put together the earlier improvements to the CAR model, and along with demographic projections, provide forecasts for birth defects at the SLA level. Chapter 9 describes how this is done. For the first objective, I examined a series of neighbourhood weight matrices, and showed how smoothing the relative risk estimates according to similarity by an important covariate (i.e. maternal age) helped improve the model’s ability to recover the underlying risk, as compared to the traditional adjacency (specifically the Queen) method of applying weights. 
Next, to address the sparseness and excess zeros commonly encountered in the analysis of rare outcomes such as birth defects, I compared a few models, including an extension of the usual Poisson model to encompass excess zeros in the data. This was achieved via a mixture model, which also encompassed the shared component model to improve on the estimation of sparse counts through borrowing strength across a shared component (e.g. latent risk factor/s) with the referent outcome (caesarean section was used in this example). Using the Deviance Information Criterion (DIC), I showed how the proposed model performed better than the usual models, but only when both outcomes shared a strong spatial correlation. The next objective involved identifying the optimal sampling and sample size strategy for incorporating individual-level data with areal covariates in a hybrid study design. I performed extensive simulation studies, evaluating thirteen different sampling schemes along with variations in sample size. This was done in the context of an ecological regression model that incorporated spatial correlation in the outcomes, as well as accommodating both individual and areal measures of covariates. Using the Average Mean Squared Error (AMSE), I showed how a simple random sample of 20% of the SLAs, followed by selecting all cases in the SLAs chosen, along with an equal number of controls, provided the lowest AMSE. The final objective involved combining the improved spatio-temporal CAR model with population (i.e. women) forecasts, to provide 30-year annual estimates of birth defects at the Statistical Local Area (SLA) level in New South Wales, Australia. The projections were illustrated using sixteen different SLAs, representing the various areal measures of socio-economic status and remoteness. A sensitivity analysis of the assumptions used in the projection was also undertaken.
By the end of the thesis, I will show how challenges in the spatial analysis of rare diseases such as birth defects can be addressed, by specifically formulating the neighbourhood weight matrix to smooth according to a key covariate (i.e. maternal age), incorporating a ZIP component to model excess zeros in outcomes and borrowing strength from a referent outcome (i.e. caesarean counts). An efficient strategy to sample individual-level data and sample size considerations for rare disease will also be presented. Finally, projections in birth defect categories at the SLA level will be made.
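The zero-inflated Poisson (ZIP) component mentioned above is a two-part mixture, and its probability mass function is short enough to write out. The sketch below uses the standard ZIP definition with a structural-zero probability `pi` and a Poisson mean `lam`; how the thesis links these to covariates or to the CAR random effects is not stated in the abstract and is not shown here.

```python
import math

def zip_pmf(k, lam, pi):
    """P(Y = k) under a zero-inflated Poisson: with probability pi the count is a
    structural zero, otherwise it is drawn from Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi * (k == 0) + (1.0 - pi) * poisson
```

With `pi = 0` the function reduces to the ordinary Poisson mass function, which makes the ZIP model a direct extension of the usual Poisson model for sparse counts.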
Abstract:
In this thesis, the issue of incorporating uncertainty for environmental modelling informed by imagery is explored by considering uncertainty in deterministic modelling, measurement uncertainty and uncertainty in image composition. Incorporating uncertainty in deterministic modelling is extended for use with imagery using the Bayesian melding approach. In the application presented, slope steepness is shown to be the main contributor to total uncertainty in the Revised Universal Soil Loss Equation. A spatial sampling procedure is also proposed to assist in implementing Bayesian melding given the increased data size with models informed by imagery. Measurement error models are another approach to incorporating uncertainty when data is informed by imagery. These models for measurement uncertainty, considered in a Bayesian conditional independence framework, are applied to ecological data generated from imagery. The models are shown to be appropriate and useful in certain situations. Measurement uncertainty is also considered in the context of change detection when two images are not co-registered. An approach for detecting change in two successive images is proposed that is not affected by registration. The procedure uses the Kolmogorov-Smirnov test on homogeneous segments of an image to detect change, with the homogeneous segments determined using a Bayesian mixture model of pixel values. Using the mixture model to segment an image also allows for uncertainty in the composition of an image. This thesis concludes by comparing several different Bayesian image segmentation approaches that allow for uncertainty regarding the allocation of pixels to different ground components. Each segmentation approach is applied to a data set of chlorophyll values and shown to have different benefits and drawbacks depending on the aims of the analysis.
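The change-detection step compares pixel values from the same homogeneous segment in two images using the Kolmogorov-Smirnov test. A minimal sketch of the two-sample statistic is given below, together with the usual large-sample critical value. The function names are hypothetical, the segmentation itself (the Bayesian mixture model) is not shown, and the thesis may use a different decision rule.

```python
import math

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:   # advance past all values <= x in a
            i += 1
        while j < len(b) and b[j] <= x:   # advance past all values <= x in b
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def change_detected(segment_t1, segment_t2, alpha=0.05):
    """Flag change when the statistic exceeds the asymptotic critical value."""
    n, m = len(segment_t1), len(segment_t2)
    critical = math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(segment_t1, segment_t2) > critical
```

Because the test compares distributions of values within a segment rather than pixel-by-pixel differences, small misalignments between the two images have little effect on the result.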
Abstract:
Longitudinal data, where data are repeatedly observed or measured over time or age, provide the foundation for the analysis of processes which evolve over time, and these can be referred to as growth or trajectory models. One of the traditional ways of looking at growth models is to employ either linear or polynomial functional forms to model trajectory shape, and account for variation around an overall mean trend with the inclusion of random effects or individual variation on the functional shape parameters. The identification of distinct subgroups or sub-classes (latent classes) within these trajectory models which are not based on some pre-existing individual classification provides an important methodology with substantive implications. The identification of subgroups or classes has a wide application in the medical arena where responder/non-responder identification based on distinctly differing trajectories delivers further information for clinical processes. This thesis develops Bayesian statistical models and techniques for the identification of subgroups in the analysis of longitudinal data where the number of time intervals is limited. These models are then applied to a single case study which investigates the neuropsychological cognition for early stage breast cancer patients undergoing adjuvant chemotherapy treatment from the Cognition in Breast Cancer Study undertaken by the Wesley Research Institute of Brisbane, Queensland. Alternative formulations to the linear or polynomial approach are taken which use piecewise linear models with a single turning point, change-point or knot at a known time point and latent basis models for the non-linear trajectories found for the verbal memory domain of cognitive function before and after chemotherapy treatment.
Hierarchical Bayesian random effects models are used as a starting point for the latent class modelling process and are extended with the incorporation of covariates in the trajectory profiles and as predictors of class membership. The Bayesian latent basis models enable the degree of recovery post-chemotherapy to be estimated for short and long-term follow-up occasions, and the distinct class trajectories assist in the identification of breast cancer patients who may be at risk of long-term verbal memory impairment.
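A piecewise linear trajectory with a single knot at a known time point can be written as one line of arithmetic. The sketch below shows the mean curve only, under an assumed parametrisation (intercept, slope before the knot, change in slope after it). The random effects, covariates and latent class structure described in the abstract are not represented.

```python
def piecewise_linear(t, intercept, slope_before, slope_change, knot):
    """Mean trajectory with one change in slope at a known knot time.
    After the knot the slope is slope_before + slope_change."""
    return intercept + slope_before * t + slope_change * max(0.0, t - knot)
```

In a latent class setting, each class would have its own set of these coefficients, so that, for example, one class could decline and then recover after the knot while another keeps declining.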
Abstract:
This paper describes the formalization and application of a methodology to evaluate the safety benefit of countermeasures in the face of uncertainty. To illustrate the methodology, 18 countermeasures for improving safety of at-grade railroad crossings (AGRXs) in the Republic of Korea are considered. Akin to “stated preference” methods in travel survey research, the methodology applies random selection and laws of large numbers to derive accident modification factor (AMF) densities from expert opinions. In a full Bayesian analysis framework, the collective opinions in the form of AMF densities (data likelihood) are combined with prior knowledge (AMF density priors) for the 18 countermeasures to obtain ‘best’ estimates of AMFs (AMF posterior credible intervals). The countermeasures are then compared and recommended based on the largest safety returns with minimum risk (uncertainty). To the author's knowledge the complete methodology is new and has not previously been applied or reported in the literature. The results demonstrate that the methodology is able to discern anticipated safety benefit differences across candidate countermeasures. For the 18 at-grade railroad crossings considered in this analysis, it was found that the top three performing countermeasures for reducing crashes are in-vehicle warning systems, obstacle detection systems, and constant warning time systems.
Abstract:
We present a novel approach for developing summary statistics for use in approximate Bayesian computation (ABC) algorithms by using indirect inference. ABC methods are useful for posterior inference in the presence of an intractable likelihood function. In the indirect inference approach to ABC the parameters of an auxiliary model fitted to the data become the summary statistics. Although applicable to any ABC technique, we embed this approach within a sequential Monte Carlo algorithm that is completely adaptive and requires very little tuning. This methodological development was motivated by an application involving data on macroparasite population evolution modelled by a trivariate stochastic process for which there is no tractable likelihood function. The auxiliary model here is based on a beta–binomial distribution. The main objective of the analysis is to determine which parameters of the stochastic model are estimable from the observed data on mature parasite worms.
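The core idea, using the fitted parameters of an auxiliary model as ABC summary statistics, can be sketched with a deliberately simplified setup. The code below swaps in plain rejection ABC for the sequential Monte Carlo algorithm, a normal auxiliary model (mean and standard deviation) for the beta-binomial one, and a toy binomial simulator for the macroparasite model. All names are hypothetical.

```python
import random
import statistics

def auxiliary_fit(data):
    """Summary statistics: fitted parameters (mean, sd) of a normal auxiliary model."""
    return statistics.fmean(data), statistics.pstdev(data)

def abc_rejection(observed, simulate, prior_sample, n_draws, tolerance, rng):
    """Keep prior draws whose simulated data give auxiliary estimates near the observed ones."""
    s_obs = auxiliary_fit(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        s_sim = auxiliary_fit(simulate(theta, len(observed), rng))
        distance = sum((a - b) ** 2 for a, b in zip(s_obs, s_sim)) ** 0.5
        if distance <= tolerance:
            accepted.append(theta)
    return accepted

# Toy stand-in simulator (hypothetical): counts out of 20 trials with success probability theta.
def simulate_counts(theta, n, rng):
    return [sum(rng.random() < theta for _ in range(20)) for _ in range(n)]

rng = random.Random(1)
observed = simulate_counts(0.3, 50, rng)  # pretend these are the real data
accepted = abc_rejection(observed, simulate_counts, lambda r: r.random(), 2000, 0.5, rng)
```

The accepted values approximate the posterior for `theta` without ever evaluating the likelihood of the generative model; only the cheap auxiliary fit is needed for each simulated data set.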
Abstract:
Sequence data often have competing signals that are detected by network programs or Lento plots. Such data can be formed by generating sequences on more than one tree and combining the results, which yields a mixture model. We report that with such mixture models, the estimates of edge (branch) lengths from maximum likelihood (ML) methods that assume a single tree are biased. Based on the observed number of competing signals in real data, such a bias of ML is expected to occur frequently. Because network methods can recover competing signals more accurately, there is a need for ML methods allowing a network. A fundamental problem is that mixture models can have more parameters than can be recovered from the data, so that some mixtures are not, in principle, identifiable. We recommend that network programs be incorporated into best practice analysis, along with ML and Bayesian trees.
Abstract:
Motorcycles are overrepresented in road traffic crashes and particularly vulnerable at signalized intersections. The objective of this study is to identify causal factors affecting motorcycle crashes at both four-legged and T signalized intersections. Treating the data as time-series cross-section panels, this study explores different hierarchical Poisson models and finds that the model allowing an autoregressive lag-1 dependence specification in the error term is the most suitable. Results show that the number of lanes at four-legged signalized intersections significantly increases motorcycle crashes, largely because of the higher exposure resulting from higher motorcycle accumulation at the stop line. Furthermore, the presence of a wide median and an uncontrolled left-turn lane at major roadways of four-legged intersections exacerbates this potential hazard. For T signalized intersections, the presence of an exclusive right-turn lane at both major and minor roadways and an uncontrolled left-turn lane at major roadways increases motorcycle crashes. Motorcycle crashes increase on high-speed roadways because motorcyclists are more vulnerable and less likely to react in time during conflicts. The presence of red light cameras reduces motorcycle crashes significantly for both four-legged and T intersections. With a red light camera, motorcycles are less exposed to conflicts because it is observed that they are more disciplined in queuing at the stop line and less likely to jump start at the start of green.
Abstract:
This paper presents a novel framework for the modelling of passenger facilitation in a complex environment. The research is motivated by the challenges in the airport complex system, where there are multiple stakeholders, differing operational objectives and complex interactions and interdependencies between different parts of the airport system. Traditional methods for airport terminal modelling do not explicitly address the need for understanding causal relationships in a dynamic environment. Additionally, existing Bayesian Network (BN) models, which provide a means for capturing causal relationships, only present a static snapshot of a system. A method to integrate a BN complex systems model with stochastic queuing theory is developed based on the properties of the Poisson and exponential distributions. The resultant Hybrid Queue-based Bayesian Network (HQBN) framework enables the simulation of arbitrary factors, their relationships, and their effects on passenger flow and vice versa. A case study implementation of the framework is demonstrated on the inbound passenger facilitation process at Brisbane International Airport. The predicted outputs of the model, in terms of cumulative passenger flow at intermediary and end points in the inbound process, are found to have an $R^2$ goodness of fit of 0.9994 and 0.9982 respectively over a 10-hour test period. The utility of the framework is demonstrated on a number of usage scenarios including real time monitoring and 'what-if' analysis. This framework provides the ability to analyse and simulate a dynamic complex system, and can be applied to other socio-technical systems such as hospitals.
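The link between the Poisson and exponential distributions that the framework relies on can be illustrated with a small simulation: if gaps between arrivals are exponential with a given rate, the number of arrivals in any window is Poisson with mean rate times window length. The sketch below is a generic illustration with hypothetical function names; it does not reproduce the HQBN model or the airport data.

```python
import random

def simulate_arrival_times(rate, horizon, rng):
    """Arrival times on [0, horizon) with exponential inter-arrival gaps (a Poisson process)."""
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def cumulative_flow(times, checkpoints):
    """Number of arrivals at or before each checkpoint time."""
    return [sum(t <= c for t in times) for c in checkpoints]
```

For example, `cumulative_flow(simulate_arrival_times(5.0, 10.0, random.Random(0)), [2.5, 5.0, 7.5, 10.0])` gives a cumulative passenger count curve of the kind that would be compared against observed counts when assessing goodness of fit.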
Abstract:
This paper presents a novel framework for the modelling of passenger facilitation in a complex environment. The research is motivated by the challenges in the airport complex system, where there are multiple stakeholders, differing operational objectives and complex interactions and interdependencies between different parts of the airport system. Traditional methods for airport terminal modelling do not explicitly address the need for understanding causal relationships in a dynamic environment. Additionally, existing Bayesian Network (BN) models, which provide a means for capturing causal relationships, only present a static snapshot of a system. A method to integrate a BN complex systems model with stochastic queuing theory is developed based on the properties of the Poisson and exponential distributions. The resultant Hybrid Queue-based Bayesian Network (HQBN) framework enables the simulation of arbitrary factors, their relationships, and their effects on passenger flow and vice versa. A case study implementation of the framework is demonstrated on the inbound passenger facilitation process at Brisbane International Airport. The predicted outputs of the model, in terms of cumulative passenger flow at intermediary and end points in the inbound process, are found to have an $R^2$ goodness of fit of 0.9994 and 0.9982 respectively over a 10 h test period. The utility of the framework is demonstrated on a number of usage scenarios including causal analysis and ‘what-if’ analysis. This framework provides the ability to analyse and simulate a dynamic complex system, and can be applied to other socio-technical systems such as hospitals.