515 results for Bayesian Modeling Averaging
in Queensland University of Technology - ePrints Archive
Abstract:
Harmful Algal Blooms (HABs) are a worldwide problem that has been increasing in frequency and extent over the past several decades. HABs severely damage aquatic ecosystems by destroying benthic habitat, reducing invertebrate and fish populations and affecting larger species such as dugong that rely on seagrasses for food. Few statistical models for predicting HAB occurrences have been developed, and in common with most predictive models in ecology, those that have been developed do not fully account for uncertainties in parameters and model structure. This makes management decisions based on these predictions more risky than might be supposed. We used a probit time series model and Bayesian Model Averaging (BMA) to predict occurrences of blooms of Lyngbya majuscula, a toxic cyanophyte, in Deception Bay, Queensland, Australia. We found a suite of useful predictors for HAB occurrence, with temperature figuring prominently in the models that held the majority of posterior support; a model consisting of the single covariate average monthly minimum temperature showed by far the greatest posterior support. We compared two model averaging strategies: one using the full posterior distribution, and a simpler one that used the majority of the posterior distribution for predictions but with vastly fewer models. Both BMA approaches showed excellent predictive performance with little difference in their predictive capacity. Applications of BMA are still rare in ecology, particularly in management settings. This study demonstrates the power of BMA as an important management tool that is capable of high predictive performance while fully accounting for both parameter and model uncertainty.
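The contrast between averaging over the full posterior and averaging over a much smaller set of models can be sketched as follows. This is a minimal illustration, not the authors' implementation: the posterior model probabilities, the per-model bloom probabilities, the function names and the 0.9 mass threshold are all hypothetical values chosen for the example.

```python
import numpy as np

def bma_predict(post_probs, preds):
    """Average per-model predictions, weighted by posterior model probability."""
    w = np.asarray(post_probs, dtype=float)
    w = w / w.sum()  # renormalise so the weights sum to one
    return float(np.dot(w, np.asarray(preds, dtype=float)))

def top_mass_subset(post_probs, mass=0.9):
    """Indices of the fewest highest-probability models whose combined
    posterior probability reaches `mass` (an assumed threshold)."""
    p = np.asarray(post_probs, dtype=float)
    order = np.argsort(p)[::-1]              # most probable model first
    cum = np.cumsum(p[order])
    k = int(np.searchsorted(cum, mass)) + 1  # first position reaching the mass
    return order[:k]

# Hypothetical posterior model probabilities and predicted bloom probabilities.
post_probs = np.array([0.55, 0.20, 0.10, 0.08, 0.05, 0.02])
preds = np.array([0.70, 0.65, 0.80, 0.60, 0.75, 0.50])

full = bma_predict(post_probs, preds)                  # all six models
keep = top_mass_subset(post_probs, mass=0.9)           # reduced model set
reduced = bma_predict(post_probs[keep], preds[keep])   # renormalised weights

print(f"full BMA: {full:.4f}, reduced BMA ({len(keep)} models): {reduced:.4f}")
```

When a few models hold most of the posterior mass, the two averages tend to be close, which is consistent with the small difference in predictive capacity reported above.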
Abstract:
Nitrous oxide (N2O) is one of the greenhouse gases that can contribute to global warming. Spatial variability of N2O can lead to large uncertainties in prediction. However, previous studies have often ignored spatial dependency when quantifying the relationships between N2O and environmental factors. Few studies have examined the impacts of various spatial correlation structures (e.g. independence, distance-based and neighbourhood-based) on spatial prediction of N2O emissions. This study aimed to assess the impact of three spatial correlation structures on spatial predictions and to calibrate the spatial prediction using Bayesian model averaging (BMA) based on replicated, irregular point-referenced data. The data were measured in 17 chambers randomly placed across a 271 m² field between October 2007 and September 2008 in the southeast of Australia. We used a Bayesian geostatistical model and a Bayesian spatial conditional autoregressive (CAR) model to investigate and accommodate spatial dependency, and to estimate the effects of environmental variables on N2O emissions across the study site. We compared these with a Bayesian regression model with independent errors. The three approaches resulted in different derived maps of spatial prediction of N2O emissions. We found that incorporating spatial dependency in the model not only substantially improved predictions of N2O emission from soil, but also better quantified uncertainties of soil parameters in the study. The hybrid model structure obtained by BMA improved the accuracy of spatial prediction of N2O emissions across this study region.
Abstract:
Definition of disease phenotype is a necessary preliminary to research into genetic causes of a complex disease. Clinical diagnosis of migraine is currently based on diagnostic criteria developed by the International Headache Society. Previously, we examined the natural clustering of these diagnostic symptoms using latent class analysis (LCA) and found that a four-class model was preferred. However, the classes can be ordered such that all symptoms progressively intensify, suggesting that a single continuous variable representing disease severity may provide a better model. Here, we compare two models: item response theory and LCA, each constructed within a Bayesian context. A deviance information criterion is used to assess model fit. We phenotyped our population sample using these models, estimated heritability and conducted genome-wide linkage analysis using Merlin-qtl. LCA with four classes was again preferred. After transformation, phenotypic trait values derived from both models are highly correlated (correlation = 0.99) and consequently results from subsequent genetic analyses were similar. Heritability was estimated at 0.37, while multipoint linkage analysis produced genome-wide significant linkage to chromosome 7q31-q33 and suggestive linkage to chromosomes 1 and 2. We argue that such continuous measures are a powerful tool for identifying genes contributing to migraine susceptibility.
Abstract:
Genetic research of complex diseases is a challenging, but exciting, area of research. The early development of the research was limited, however, until the completion of the Human Genome and HapMap projects, along with the reduction in the cost of genotyping, which paved the way for understanding the genetic composition of complex diseases. In this thesis, we focus on the statistical methods for two aspects of genetic research: phenotype definition for diseases with complex etiology and methods for identifying potentially associated Single Nucleotide Polymorphisms (SNPs) and SNP-SNP interactions. With regard to phenotype definition for diseases with complex etiology, we first investigate the effects of different statistical phenotyping approaches on the subsequent analysis. In light of the findings, and the difficulties in validating the estimated phenotype, we propose two different methods for reconciling phenotypes of different models using Bayesian model averaging as a coherent mechanism for accounting for model uncertainty. In the second part of the thesis, the focus turns to methods for identifying associated SNPs and SNP interactions. We review the use of Bayesian logistic regression with variable selection for SNP identification and extend the model to detect interaction effects in population-based case-control studies. In this part of the study, we also develop a machine learning algorithm to cope with large-scale data analysis, namely modified Logic Regression with Genetic Program (MLR-GEP), which is then compared with the Bayesian model, Random Forests and other variants of logic regression.
Abstract:
This thesis developed and applied Bayesian models for the analysis of survival data. Gene expression measurements were used as explanatory variables within the Bayesian survival model, which can be considered the new contribution to the analysis of such data. The censoring that is inherent in survival data was also addressed in terms of its impact on fitting a finite mixture of Weibull distributions with and without covariates. To investigate this, simulation studies were carried out under several censoring percentages. A censoring percentage as high as 80% is acceptable here, as the work involved high-dimensional data. Lastly, a Bayesian model averaging approach was developed to incorporate model uncertainty in the prediction of survival.
Abstract:
Soil-based emissions of nitrous oxide (N2O), a well-known greenhouse gas, have been associated with changes in soil water-filled pore space (WFPS) and soil temperature in many previous studies. However, it is acknowledged that the environment-N2O relationship is complex and still relatively poorly understood. In this article, we employed a Bayesian model selection approach (reversible jump Markov chain Monte Carlo) to develop a data-informed model of the relationship between daily N2O emissions and daily WFPS and soil temperature measurements between March 2007 and February 2009 from a soil under pasture in Queensland, Australia, taking seasonal factors and time-lagged effects into account. The model indicates a very strong relationship between a hybrid seasonal structure and daily N2O emission, with the latter substantially increased in summer. Given the other variables in the model, daily soil WFPS, lagged by a week, had a negative influence on daily N2O; there was evidence of a nonlinear positive relationship between daily soil WFPS and daily N2O emission; and daily soil temperature tended to have a linear positive relationship with daily N2O emission when daily soil temperature was above a threshold of approximately 19°C. We suggest that this flexible Bayesian modeling approach could facilitate greater understanding of the shape of the covariate-N2O flux relation and detection of effect thresholds in the natural temporal variation of environmental variables on N2O emission.
Abstract:
This study considered the problem of predicting survival, based on three alternative models: a single Weibull, a mixture of Weibulls and a cure model. Instead of the common procedure of choosing a single “best” model, where “best” is defined in terms of goodness of fit to the data, a Bayesian model averaging (BMA) approach was adopted to account for model uncertainty. This was illustrated using a case study in which the aim was the description of lymphoma cancer survival with covariates given by phenotypes and gene expression. The results of this study indicate that if the sample size is sufficiently large, one of the three models emerges as having the highest probability given the data, as indicated by the goodness-of-fit measure, the Bayesian information criterion (BIC). However, when the sample size was reduced, no single model was revealed as “best”, suggesting that a BMA approach would be appropriate. Although a BMA approach can compromise on goodness of fit to the data (when compared to the true model), it can provide robust predictions and facilitate more detailed investigation of the relationships between gene expression and patient survival. Keywords: Bayesian modelling; Bayesian model averaging; Cure model; Markov Chain Monte Carlo; Mixture model; Survival analysis; Weibull distribution
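One common way to turn BIC values into approximate posterior model probabilities is to weight each model by exp(-BIC/2) and normalise, assuming equal prior probability for every model. The abstract does not say whether the authors used exactly this approximation, so the sketch below is only an illustration; the BIC values, the per-model survival probabilities and the function name are hypothetical.

```python
import math

def bic_weights(bics):
    """Approximate posterior model probabilities from BIC values,
    assuming equal prior probability for every model."""
    best = min(bics.values())  # subtract the minimum for numerical stability
    raw = {name: math.exp(-0.5 * (b - best)) for name, b in bics.items()}
    total = sum(raw.values())
    return {name: r / total for name, r in raw.items()}

# Hypothetical BIC values for the three candidate models.
large_sample = {"weibull": 1250.0, "weibull_mixture": 1232.0, "cure": 1246.0}
small_sample = {"weibull": 310.5, "weibull_mixture": 310.0, "cure": 311.0}

w_large = bic_weights(large_sample)  # one model clearly dominates
w_small = bic_weights(small_sample)  # support is spread across models

# Hypothetical five-year survival probabilities predicted by each model.
surv = {"weibull": 0.42, "weibull_mixture": 0.48, "cure": 0.55}
bma_surv = sum(w_small[m] * surv[m] for m in surv)

print({m: round(w, 3) for m, w in w_large.items()})
print({m: round(w, 3) for m, w in w_small.items()})
print(round(bma_surv, 3))
```

With the well-separated BIC values nearly all weight falls on a single model, while with the closely spaced values no model receives a majority, which is the situation in which averaging the predictions is most useful.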
Abstract:
Monitoring stream networks through time provides important ecological information. The sampling design problem is to choose locations where measurements are taken so as to maximise information gathered about physicochemical and biological variables on the stream network. This paper uses a pseudo-Bayesian approach, averaging a utility function over a prior distribution, in finding a design which maximizes the average utility. We use models for correlations of observations on the stream network that are based on stream network distances and described by moving average error models. Utility functions used reflect the needs of the experimenter, such as prediction of location values or estimation of parameters. We propose an algorithmic approach to design with the mean utility of a design estimated using Monte Carlo techniques and an exchange algorithm to search for optimal sampling designs. In particular we focus on the problem of finding an optimal design from a set of fixed designs and finding an optimal subset of a given set of sampling locations. As there are many different variables to measure, such as chemical, physical and biological measurements at each location, designs are derived from models based on different types of response variables: continuous, counts and proportions. We apply the methodology to a synthetic example and the Lake Eacham stream network on the Atherton Tablelands in Queensland, Australia. We show that the optimal designs depend very much on the choice of utility function, varying from space filling to clustered designs and mixtures of these, but given the utility function, designs are relatively robust to the type of response variable.
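The combination of a Monte Carlo estimate of mean utility with an exchange search can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's method: it swaps in Euclidean distances and an exponential correlation function in place of stream-network distances and moving average error models, uses a prediction-oriented utility (negative mean simple-kriging variance), and draws the correlation range from an assumed uniform prior. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate sampling locations (Euclidean coordinates, an assumption).
sites = rng.uniform(0.0, 10.0, size=(12, 2))
dist = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)

# Assumed prior on the correlation range, represented by Monte Carlo draws.
prior_ranges = rng.uniform(0.5, 5.0, size=100)

def utility(design, corr_range):
    """Negative mean simple-kriging variance over all candidate sites,
    for a unit-variance exponential correlation model."""
    corr = np.exp(-dist / corr_range)
    c_dd = corr[np.ix_(design, design)] + 1e-8 * np.eye(len(design))
    c_sd = corr[:, design]
    weights = np.linalg.solve(c_dd, c_sd.T)
    variance = 1.0 - np.sum(c_sd.T * weights, axis=0)
    return -variance.mean()

def expected_utility(design):
    """Monte Carlo estimate of the utility averaged over the prior draws."""
    return float(np.mean([utility(design, r) for r in prior_ranges]))

def exchange(initial, n_sites):
    """Swap one design point at a time for an unused candidate site,
    keeping any swap that increases the estimated mean utility."""
    design = list(initial)
    best = expected_utility(design)
    improved = True
    while improved:
        improved = False
        for pos in range(len(design)):
            for cand in range(n_sites):
                if cand in design:
                    continue
                trial = design.copy()
                trial[pos] = cand
                u = expected_utility(trial)
                if u > best + 1e-12:
                    design, best, improved = trial, u, True
    return design, best

initial = [0, 1, 2, 3]
u_initial = expected_utility(initial)
design, u_final = exchange(initial, len(sites))
print(sorted(design), u_initial, u_final)
```

Reusing the same prior draws for every design keeps the comparisons between designs deterministic, so the search is guaranteed to stop. A different utility, for example one aimed at parameter estimation, could be substituted without changing the search loop.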
Abstract:
Statistical modeling of traffic crashes has been of interest to researchers for decades. Over the most recent decade many crash models have accounted for extra-variation in crash counts—variation over and above that accounted for by the Poisson density. The extra-variation – or dispersion – is theorized to capture unaccounted for variation in crashes across sites. The majority of studies have assumed fixed dispersion parameters in over-dispersed crash models—tantamount to assuming that unaccounted for variation is proportional to the expected crash count. Miaou and Lord [Miaou, S.P., Lord, D., 2003. Modeling traffic crash-flow relationships for intersections: dispersion parameter, functional form, and Bayes versus empirical Bayes methods. Transport. Res. Rec. 1840, 31–40] challenged the fixed dispersion parameter assumption, and examined various dispersion parameter relationships when modeling urban signalized intersection accidents in Toronto. They suggested that further work is needed to determine the appropriateness of the findings for rural as well as other intersection types, to corroborate their findings, and to explore alternative dispersion functions. This study builds upon the work of Miaou and Lord, with exploration of additional dispersion functions, the use of an independent data set, and presents an opportunity to corroborate their findings. Data from Georgia are used in this study. A Bayesian modeling approach with non-informative priors is adopted, using sampling-based estimation via Markov Chain Monte Carlo (MCMC) and the Gibbs sampler. A total of eight model specifications were developed; four of them employed traffic flows as explanatory factors in mean structure while the remainder of them included geometric factors in addition to major and minor road traffic flows. The models were compared and contrasted using the significance of coefficients, standard deviance, chi-square goodness-of-fit, and deviance information criteria (DIC) statistics. 
The findings indicate that the modeling of the dispersion parameter, which essentially explains the extra-variance structure, depends greatly on how the mean structure is modeled. In the presence of a well-defined mean function, the extra-variance structure generally becomes insignificant, i.e. the variance structure is a simple function of the mean. It appears that extra-variation is a function of covariates when the mean structure (expected crash count) is poorly specified and suffers from omitted variables. In contrast, when sufficient explanatory variables are used to model the mean (expected crash count), extra-Poisson variation is not significantly related to these variables. If these results are generalizable, they suggest that model specification may be improved by testing extra-variation functions for significance. They also suggest that known influences on expected crash counts are likely to differ from the factors that might help to explain unaccounted-for variation in crashes across sites.
Abstract:
Catchment and riparian degradation has resulted in declining ecosystem health of streams worldwide. With restoration a priority in many regions, there is an increasing interest in the scale at which land use influences stream ecosystem health. Our goal was to use a substantial data set collected as part of a monitoring program (the Southeast Queensland, Australia, Ecological Health Monitoring Program data set, collected at 116 sites over six years) to identify the spatial scale of land use, or the combination of spatial scales, that most strongly influences overall ecosystem health. In addition, we aimed to determine whether the most influential scale differed for different aspects of ecosystem health. We used linear mixed models and a Bayesian model-averaging approach to generate models for the overall aggregated ecosystem health score and for each of the five component indicators (fish, macroinvertebrates, water quality, nutrients, and ecosystem processes) that make up the score. Dense forest close to the survey site, mid-dense forest in the hydrologically active nearstream areas of the catchment, urbanization in the riparian buffer, and tree cover at the reach scale were all significant in explaining ecosystem health, suggesting an overriding influence of forest cover, particularly close to the stream. Season and antecedent rainfall were also important explanatory variables, with some land-use variables showing significant seasonal interactions. There were also differential influences of land use for each of the component indicators. Our approach is useful given that restoring general ecosystem health is the focus of many stream restoration projects; it allowed us to predict the scale and catchment position of restoration that would result in the greatest improvement of ecosystem health in the region's streams and rivers.
The models we generated suggested that good ecosystem health can be maintained in catchments where 80% of hydrologically active areas in close proximity to the stream have mid-dense forest cover and moderate health can be obtained with 60% cover.