750 resultados para zero-inflated data
em Queensland University of Technology - ePrints Archive
Resumo:
There has been considerable research conducted over the last 20 years focused on predicting motor vehicle crashes on transportation facilities. The range of statistical models commonly applied includes binomial, Poisson, Poisson-gamma (or negative binomial), zero-inflated Poisson and negative binomial models (ZIP and ZINB), and multinomial probability models. Given the range of possible modeling approaches and the host of assumptions with each modeling approach, making an intelligent choice for modeling motor vehicle crash data is difficult. There is little discussion in the literature comparing different statistical modeling approaches, identifying which statistical models are most appropriate for modeling crash data, and providing a strong justification from basic crash principles. In the recent literature, it has been suggested that the motor vehicle crash process can successfully be modeled by assuming a dual-state data-generating process, which implies that entities (e.g., intersections, road segments, pedestrian crossings, etc.) exist in one of two states—perfectly safe and unsafe. As a result, the ZIP and ZINB are two models that have been applied to account for the preponderance of “excess” zeros frequently observed in crash count data. The objective of this study is to provide defensible guidance on how to appropriate model crash data. We first examine the motor vehicle crash process using theoretical principles and a basic understanding of the crash process. It is shown that the fundamental crash process follows a Bernoulli trial with unequal probability of independent events, also known as Poisson trials. We examine the evolution of statistical models as they apply to the motor vehicle crash process, and indicate how well they statistically approximate the crash process. We also present the theory behind dual-state process count models, and note why they have become popular for modeling crash data. A simulation experiment is then conducted to demonstrate how crash data give rise to “excess” zeros frequently observed in crash data. It is shown that the Poisson and other mixed probabilistic structures are approximations assumed for modeling the motor vehicle crash process. Furthermore, it is demonstrated that under certain (fairly common) circumstances excess zeros are observed—and that these circumstances arise from low exposure and/or inappropriate selection of time/space scales and not an underlying dual state process. In conclusion, carefully selecting the time/space scales for analysis, including an improved set of explanatory variables and/or unobserved heterogeneity effects in count regression models, or applying small-area statistical methods (observations with low exposure) represent the most defensible modeling approaches for datasets with a preponderance of zeros
Resumo:
The intent of this note is to succinctly articulate additional points that were not provided in the original paper (Lord et al., 2005) and to help clarify a collective reluctance to adopt zero-inflated (ZI) models for modeling highway safety data. A dialogue on this important issue, just one of many important safety modeling issues, is healthy discourse on the path towards improved safety modeling. This note first provides a summary of prior findings and conclusions of the original paper. It then presents two critical and relevant issues: the maximizing statistical fit fallacy and logic problems with the ZI model in highway safety modeling. Finally, we provide brief conclusions.
Resumo:
The main objective of this PhD was to further develop Bayesian spatio-temporal models (specifically the Conditional Autoregressive (CAR) class of models), for the analysis of sparse disease outcomes such as birth defects. The motivation for the thesis arose from problems encountered when analyzing a large birth defect registry in New South Wales. The specific components and related research objectives of the thesis were developed from gaps in the literature on current formulations of the CAR model, and health service planning requirements. Data from a large probabilistically-linked database from 1990 to 2004, consisting of fields from two separate registries: the Birth Defect Registry (BDR) and Midwives Data Collection (MDC) were used in the analyses in this thesis. The main objective was split into smaller goals. The first goal was to determine how the specification of the neighbourhood weight matrix will affect the smoothing properties of the CAR model, and this is the focus of chapter 6. Secondly, I hoped to evaluate the usefulness of incorporating a zero-inflated Poisson (ZIP) component as well as a shared-component model in terms of modeling a sparse outcome, and this is carried out in chapter 7. The third goal was to identify optimal sampling and sample size schemes designed to select individual level data for a hybrid ecological spatial model, and this is done in chapter 8. Finally, I wanted to put together the earlier improvements to the CAR model, and along with demographic projections, provide forecasts for birth defects at the SLA level. Chapter 9 describes how this is done. For the first objective, I examined a series of neighbourhood weight matrices, and showed how smoothing the relative risk estimates according to similarity by an important covariate (i.e. maternal age) helped improve the model’s ability to recover the underlying risk, as compared to the traditional adjacency (specifically the Queen) method of applying weights. Next, to address the sparseness and excess zeros commonly encountered in the analysis of rare outcomes such as birth defects, I compared a few models, including an extension of the usual Poisson model to encompass excess zeros in the data. This was achieved via a mixture model, which also encompassed the shared component model to improve on the estimation of sparse counts through borrowing strength across a shared component (e.g. latent risk factor/s) with the referent outcome (caesarean section was used in this example). Using the Deviance Information Criteria (DIC), I showed how the proposed model performed better than the usual models, but only when both outcomes shared a strong spatial correlation. The next objective involved identifying the optimal sampling and sample size strategy for incorporating individual-level data with areal covariates in a hybrid study design. I performed extensive simulation studies, evaluating thirteen different sampling schemes along with variations in sample size. This was done in the context of an ecological regression model that incorporated spatial correlation in the outcomes, as well as accommodating both individual and areal measures of covariates. Using the Average Mean Squared Error (AMSE), I showed how a simple random sample of 20% of the SLAs, followed by selecting all cases in the SLAs chosen, along with an equal number of controls, provided the lowest AMSE. The final objective involved combining the improved spatio-temporal CAR model with population (i.e. women) forecasts, to provide 30-year annual estimates of birth defects at the Statistical Local Area (SLA) level in New South Wales, Australia. The projections were illustrated using sixteen different SLAs, representing the various areal measures of socio-economic status and remoteness. A sensitivity analysis of the assumptions used in the projection was also undertaken. By the end of the thesis, I will show how challenges in the spatial analysis of rare diseases such as birth defects can be addressed, by specifically formulating the neighbourhood weight matrix to smooth according to a key covariate (i.e. maternal age), incorporating a ZIP component to model excess zeros in outcomes and borrowing strength from a referent outcome (i.e. caesarean counts). An efficient strategy to sample individual-level data and sample size considerations for rare disease will also be presented. Finally, projections in birth defect categories at the SLA level will be made.
Resumo:
This research assesses the potential impact of weekly weather variability on the incidence of cryptosporidiosis disease using time series zero-inflated Poisson (ZIP) and classification and regression tree (CART) models. Data on weather variables, notified cryptosporidiosis cases and population size in Brisbane were supplied by the Australian Bureau of Meteorology, Queensland Department of Health, and Australian Bureau of Statistics, respectively. Both time series ZIP and CART models show a clear association between weather variables (maximum temperature, relative humidity, rainfall and wind speed) and cryptosporidiosis disease. The time series CART models indicated that, when weekly maximum temperature exceeded 31°C and relative humidity was less than 63%, the relative risk of cryptosporidiosis rose by 13.64 (expected morbidity: 39.4; 95% confidence interval: 30.9–47.9). These findings may have applications as a decision support tool in planning disease control and risk management programs for cryptosporidiosis disease.
Resumo:
Poisson distribution has often been used for count like accident data. Negative Binomial (NB) distribution has been adopted in the count data to take care of the over-dispersion problem. However, Poisson and NB distributions are incapable of taking into account some unobserved heterogeneities due to spatial and temporal effects of accident data. To overcome this problem, Random Effect models have been developed. Again another challenge with existing traffic accident prediction models is the distribution of excess zero accident observations in some accident data. Although Zero-Inflated Poisson (ZIP) model is capable of handling the dual-state system in accident data with excess zero observations, it does not accommodate the within-location correlation and between-location correlation heterogeneities which are the basic motivations for the need of the Random Effect models. This paper proposes an effective way of fitting ZIP model with location specific random effects and for model calibration and assessment the Bayesian analysis is recommended.
Resumo:
S. japonicum infection is believed to be endemic in 28 of the 80 provinces of the Philippines and the most recent data on schistosomiasis prevalence have shown considerable variability between provinces. In order to increase the efficient allocation of parasitic disease control resources in the country, we aimed to describe the small scale spatial variation in S. japonicum prevalence across the Philippines, quantify the role of the physical environment in driving the spatial variation of S. japonicum, and develop a predictive risk map of S. japonicum infection. Data on S. japonicum infection from 35,754 individuals across the country were geo-located at the barangay level and included in the analysis. The analysis was then stratified geographically for Luzon, the Visayas and Mindanao. Zero-inflated binomial Bayesian geostatistical models of S. japonicum prevalence were developed and diagnostic uncertainty was incorporated. Results of the analysis show that in the three regions, males and individuals aged ≥ 20 years had significantly higher prevalence of S. japonicum compared with females and children <5 years. The role of the environmental variables differed between regions of the Philippines. S. japonicum infection was widespread in the Visayas whereas it was much more focal in Luzon and Mindanao. This analysis revealed significant spatial variation in prevalence of S. japonicum infection in the Philippines. This suggests that a spatially targeted approach to schistosomiasis interventions, including mass drug administration, is warranted. When financially possible, additional schistosomiasis surveys should be prioritized to areas identified to be at high risk, but which were underrepresented in our dataset.
Resumo:
This paper develops a semiparametric estimation approach for mixed count regression models based on series expansion for the unknown density of the unobserved heterogeneity. We use the generalized Laguerre series expansion around a gamma baseline density to model unobserved heterogeneity in a Poisson mixture model. We establish the consistency of the estimator and present a computational strategy to implement the proposed estimation techniques in the standard count model as well as in truncated, censored, and zero-inflated count regression models. Monte Carlo evidence shows that the finite sample behavior of the estimator is quite good. The paper applies the method to a model of individual shopping behavior. © 1999 Elsevier Science S.A. All rights reserved.
Resumo:
Established Monte Carlo user codes BEAMnrc and DOSXYZnrc permit the accurate and straightforward simulation of radiotherapy experiments and treatments delivered from multiple beam angles. However, when an electronic portal imaging detector (EPID) is included in these simulations, treatment delivery from non-zero beam angles becomes problematic. This study introduces CTCombine, a purpose-built code for rotating selected CT data volumes, converting CT numbers to mass densities, combining the results with model EPIDs and writing output in a form which can easily be read and used by the dose calculation code DOSXYZnrc. The geometric and dosimetric accuracy of CTCombine’s output has been assessed by simulating simple and complex treatments applied to a rotated planar phantom and a rotated humanoid phantom and comparing the resulting virtual EPID images with the images acquired using experimental measurements and independent simulations of equivalent phantoms. It is expected that CTCombine will be useful for Monte Carlo studies of EPID dosimetry as well as other EPID imaging applications.
Resumo:
This dissertation is primarily an applied statistical modelling investigation, motivated by a case study comprising real data and real questions. Theoretical questions on modelling and computation of normalization constants arose from pursuit of these data analytic questions. The essence of the thesis can be described as follows. Consider binary data observed on a two-dimensional lattice. A common problem with such data is the ambiguity of zeroes recorded. These may represent zero response given some threshold (presence) or that the threshold has not been triggered (absence). Suppose that the researcher wishes to estimate the effects of covariates on the binary responses, whilst taking into account underlying spatial variation, which is itself of some interest. This situation arises in many contexts and the dingo, cypress and toad case studies described in the motivation chapter are examples of this. Two main approaches to modelling and inference are investigated in this thesis. The first is frequentist and based on generalized linear models, with spatial variation modelled by using a block structure or by smoothing the residuals spatially. The EM algorithm can be used to obtain point estimates, coupled with bootstrapping or asymptotic MLE estimates for standard errors. The second approach is Bayesian and based on a three- or four-tier hierarchical model, comprising a logistic regression with covariates for the data layer, a binary Markov Random field (MRF) for the underlying spatial process, and suitable priors for parameters in these main models. The three-parameter autologistic model is a particular MRF of interest. Markov chain Monte Carlo (MCMC) methods comprising hybrid Metropolis/Gibbs samplers is suitable for computation in this situation. Model performance can be gauged by MCMC diagnostics. Model choice can be assessed by incorporating another tier in the modelling hierarchy. This requires evaluation of a normalization constant, a notoriously difficult problem. Difficulty with estimating the normalization constant for the MRF can be overcome by using a path integral approach, although this is a highly computationally intensive method. Different methods of estimating ratios of normalization constants (N Cs) are investigated, including importance sampling Monte Carlo (ISMC), dependent Monte Carlo based on MCMC simulations (MCMC), and reverse logistic regression (RLR). I develop an idea present though not fully developed in the literature, and propose the Integrated mean canonical statistic (IMCS) method for estimating log NC ratios for binary MRFs. The IMCS method falls within the framework of the newly identified path sampling methods of Gelman & Meng (1998) and outperforms ISMC, MCMC and RLR. It also does not rely on simplifying assumptions, such as ignoring spatio-temporal dependence in the process. A thorough investigation is made of the application of IMCS to the three-parameter Autologistic model. This work introduces background computations required for the full implementation of the four-tier model in Chapter 7. Two different extensions of the three-tier model to a four-tier version are investigated. The first extension incorporates temporal dependence in the underlying spatio-temporal process. The second extensions allows the successes and failures in the data layer to depend on time. The MCMC computational method is extended to incorporate the extra layer. A major contribution of the thesis is the development of a fully Bayesian approach to inference for these hierarchical models for the first time. Note: The author of this thesis has agreed to make it open access but invites people downloading the thesis to send her an email via the 'Contact Author' function.
Resumo:
Established Monte Carlo user codes BEAMnrc and DOSXYZnrc permit the accurate and straightforward simulation of radiotherapy experiments and treatments delivered from multiple beam angles. However, when an electronic portal imaging detector (EPID) is included in these simulations, treatment delivery from non-zero beam angles becomes problematic. This study introduces CTCombine, a purpose-built code for rotating selected CT data volumes, converting CT numbers to mass densities, combining the results with model EPIDs and writing output in a form which can easily be read and used by the dose calculation code DOSXYZnrc...
Resumo:
Zero energy buildings (ZEB) and zero energy homes (ZEH) are a current hot topic globally for policy makers (what are the benefits and costs), designers (how do we design them), the construction industry (can we build them), marketing (will consumers buy them) and researchers (do they work and what are the implications). This paper presents initial findings from actual measured data from a 9 star (as built), off-ground detached family home constructed in south-east Queensland in 2008. The integrated systems approach to the design of the house is analysed in each of its three main goals: maximising the thermal performance of the building envelope, minimising energy demand whilst maintaining energy service levels, and implementing a multi-pronged low carbon approach to energy supply. The performance outcomes of each of these stages are evaluated against definitions of Net Zero Carbon / Net Zero Emissions (Site and Source) and Net Zero Energy (onsite generation v primary energy imports). The paper will conclude with a summary of the multiple benefits of combining very high efficiency building envelopes with diverse energy management strategies: a robustness, resilience, affordability and autonomy not generally seen in housing.
Resumo:
Objective: To evaluate the impact of a government triple zero community awareness campaign on the characteristics of patients attending an ED. Methods: A study using Emergency Department Information System data was conducted in an adult metropolitan tertiary-referral teaching hospital in Brisbane. The three outcomes measured in the 3 month post-campaign period were arrival mode, Australasian Triage Scale and departure status. These measures reflect ambulance usage, clinical urgency and illness severity, respectively. They were compared with those in the 3 month pre-campaign period. Multivariate logistic regression models were used to investigate the impacts of the campaign on each of the three outcome measures after controlling for age, sex, day and time of arrival, and daily minimum temperature. Results: There were 17 920 visits in the pre- and 17 793 visits in the post-campaign period. After the campaign, fewer patients arrived at the ED by road ambulance (odds ratio [OR] 0.90, 95% confidence interval [CI] 0.80–1.00), although the impact of the campaign on the arrival mode was only close to statistical significance (Wald χ2-test, P= 0.055); and patients were significantly less likely to have higher clinical urgency (OR 0.86, 95% CI 0.79–0.94), while more likely to be admitted (OR 1.68, 95% CI 1.38–2.05) or complete treatment in the ED (OR 1.46, 95% CI 1.23–1.73) instead of leaving without waiting to be seen. Conclusions: The campaign had no significant impact on the arrival mode of the patients. After the campaign, the illness acuity of the patients decreased, whereas the illness severity of the patients increased.
Resumo:
Background The expansion of cell colonies is driven by a delicate balance of several mechanisms including cell motility, cell-to-cell adhesion and cell proliferation. New approaches that can be used to independently identify and quantify the role of each mechanism will help us understand how each mechanism contributes to the expansion process. Standard mathematical modelling approaches to describe such cell colony expansion typically neglect cell-to-cell adhesion, despite the fact that cell-to-cell adhesion is thought to play an important role. Results We use a combined experimental and mathematical modelling approach to determine the cell diffusivity, D, cell-to-cell adhesion strength, q, and cell proliferation rate, ?, in an expanding colony of MM127 melanoma cells. Using a circular barrier assay, we extract several types of experimental data and use a mathematical model to independently estimate D, q and ?. In our first set of experiments, we suppress cell proliferation and analyse three different types of data to estimate D and q. We find that standard types of data, such as the area enclosed by the leading edge of the expanding colony and more detailed cell density profiles throughout the expanding colony, does not provide sufficient information to uniquely identify D and q. We find that additional data relating to the degree of cell-to-cell clustering is required to provide independent estimates of q, and in turn D. In our second set of experiments, where proliferation is not suppressed, we use data describing temporal changes in cell density to determine the cell proliferation rate. In summary, we find that our experiments are best described using the range D = 161 - 243 ?m2 hour-1, q = 0.3 - 0.5 (low to moderate strength) and ? = 0.0305 - 0.0398 hour-1, and with these parameters we can accurately predict the temporal variations in the spatial extent and cell density profile throughout the expanding melanoma cell colony. Conclusions Our systematic approach to identify the cell diffusivity, cell-to-cell adhesion strength and cell proliferation rate highlights the importance of integrating multiple types of data to accurately quantify the factors influencing the spatial expansion of melanoma cell colonies.
Resumo:
This study extends the ‘zero scan’ method for CT imaging of polymer gel dosimeters to include multi-slice acquisitions. Multi slice CT images consisting of 24 slices of 1.2 mm thickness were acquired of an irradiated polymer gel dosimeter, and processed with the zero scan technique. The results demonstrate that zero scan based gel readout can be successfully applied to generate a three dimensional image of the irradiated gel field. Compared to the raw CT images the processed figures and cross gel profiles demonstrated reduced noise and clear visibility of the penumbral region. Moreover these improved results further highlight the suitability of this method in volumetric reconstruction with reduced CT data acquisition per slice. This work shows that 3D volumes of irradiated polymer gel dosimeters can be acquired and processed with x-ray CT.