868 results for Zero-inflated models, Statistical models, Poisson, Negative binomial, Statistical methods
Abstract:
The intent of this note is to succinctly articulate additional points that were not provided in the original paper (Lord et al., 2005) and to help clarify a collective reluctance to adopt zero-inflated (ZI) models for modeling highway safety data. A dialogue on this important issue, just one of many important safety modeling issues, is healthy discourse on the path towards improved safety modeling. This note first provides a summary of prior findings and conclusions of the original paper. It then presents two critical and relevant issues: the maximizing statistical fit fallacy and logic problems with the ZI model in highway safety modeling. Finally, we provide brief conclusions.
Abstract:
There has been considerable research conducted over the last 20 years focused on predicting motor vehicle crashes on transportation facilities. The range of statistical models commonly applied includes binomial, Poisson, Poisson-gamma (or negative binomial), zero-inflated Poisson and negative binomial models (ZIP and ZINB), and multinomial probability models. Given the range of possible modeling approaches and the host of assumptions with each modeling approach, making an intelligent choice for modeling motor vehicle crash data is difficult. There is little discussion in the literature comparing different statistical modeling approaches, identifying which statistical models are most appropriate for modeling crash data, and providing a strong justification from basic crash principles. In the recent literature, it has been suggested that the motor vehicle crash process can successfully be modeled by assuming a dual-state data-generating process, which implies that entities (e.g., intersections, road segments, pedestrian crossings, etc.) exist in one of two states: perfectly safe and unsafe. As a result, the ZIP and ZINB are two models that have been applied to account for the preponderance of “excess” zeros frequently observed in crash count data. The objective of this study is to provide defensible guidance on how to appropriately model crash data. We first examine the motor vehicle crash process using theoretical principles and a basic understanding of the crash process. It is shown that the fundamental crash process follows a Bernoulli trial with unequal probability of independent events, also known as Poisson trials. We examine the evolution of statistical models as they apply to the motor vehicle crash process, and indicate how well they statistically approximate the crash process. We also present the theory behind dual-state process count models, and note why they have become popular for modeling crash data.
A simulation experiment is then conducted to demonstrate how crash data give rise to the “excess” zeros frequently observed in crash data. It is shown that the Poisson and other mixed probabilistic structures are approximations assumed for modeling the motor vehicle crash process. Furthermore, it is demonstrated that under certain (fairly common) circumstances excess zeros are observed, and that these circumstances arise from low exposure and/or inappropriate selection of time/space scales and not an underlying dual state process. In conclusion, carefully selecting the time/space scales for analysis, including an improved set of explanatory variables and/or unobserved heterogeneity effects in count regression models, or applying small-area statistical methods (observations with low exposure) represent the most defensible modeling approaches for datasets with a preponderance of zeros.
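The low-exposure argument in this abstract can be illustrated with a small simulation. The sketch below is not the authors' experiment: the exposure levels, the per-trial crash probabilities, and the function names are assumptions chosen for illustration. Every site is a single-state process (a sum of independent Bernoulli trials with unequal probabilities), yet the pooled data tend to show more zeros than a single Poisson distribution with the same mean would predict.

```python
import math
import random


def simulate_site_counts(n_sites, seed=0):
    """Simulate crash counts as Poisson trials (no 'perfectly safe' state)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sites):
        # Hypothetical exposure: most sites carry little traffic, a few are busy.
        trials = rng.choice([20, 50, 100, 2000])
        count = 0
        for _ in range(trials):
            # Each trial has its own small crash probability (assumed range).
            p = rng.uniform(0.0, 0.002)
            if rng.random() < p:
                count += 1
        counts.append(count)
    return counts


def zero_summary(counts):
    """Return (mean, observed zero share, zero share of a Poisson with that mean)."""
    mean = sum(counts) / len(counts)
    observed_zero = sum(1 for c in counts if c == 0) / len(counts)
    poisson_zero = math.exp(-mean)
    return mean, observed_zero, poisson_zero


if __name__ == "__main__":
    mean, observed_zero, poisson_zero = zero_summary(simulate_site_counts(1000))
    print(f"mean={mean:.3f} observed zeros={observed_zero:.3f} "
          f"Poisson zeros={poisson_zero:.3f}")
```

The gap between the two zero shares comes only from mixing sites with very different exposure, which is the kind of circumstance the abstract attributes excess zeros to.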
Abstract:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Abstract:
Boston Harbor has had a history of poor water quality, including contamination by enteric pathogens. We conduct a statistical analysis of data collected by the Massachusetts Water Resources Authority (MWRA) between 1996 and 2002 to evaluate the effects of court-mandated improvements in sewage treatment. Motivated by the ineffectiveness of standard Poisson mixture models and their zero-inflated counterparts, we propose a new negative binomial model for time series of Enterococcus counts in Boston Harbor, where nonstationarity and autocorrelation are modeled using a nonparametric smooth function of time in the predictor. Without further restrictions, this function is not identifiable in the presence of time-dependent covariates; consequently we use a basis orthogonal to the space spanned by the covariates and use penalized quasi-likelihood (PQL) for estimation. We conclude that Enterococcus counts were greatly reduced near the Nut Island Treatment Plant (NITP) outfalls following the transfer of wastewaters from NITP to the Deer Island Treatment Plant (DITP) and that the transfer of wastewaters from Boston Harbor to the offshore diffusers in Massachusetts Bay reduced the Enterococcus counts near the DITP outfalls.
Abstract:
Background: Developing sampling strategies to target biological pests such as insects in stored grain is inherently difficult owing to species biology and behavioural characteristics. The design of robust sampling programmes should be based on an underlying statistical distribution that is sufficiently flexible to capture variations in the spatial distribution of the target species. Results: Comparisons are made of the accuracy of four probability-of-detection sampling models - the negative binomial model [1], the Poisson model [1], the double logarithmic model [2] and the compound model [3] - for detection of insects over a broad range of insect densities. Although the double log and negative binomial models performed well under specific conditions, it is shown that, of the four models examined, the compound model performed the best over a broad range of insect spatial distributions and densities. In particular, this model predicted well the number of samples required when insect density was high and clumped within experimental storages. Conclusions: This paper reinforces the need for effective sampling programmes designed to detect insects over a broad range of spatial distributions. The compound model is robust over a broad range of insect densities and leads to substantial improvement in detection probabilities within highly variable systems such as grain storage.
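As a rough illustration of what a probability-of-detection model computes, the sketch below uses the textbook zero-probability forms for the Poisson and negative binomial distributions. The parameter names (`mean_per_sample`, the aggregation parameter `k`) and the function names are assumptions, and the parameterisation may differ from the one used in the paper. The double logarithmic and compound models are not reproduced here.

```python
import math


def detect_prob_poisson(mean_per_sample, n_samples):
    """P(at least one insect in n samples) when counts per sample are Poisson."""
    return 1.0 - math.exp(-mean_per_sample * n_samples)


def detect_prob_negbin(mean_per_sample, k, n_samples):
    """Same probability under a negative binomial with aggregation parameter k.

    Smaller k means more clumping, so more samples come back empty.
    """
    return 1.0 - (1.0 + mean_per_sample / k) ** (-k * n_samples)


def samples_needed_poisson(mean_per_sample, target):
    """Smallest n with Poisson detection probability of at least `target`."""
    return math.ceil(-math.log(1.0 - target) / mean_per_sample)


if __name__ == "__main__":
    n = samples_needed_poisson(0.1, 0.95)
    print(n, detect_prob_poisson(0.1, n), detect_prob_negbin(0.1, 0.5, n))
```

For the same mean density, the clumped (negative binomial) case gives a lower detection probability than the Poisson case, which is why the choice of distribution affects the number of samples required.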
Abstract:
In this study, we deal with the problem of overdispersion beyond extra zeros for a collection of counts that can be correlated. Poisson, negative binomial, zero-inflated Poisson and zero-inflated negative binomial distributions are considered. First, we propose a multivariate count model in which all counts follow the same distribution and are correlated. Then we extend this model so that correlated counts may follow different distributions. To accommodate correlation among counts, we consider correlated random effects for each individual in the mean structure, thus inducing dependency among observations common to an individual. The method is applied to real data to investigate variation in food resource use in a species of marsupial in a locality of the Brazilian Cerrado biome. © 2013 Copyright Taylor and Francis Group, LLC.
Abstract:
Environmental data are spatial, temporal, and often come with many zeros. In this paper, we included space–time random effects in zero-inflated Poisson (ZIP) and ‘hurdle’ models to investigate haulout patterns of harbor seals on glacial ice. The data consisted of counts, for 18 dates on a lattice grid of samples, of harbor seals hauled out on glacial ice in Disenchantment Bay, near Yakutat, Alaska. A hurdle model is similar to a ZIP model except it does not mix zeros from the binary and count processes. Both models can be used for zero-inflated data, and we compared space–time ZIP and hurdle models in a Bayesian hierarchical model. Space–time ZIP and hurdle models were constructed by using spatial conditional autoregressive (CAR) models and temporal first-order autoregressive (AR(1)) models as random effects in ZIP and hurdle regression models. We created maps of smoothed predictions for harbor seal counts based on ice density, other covariates, and spatio-temporal random effects. For both models, predictions around the edges appeared to be positively biased. The linex loss function is an asymmetric loss function that penalizes overprediction more than underprediction, and we used it to correct for prediction bias to get the best map for space–time ZIP and hurdle models.
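The difference between the two zero-handling schemes, and the asymmetry of the linex loss, can be made concrete with a minimal sketch. It covers only the basic probability mass functions without covariates or random effects, so it is not the space–time model of the paper. The function names and the linex constants `a` and `b` are assumptions made for illustration.

```python
import math


def poisson_pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y, pi_zero, lam):
    """ZIP: zeros come from both the binary part and the Poisson part."""
    if y == 0:
        return pi_zero + (1.0 - pi_zero) * math.exp(-lam)
    return (1.0 - pi_zero) * poisson_pmf(y, lam)


def hurdle_pmf(y, pi_zero, lam):
    """Hurdle: all zeros come from the binary part; positive counts follow
    a zero-truncated Poisson."""
    if y == 0:
        return pi_zero
    return (1.0 - pi_zero) * poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


def linex_loss(prediction, truth, a=1.0, b=1.0):
    """Linex loss b * (exp(a*d) - a*d - 1) with d = prediction - truth.

    With a > 0, overprediction (d > 0) costs more than underprediction.
    """
    d = prediction - truth
    return b * (math.exp(a * d) - a * d - 1.0)


if __name__ == "__main__":
    print(zip_pmf(0, 0.3, 2.0), hurdle_pmf(0, 0.3, 2.0))
    print(linex_loss(3.0, 2.0), linex_loss(1.0, 2.0))
```

With the same `pi_zero`, the ZIP model assigns more probability to zero than the hurdle model, because the Poisson component contributes its own zeros.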
Abstract:
Count data have distributions with particular characteristics such as non-normality, heterogeneity of variances, and a large number of zeros. It is therefore necessary to use appropriate models in order to obtain unbiased results. This thesis compares four analysis models that can be used for count data: the Poisson model, the negative binomial model, the zero-inflated Poisson model, and the zero-inflated negative binomial model. For comparison purposes, the prediction of the proportion of zeros, the confirmation or rejection of the various hypotheses, and the prediction of the means were used to determine the adequacy of the different models. To this end, the number of arrests of street gang members in the territory of Montréal was used for the period from 2005 to 2007. The sample consists of 470 men aged 18 to 59. At the end of the analyses, the most adequate model is the negative binomial model, since it produces significant results, fits the observed data well, and produces a proportion of zeros very similar to the one observed.
Abstract:
In regression analysis of counts, a lack of simple and efficient algorithms for posterior computation has made Bayesian approaches appear unattractive and thus underdeveloped. We propose a lognormal and gamma mixed negative binomial (NB) regression model for counts, and present efficient closed-form Bayesian inference; unlike conventional Poisson models, the proposed approach has two free parameters to include two different kinds of random effects, and allows the incorporation of prior information, such as sparsity in the regression coefficients. By placing a gamma distribution prior on the NB dispersion parameter r, and connecting a log-normal distribution prior with the logit of the NB probability parameter p, efficient Gibbs sampling and variational Bayes inference are both developed. The closed-form updates are obtained by exploiting conditional conjugacy via both a compound Poisson representation and a Polya-Gamma distribution based data augmentation approach. The proposed Bayesian inference can be implemented routinely, while being easily generalizable to more complex settings involving multivariate dependence structures. The algorithms are illustrated using real examples. Copyright 2012 by the author(s)/owner(s).
Abstract:
Count data with excess zeros relative to a Poisson distribution are common in many biomedical applications. A popular approach to the analysis of such data is to use a zero-inflated Poisson (ZIP) regression model. Often, because of the hierarchical study design or the data collection procedure, zero-inflation and lack of independence may occur simultaneously, which render the standard ZIP model inadequate. To account for the preponderance of zero counts and the inherent correlation of observations, a class of multi-level ZIP regression model with random effects is presented. Model fitting is facilitated using an expectation-maximization algorithm, whereas variance components are estimated via residual maximum likelihood estimating equations. A score test for zero-inflation is also presented. The multi-level ZIP model is then generalized to cope with a more complex correlation structure. Application to the analysis of correlated count data from a longitudinal infant feeding study illustrates the usefulness of the approach.
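To show how an expectation-maximization algorithm handles the two sources of zeros, here is a minimal sketch for the simplest case: an intercept-only ZIP model with no covariates and no random effects. It is not the multi-level model of the paper. The simulated data, starting values, and function names are assumptions made for illustration.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_zip(n, pi_zero, lam, seed=0):
    """Simulate n ZIP counts: a structural zero with probability pi_zero."""
    rng = random.Random(seed)
    return [0 if rng.random() < pi_zero else sample_poisson(rng, lam)
            for _ in range(n)]


def fit_zip_em(counts, n_iter=200):
    """Estimate (pi_zero, lam) of an intercept-only ZIP model by EM."""
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    pi_zero, lam = 0.5, max(total / n, 1e-6)  # assumed starting values
    for _ in range(n_iter):
        # E-step: probability that an observed zero is a structural zero.
        w = pi_zero / (pi_zero + (1.0 - pi_zero) * math.exp(-lam))
        expected_structural = n_zero * w
        # M-step: update the mixing share and the Poisson mean.
        pi_zero = expected_structural / n
        lam = total / (n - expected_structural)
    return pi_zero, lam


if __name__ == "__main__":
    data = simulate_zip(5000, pi_zero=0.4, lam=3.0)
    print(fit_zip_em(data))
```

The E-step only needs one weight because every observed zero has the same covariate-free probability of being structural; with covariates, each observation would carry its own weight.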
Abstract:
Temporal replicate counts are often aggregated to improve model fit by reducing zero-inflation and count variability; in the case of migration counts collected hourly throughout a migration, aggregation also allows one to ignore nonindependence. However, aggregation can represent a loss of potentially useful information on the hourly or seasonal distribution of counts, which might impact our ability to estimate reliable trends. We simulated 20-year hourly raptor migration count datasets with a known rate of change to test the effect of aggregating hourly counts to daily or annual totals on our ability to recover the known trend. We simulated data for three types of species, to test whether results varied with species abundance or migration strategy: a commonly detected species, e.g., Northern Harrier, Circus cyaneus; a rarely detected species, e.g., Peregrine Falcon, Falco peregrinus; and a species typically counted in large aggregations with overdispersed counts, e.g., Broad-winged Hawk, Buteo platypterus. We compared accuracy and precision of estimated trends across species and count types (hourly/daily/annual) using hierarchical models that assumed a Poisson, negative binomial (NB) or zero-inflated negative binomial (ZINB) count distribution. We found little benefit of modeling zero-inflation or of modeling the hourly distribution of migration counts. For the rare species, trends analyzed using daily totals and an NB or ZINB data distribution resulted in a higher probability of detecting an accurate and precise trend. In contrast, trends of the common and overdispersed species benefited from aggregation to annual totals, and for the overdispersed species in particular, trends estimated using annual totals were more precise, and resulted in lower probabilities of estimating a trend (1) in the wrong direction, or (2) with credible intervals that excluded the true trend, as compared with hourly and daily counts.
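The effect of aggregation on the share of zeros can be sketched with a toy simulation. This is not the study's simulation design: the season length, hourly rate, annual rate of change, plain Poisson counts, and function names are all assumptions for a hypothetical rarely detected species.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_hourly_counts(years=20, days=30, hours=8, base_rate=0.05,
                           annual_change=-0.02, seed=0):
    """Return data[year][day][hour] with a known proportional annual change."""
    rng = random.Random(seed)
    data = []
    for year in range(years):
        rate = base_rate * (1.0 + annual_change) ** year
        data.append([[sample_poisson(rng, rate) for _ in range(hours)]
                     for _ in range(days)])
    return data


def aggregate(data):
    """Flatten to hourly counts and sum to daily and annual totals."""
    hourly = [c for year in data for day in year for c in day]
    daily = [sum(day) for year in data for day in year]
    annual = [sum(sum(day) for day in year) for year in data]
    return hourly, daily, annual


def zero_fraction(values):
    return sum(1 for v in values if v == 0) / len(values)


if __name__ == "__main__":
    for name, series in zip(("hourly", "daily", "annual"),
                            aggregate(simulate_hourly_counts())):
        print(name, len(series), round(zero_fraction(series), 3))
```

Summing over hours and days leaves the total count unchanged but sharply reduces the proportion of zeros, at the cost of discarding the within-day and within-season distribution of counts.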
Abstract:
At least two important transportation planning activities rely on planning-level crash prediction models. One is motivated by the Transportation Equity Act for the 21st Century, which requires departments of transportation and metropolitan planning organizations to consider safety explicitly in the transportation planning process. The second could arise from a need for state agencies to establish incentive programs to reduce injuries and save lives. Both applications require a forecast of safety for a future period. Planning-level crash prediction models for the Tucson, Arizona, metropolitan region are presented to demonstrate the feasibility of such models. Data were separated into fatal, injury, and property-damage crashes. To accommodate overdispersion in the data, negative binomial regression models were applied. To accommodate the simultaneity of fatality and injury crash outcomes, simultaneous estimation of the models was conducted. All models produce crash forecasts at the traffic analysis zone level. Statistically significant (p-values < 0.05) and theoretically meaningful variables for the fatal crash model included population density, persons 17 years old or younger as a percentage of the total population, and intersection density. Significant variables for the injury and property-damage crash models were population density, number of employees, intersection density, percentage of miles of principal arterials, percentage of miles of minor arterials, and percentage of miles of urban collectors. Among several conclusions, it is suggested that planning-level safety models are feasible and may play a role in future planning activities. However, caution must be exercised with such models.
Abstract:
In many data sets from clinical studies there are patients insusceptible to the occurrence of the event of interest. Survival models which ignore this fact are generally inadequate. The main goal of this paper is to describe an application of the generalized additive models for location, scale, and shape (GAMLSS) framework to the fitting of long-term survival models. In this work, the number of competing causes of the event of interest follows the negative binomial distribution. In this way, some well known models found in the literature are characterized as particular cases of our proposal. The model is conveniently parameterized in terms of the cured fraction, which is then linked to covariates. We explore the use of the gamlss package in R as a powerful tool for inference in long-term survival models. The procedure is illustrated with a numerical example. © 2009 Elsevier Ireland Ltd. All rights reserved.
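The link between a negative binomial number of competing causes and the cured fraction can be written out as a short derivation. The parameterisation below (mean \theta, dispersion \eta, latent event-time distribution function F) is one common choice in the cure-rate literature and is assumed here; the abstract does not state which form the paper uses.

```latex
% Assumed setup: N ~ negative binomial with E[N] = \theta and
% Var[N] = \theta + \eta\theta^2; given N, the latent event times are
% i.i.d. with distribution function F(t), and the observed time is their minimum.
\begin{align*}
  S_{\mathrm{pop}}(t)
    &= \mathbb{E}\!\left[\{1 - F(t)\}^{N}\right]
     = \{1 + \eta\,\theta\,F(t)\}^{-1/\eta}, \\
  p_0
    &= \lim_{t \to \infty} S_{\mathrm{pop}}(t)
     = (1 + \eta\,\theta)^{-1/\eta}.
\end{align*}
% As \eta \to 0, S_pop(t) \to \exp\{-\theta F(t)\}, the Poisson
% (promotion time) cure model.
```

Solving the second line for \theta gives \theta = (p_0^{-\eta} - 1)/\eta, which is how a model of this kind can be re-expressed in terms of the cured fraction p_0 before covariates are attached to it.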
Abstract:
In this article, for the first time, we propose the negative binomial-beta Weibull (BW) regression model for studying the recurrence of prostate cancer and to predict the cure fraction for patients with clinically localized prostate cancer treated by open radical prostatectomy. The cure model considers that a fraction of the survivors are cured of the disease. The survival function for the population of patients can be modeled by a parametric cure model using the BW distribution. We derive an explicit expansion for the moments of the recurrence time distribution for the uncured individuals. The proposed distribution can be used to model survival data when the hazard rate function is increasing, decreasing, unimodal or bathtub shaped. Another advantage is that the proposed model includes as special sub-models some of the well-known cure rate models discussed in the literature. We derive the appropriate matrices for assessing local influence on the parameter estimates under different perturbation schemes. We analyze a real data set for localized prostate cancer patients after open radical prostatectomy.
Abstract:
Emrouznejad et al. (2010) proposed a Semi-Oriented Radial Measure (SORM) model for assessing the efficiency of Decision Making Units (DMUs) by Data Envelopment Analysis (DEA) with negative data. This paper provides a necessary and sufficient condition for boundedness of the input and output oriented SORM models.