846 results for generalized linear models


Relevance: 100.00%

Abstract:

Aim: Our aims were to compare the composition of testate amoeba (TA) communities from Santa Cruz Island, Galápagos Archipelago, which likely exist only as a result of anthropogenic habitat transformation, with similar naturally occurring communities from northern and southern continental peatlands, and to assess the importance of niche-based and dispersal-based processes in determining community composition and taxonomic and functional diversity. Location: The humid highlands of the central island of Santa Cruz, Galápagos Archipelago. Methods: We surveyed the alpha, beta and gamma taxonomic and functional diversity of TA, and the changes in functional traits along a gradient from wet to dry habitats. We compared the TA community composition, abundance and frequency recorded in the insular peatlands with those recorded in continental peatlands of the Northern and Southern Hemispheres. We used generalized linear models to determine how environmental conditions influence taxonomic and functional diversity, as well as the mean values of functional traits within communities. Finally, we applied variance partitioning to assess the relative importance of niche- and dispersal-based processes in determining community composition. Results: TA communities on Santa Cruz Island differed from their Northern Hemisphere and South American counterparts, with most genera considered characteristic of Northern Hemisphere and South American Sphagnum peatlands missing or very rare in the Galápagos. Functional traits were most strongly correlated with elevation and site topography, and alpha functional diversity with the type of material sampled and site topography. Community composition was more strongly correlated with spatial variables than with environmental ones. Main conclusions: The TA communities of the Sphagnum peatlands of Santa Cruz Island, and the mechanisms shaping them, contrast with those of Northern Hemisphere and South American peatlands. Soil moisture was not a strong predictor of community composition, most likely because rainfall and clouds provide sufficient moisture. Dispersal limitation was more important than environmental filtering, owing to the isolation of the insular peatlands from continental ones and the young ecological history of these ecosystems.
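For illustration, the variance-partitioning step described above can be sketched as follows: the community response is regressed on environmental and spatial predictor sets separately and jointly, and the fractions are obtained by differencing the R² values. This minimal Python sketch uses synthetic data and plain linear models; the variable names and the reduction of the community matrix to a single response are illustrative assumptions, not the paper's actual analysis.

```python
# Variance partitioning between environmental and spatial predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60
env = rng.normal(size=(n, 2))     # e.g. moisture, elevation (hypothetical)
space = rng.normal(size=(n, 2))   # e.g. spatial eigenvectors (hypothetical)
y = 0.2 * env[:, 0] + 0.8 * space[:, 1] + rng.normal(scale=0.5, size=n)

def r2(X, y):
    return LinearRegression().fit(X, y).score(X, y)

r2_env = r2(env, y)                      # environment alone: [a + b]
r2_spa = r2(space, y)                    # space alone: [b + c]
r2_all = r2(np.hstack([env, space]), y)  # both sets together: [a + b + c]

pure_env = r2_all - r2_spa               # [a]: niche-based signal
pure_spa = r2_all - r2_env               # [c]: dispersal-based signal
shared = r2_env + r2_spa - r2_all        # [b]: jointly explained fraction
print(f"pure env {pure_env:.2f}, pure space {pure_spa:.2f}, shared {shared:.2f}")
```

A dominant pure-space fraction, as in the synthetic data above, is the pattern the authors interpret as evidence for dispersal limitation.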

Relevance: 100.00%

Abstract:

Secchi depth is a measure of water transparency. In the Baltic Sea region, Secchi depth maps are used to assess eutrophication and as input for habitat models. Given their spatial and temporal coverage, satellite data would be the most suitable source for such maps, but the Baltic Sea's optical properties differ so much from those of the open ocean that globally calibrated standard models suffer from large errors. Regional predictive models that take the Baltic Sea's special optical properties into account are therefore needed. This paper tests how accurately generalized linear models (GLMs) and generalized additive models (GAMs) with MODIS/Aqua and auxiliary data as inputs can predict Secchi depth at a regional scale. It uses cross-validation to test the prediction accuracy of hundreds of GAMs and GLMs with up to 5 input variables. A GAM with 3 input variables (chlorophyll a, remote sensing reflectance at 678 nm, and long-term mean salinity) made the most accurate predictions. Tested against field observations not used for model selection and calibration, the best model's mean absolute error (MAE) for daily predictions was 1.07 m (22%), more than 50% lower than that of other publicly available Baltic Sea Secchi depth maps. The MAE for predicting monthly averages was 0.86 m (15%). The proposed model selection process was thus able to find a regional model with good prediction accuracy. It could also be useful for finding predictive models for environmental variables other than Secchi depth, with data from other satellite sensors, and for other regions where non-standard remote sensing models are needed for prediction and mapping. Annual and monthly mean Secchi depth maps for 2003-2012 are provided with this paper as Supplementary material.
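As an illustration of the model-selection procedure, the following Python sketch cross-validates a GAM with three smooth terms and reports its mean absolute error. It relies on the pygam library and entirely synthetic stand-ins for chlorophyll a, reflectance at 678 nm, and salinity; it is not the paper's calibrated model.

```python
# Cross-validated accuracy of one candidate Secchi-depth GAM.
import numpy as np
from pygam import LinearGAM, s
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.lognormal(1.0, 0.5, n),    # chlorophyll a (hypothetical units)
    rng.normal(0.002, 0.0005, n),  # remote sensing reflectance at 678 nm
    rng.normal(7.0, 1.5, n),       # long-term mean salinity
])
secchi = 8.0 - 2.0 * np.log(X[:, 0]) + 0.3 * X[:, 2] + rng.normal(0, 0.8, n)

# 5-fold cross-validation: fit on the training folds, score on the held-out fold.
maes = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    gam = LinearGAM(s(0) + s(1) + s(2)).fit(X[train], secchi[train])
    maes.append(mean_absolute_error(secchi[test], gam.predict(X[test])))
print(f"cross-validated MAE: {np.mean(maes):.2f} m")
```

In the paper's procedure this loop would be repeated over hundreds of candidate variable subsets, keeping the model with the lowest cross-validated error.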

Relevance: 100.00%

Abstract:

The principal risks in the railway industry are mainly associated with collisions, derailments and level crossing accidents. An understanding of the nature of previous accidents on the railway network is required to identify potential causes, develop safety systems and deploy safety procedures. Risk assessment is a process for determining the magnitude of risk to assist with decision-making. We propose a three-step methodology to predict the mean number of fatalities in railway accidents. The first step is to predict the mean number of accidents by analyzing generalized linear models and selecting the one that best fits the available historical data on the basis of goodness-of-fit statistics. The second is to compute the mean number of fatalities per accident, and the third is to estimate the mean number of fatalities. The methodology is illustrated on the Spanish railway system. Statistical models accounting for annual and grouped data for the 1992-2009 period were analyzed. After identifying the models for broad and narrow gauges, we predicted the mean number of accidents and the number of fatalities for the 2010-2018 period.
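The three-step methodology can be sketched in Python with statsmodels: a Poisson GLM predicts the mean number of accidents per year, the mean number of fatalities per accident is estimated from the historical ratio, and the two are multiplied. The counts below are synthetic, and the single log-linear trend is an illustrative assumption (the paper fits and compares several GLMs per gauge).

```python
import numpy as np
import statsmodels.api as sm

years = np.arange(1992, 2010)
accidents = np.array([30, 28, 31, 25, 27, 24, 22, 25, 21, 20,
                      19, 22, 18, 17, 18, 15, 16, 14])   # synthetic
fatalities = np.array([41, 35, 44, 30, 36, 29, 27, 33, 25, 26,
                       22, 28, 21, 20, 23, 17, 19, 16])  # synthetic

# Step 1: mean accidents per year as a log-linear function of time.
X = sm.add_constant(years - years[0])
accident_model = sm.GLM(accidents, X, family=sm.families.Poisson()).fit()

# Step 2: mean fatalities per accident, pooled over the period.
fatal_per_accident = fatalities.sum() / accidents.sum()

# Step 3: mean fatalities = predicted mean accidents x fatalities per accident.
future = sm.add_constant(np.arange(2010, 2019) - years[0])
pred_accidents = accident_model.predict(future)
pred_fatalities = pred_accidents * fatal_per_accident
print(dict(zip(range(2010, 2019), np.round(pred_fatalities, 1))))
```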

Relevance: 100.00%

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-06

Relevance: 100.00%

Abstract:

Thesis (Master's)--University of Washington, 2016-06

Relevance: 100.00%

Abstract:

This paper presents an effective decision-making system for leak detection based on multiple generalized linear models and clustering techniques. The training data for the proposed decision system were obtained from an experimental, fully operational pipeline distribution system equipped with data logging of three variables: inlet pressure, outlet pressure, and outlet flow. The experimental setup was designed so that multiple operating conditions of the distribution system, including multiple pressures and flows, could be obtained. We then tested statistically and showed that the pressure and flow variables can be used as signatures of a leak under the designed multi-operational conditions. We further show that training and testing the proposed multi-model decision system with prior clustering of the data, under multiple operating conditions, produces better recognition rates than training based on a single-model approach. The decision system is then equipped with the estimation of confidence limits, and a method is proposed for using these confidence limits to obtain more robust leakage recognition results.
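A minimal sketch of the "cluster first, then fit one model per cluster" idea, assuming synthetic pressure and flow data: k-means groups the operating conditions, and a logistic GLM per cluster classifies leak versus no-leak. The cluster count, the simulated leak signature, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 600
X = np.column_stack([
    rng.choice([2.0, 3.5], n) + rng.normal(0, 0.1, n),   # inlet pressure
    rng.normal(1.5, 0.2, n),                             # outlet pressure
    rng.choice([10.0, 20.0], n) + rng.normal(0, 0.5, n), # outlet flow
])
leak = (rng.random(n) < 0.3).astype(int)
X[leak == 1, 1] -= 0.3   # a leak depresses outlet pressure (synthetic signature)

# Cluster the operating conditions, then fit one classifier per cluster.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
models = {}
for c in np.unique(clusters):
    idx = clusters == c
    models[c] = LogisticRegression(max_iter=1000).fit(X[idx], leak[idx])
    print(f"cluster {c}: training accuracy {models[c].score(X[idx], leak[idx]):.2f}")
```

At prediction time, a new observation would first be assigned to its nearest cluster and then scored by that cluster's model, which is what lets each GLM specialize to one operating condition.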

Relevance: 100.00%

Abstract:

2002 Mathematics Subject Classification: 62M10.

Relevance: 100.00%

Abstract:

2000 Mathematics Subject Classification: 62P10, 62J12.

Relevance: 100.00%

Abstract:

This paper explains how Poisson regression can be used in studies in which the dependent variable describes the number of occurrences of some rare event such as suicide. After pointing out why ordinary linear regression is inappropriate for treating dependent variables of this sort, we go on to present the basic Poisson regression model and show how it fits in the broad class of generalized linear models. Then we turn to discussing a major problem of Poisson regression known as overdispersion and suggest possible solutions, including the correction of standard errors and negative binomial regression. The paper ends with a detailed empirical example, drawn from our own research on suicide.
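A minimal sketch of this workflow in Python with statsmodels, using synthetic overdispersed counts: fit a Poisson GLM, check the Pearson dispersion statistic, and fall back to negative binomial regression if it indicates overdispersion. The covariate and the 1.5 cutoff are illustrative choices, not part of the paper.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
unemployment = rng.normal(8.0, 2.0, n)               # hypothetical covariate
mu = np.exp(0.1 + 0.15 * unemployment)
counts = rng.negative_binomial(n=2, p=2 / (2 + mu))  # overdispersed counts with mean mu

X = sm.add_constant(unemployment)
poisson = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

# Rough overdispersion diagnostic: Pearson chi-square / residual df
# should be near 1 if the Poisson variance assumption holds.
dispersion = poisson.pearson_chi2 / poisson.df_resid
print(f"dispersion estimate: {dispersion:.2f}")

if dispersion > 1.5:  # arbitrary illustrative cutoff
    negbin = sm.GLM(counts, X, family=sm.families.NegativeBinomial()).fit()
    print(negbin.params)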

Relevance: 100.00%

Abstract:

Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size even as the value of n seen in many fields has grown enormously. The tendency in some areas of industry to dispense with traditional statistical analysis on the grounds that "n = all" is thus of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.

Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.

One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
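To make the tensor-factorization view concrete, the following sketch fits a plain latent class model, i.e. a rank-k nonnegative PARAFAC of the probability mass function, to synthetic three-way categorical data by EM. It illustrates the baseline the chapter builds on, not the proposed collapsed Tucker decomposition itself; all sizes and seeds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n = (4, 3, 5), 2, 2000   # categories per variable, latent classes, sample size

# Synthetic data from a true two-class model:
# P(x1, x2, x3) = sum_h nu_h * prod_j psi_j[h, x_j].
true_nu = np.array([0.6, 0.4])
true_psi = [rng.dirichlet(np.ones(dj), size=k) for dj in d]
z = rng.choice(k, size=n, p=true_nu)
data = np.column_stack(
    [[rng.choice(dj, p=psi[zi]) for zi in z] for dj, psi in zip(d, true_psi)]
)

# EM for the latent class model.
nu = np.full(k, 1.0 / k)
psi = [rng.dirichlet(np.ones(dj), size=k) for dj in d]
for _ in range(200):
    # E-step: class responsibilities from current parameters.
    logp = np.log(nu) + sum(np.log(psi[j][:, data[:, j]]).T for j in range(len(d)))
    resp = np.exp(logp - logp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: reweighted class proportions and category probabilities.
    nu = resp.mean(axis=0)
    for j in range(len(d)):
        counts = np.array(
            [np.bincount(data[:, j], weights=resp[:, h], minlength=d[j])
             for h in range(k)]
        )
        psi[j] = counts / counts.sum(axis=1, keepdims=True)

print("estimated class weights:", np.round(np.sort(nu), 2))
```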

Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and give a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and in other common population structure inference problems is assessed in simulations and a real data application.

In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed-form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
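As a rough illustration of Gaussian posterior approximation for a count-data GLM, the sketch below computes a Laplace approximation (posterior mode plus inverse-Hessian covariance) for a Poisson log-linear model with Gaussian priors. This is a simpler stand-in for, not the same as, the KL-optimal Gaussian approximation derived in Chapter 4; the data, prior variance, and the use of BFGS's Hessian approximation are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = rng.poisson(np.exp(X @ np.array([0.5, -0.3, 0.2])))  # synthetic counts

def neg_log_post(beta, tau2=10.0):
    # Poisson log-likelihood plus an independent N(0, tau2) prior on each beta.
    eta = X @ beta
    return -(y @ eta - np.exp(eta).sum()) + beta @ beta / (2 * tau2)

fit = minimize(neg_log_post, np.zeros(3), method="BFGS")
mean = fit.x
# Posterior covariance ~ inverse Hessian at the mode; BFGS supplies an
# approximation, and a numerically computed Hessian could be used instead.
cov = fit.hess_inv
print("approximate posterior mean:", np.round(mean, 2))
print("approximate posterior sd:", np.round(np.sqrt(np.diag(cov)), 2))
```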

Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.

The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel, but comparatively little attention has been paid to convergence and estimation error in the resulting approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
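One family of approximations this framework covers replaces the full-data log-likelihood with an estimate from a random subset scaled by N/m. The sketch below implements such an approximate Metropolis-Hastings kernel for a toy Gaussian mean problem; the subset size, proposal scale, and flat prior are illustrative choices, and the resulting chain is approximate in precisely the sense the chapter studies.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)   # N(theta, 1) model
N, m = data.size, 1_000

def approx_loglik(theta):
    # Unbiased-in-expectation estimate of the log-likelihood (constants dropped)
    # from a random subset of m observations, scaled up by N/m.
    subset = rng.choice(data, size=m, replace=False)
    return -(N / m) * 0.5 * np.sum((subset - theta) ** 2)

theta, samples = 0.0, []
ll = approx_loglik(theta)
for _ in range(5_000):
    prop = theta + rng.normal(scale=0.05)
    ll_prop = approx_loglik(prop)
    if np.log(rng.random()) < ll_prop - ll:   # flat prior, so ratio of likelihoods
        theta, ll = prop, ll_prop
    samples.append(theta)
print(f"posterior mean estimate: {np.mean(samples[1000:]):.3f}")
```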

Data augmentation Gibbs samplers are arguably the most popular class of algorithms for approximately sampling from the posterior distribution of the parameters of generalized linear models. The truncated normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size, up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence-chain Metropolis algorithm show good mixing on the same dataset.
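For concreteness, this is the truncated normal (Albert-Chib) data augmentation sampler for an intercept-only probit model, run in the rare-event regime the chapter studies (many trials, few successes). The sample sizes, iteration count, and flat prior are illustrative; the slow movement of the beta chain in this regime is the behaviour the spectral-gap result describes.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(7)
n, successes = 10_000, 20            # rare-event regime: 20 successes in 10,000
y = np.zeros(n)
y[:successes] = 1

beta, draws = 0.0, []
for _ in range(1_000):
    # Augmentation step: z_i ~ N(beta, 1) truncated to (0, inf) if y_i = 1,
    # and to (-inf, 0] otherwise. truncnorm takes standardized bounds.
    z = np.where(
        y == 1,
        truncnorm.rvs(-beta, np.inf, loc=beta, size=n, random_state=rng),
        truncnorm.rvs(-np.inf, -beta, loc=beta, size=n, random_state=rng),
    )
    # Conjugate update for beta under a flat prior: N(mean(z), 1/n).
    beta = rng.normal(z.mean(), 1 / np.sqrt(n))
    draws.append(beta)
print(f"posterior mean of beta: {np.mean(draws[200:]):.3f}")
```

Plotting the trace of `draws` in this regime shows the high autocorrelation described above: with so few successes, each update moves beta only a tiny amount.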

Relevance: 100.00%

Abstract:

Objective: To compare the one-year effect of two Mediterranean diet (MeDiet) interventions on dietary glycemic load (GL) and glycemic index (GI) in the PREDIMED trial. Methods: Participants were older subjects at high risk for cardiovascular disease; this analysis included 2866 nondiabetic subjects. Diet was assessed with a validated 137-item food frequency questionnaire (FFQ). The GI of each FFQ item was assigned by a 5-step methodology using the International Tables of GI and GL Values. Generalized linear models were fitted to assess the relationship between intervention group and dietary GL and GI at one year of follow-up, using the control group as reference.
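A minimal sketch of this kind of analysis, assuming hypothetical column names and synthetic data: a Gaussian GLM of one-year glycemic load on intervention group, with the control group as the reference level and baseline GL as a covariate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 300
df = pd.DataFrame({
    "group": rng.choice(["control", "meddiet_oil", "meddiet_nuts"], n),
    "gl_baseline": rng.normal(120, 20, n),
})
# Synthetic intervention effects on one-year glycemic load.
effect = df["group"].map({"control": 0, "meddiet_oil": -8, "meddiet_nuts": -6})
df["gl_year1"] = df["gl_baseline"] + effect + rng.normal(0, 10, n)

# Gaussian GLM with the control group as the reference category.
model = smf.glm(
    "gl_year1 ~ C(group, Treatment(reference='control')) + gl_baseline",
    data=df,
).fit()
print(model.summary().tables[1])
```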

Relevance: 100.00%

Abstract:

Endogenous and environmental variables are fundamental in explaining variations in fish condition. Based on more than 20 yr of fish weight and length data, relative condition indices were computed for anchovy and sardine caught in the Gulf of Lions. Classification and regression trees (CART) were used to identify endogenous factors affecting fish condition and to group years of similar condition. Both species showed a similar annual cycle, with condition being minimal in February and maximal in July. CART identified 3 groups of years in which the fish populations generally showed poor, average and good condition, and within which condition differed between age classes but not between sexes. In particular, during the period of poor condition (mostly recent years), sardines older than 1 yr appeared to be more strongly affected than younger individuals. The time series were analyzed using generalized linear models (GLMs) to examine the effects of abiotic oceanographic factors (temperature, the Western Mediterranean Oscillation [WeMO] and Rhone outflow) and biotic factors (chlorophyll a and 6 plankton classes) on fish condition. The selected models explained 48 and 35% of the variance in anchovy and sardine condition, respectively. Sardine condition was negatively related to temperature but positively related to the WeMO and to mesozooplankton and diatom concentrations. A positive effect of mesozooplankton and Rhone runoff on anchovy condition was detected. These results highlight the importance of increasing temperatures and reduced water mixing in the NW Mediterranean Sea, which affect planktonic productivity and thus fish condition through bottom-up control processes. Changes in plankton quality, quantity and phenology could lead to insufficient or inadequate food supply for both species.
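A minimal sketch of the two-stage analysis, with synthetic stand-ins for all variables: a regression tree (CART) partitions condition by an endogenous factor (age here, where the paper also groups years), and a Gaussian GLM then relates condition to environmental covariates.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(9)
n = 240   # e.g. 20 years of monthly observations (synthetic)
age = rng.integers(0, 4, n)
temperature = rng.normal(15, 2, n)
mesozoo = rng.lognormal(0, 0.5, n)
condition = (1.0 + 0.05 * (age > 0) - 0.02 * temperature
             + 0.05 * np.log(mesozoo) + rng.normal(0, 0.05, n))

# Stage 1: CART on an endogenous factor, capped at a few leaves.
tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(age.reshape(-1, 1), condition)
print("condition groups:", np.unique(tree.predict(age.reshape(-1, 1))).round(3))

# Stage 2: GLM of condition on environmental covariates.
X = sm.add_constant(np.column_stack([temperature, np.log(mesozoo)]))
glm = sm.GLM(condition, X, family=sm.families.Gaussian()).fit()
print(glm.params.round(3))
```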

Relevance: 100.00%

Abstract:

Breast milk is regarded as an ideal source of nutrients for the growth and development of neonates, but it can also be a potential source of pollutants, as mothers can be exposed to different contaminants through their lifestyle and environmental pollution. Mercury (Hg) and arsenic (As) can adversely affect the development of the fetal and neonatal nervous system. Some fish and shellfish are rich in selenium (Se), an essential trace element that forms part of several enzymes related to detoxification, including glutathione S-transferase (GST). The goal of this study was to determine the interaction between Hg, As and Se and to analyze its effect on GST activity in breast milk. Milk samples were collected from women between days 7 and 10 postpartum. GST activity was determined spectrophotometrically; total Hg, As and Se concentrations were measured by atomic absorption spectrometry. Generalized linear models (GLMs) were constructed to explain the possible association of Hg, As and Se concentrations with GST activity in breast milk. The model explained 44% of the GST activity measured in breast milk and suggests that GST activity was positively correlated with Hg, As and Se concentrations. Enzyme activity was also explained by the frequency of consumption of marine fish and shellfish in the diet of the breastfeeding women.

Relevance: 100.00%

Abstract:

Several recent offsite recreational fishing surveys have used public landline telephone directories as a sampling frame. The sampling biases inherent in this method are recognised, but are assumed to be corrected through demographic data expansion. However, the rising prevalence of mobile-only households has potentially increased these biases by skewing raw samples towards households that maintain relatively high levels of coverage in telephone directories. For biases to be corrected through demographic expansion, both the fishing participation rate and fishing activity must be similar among listed and unlisted fishers within each demographic group. In this study, we tested for a difference in the fishing activity of listed and unlisted fishers within demographic groups by comparing their avidity (number of fishing trips per year), as well as the platform used (boat or shore) and the species targeted on their most recent fishing trip. A total of 3062 recreational fishers were interviewed at 34 tackle stores across 12 residential regions of Queensland, Australia. For each fisher, the data collected included their fishing avidity, the platform used and species targeted on their most recent trip, their gender, age and residential region, and whether their household had a listed telephone number. Although the most avid fishers were younger and less likely to have a listed phone number, cumulative link models revealed that avidity was not affected by an interaction of phone listing status, age group and residential region (p > 0.05). Likewise, binomial generalized linear models revealed no interaction between phone listing, age group and avidity acting on platform (p > 0.05), and platform was not affected by an interaction of phone listing status, age group and residential region (p > 0.05). Ordination of target species using Bray-Curtis dissimilarity indices found a statistically significant but negligible difference (i.e. small effect size) between listed and unlisted fishers (ANOSIM R < 0.05, p < 0.05). These results suggest that, at this time, the fishing activity of listed and unlisted fishers in Queensland is similar within demographic groups. Future research seeking to validate the assumptions of recreational fishing telephone surveys should investigate the fishing participation rates of listed and unlisted fishers within demographic groups.

Relevance: 100.00%

Abstract:

Undoubtedly, statistics has become one of the most important subjects in the modern world, where its applications are ubiquitous. Its importance is not limited to statisticians, but extends to non-statisticians who have to use statistics within their own disciplines. Several studies have indicated that most academic departments around the world have realized the importance of statistics to non-specialist students, and the number of students enrolled in statistics courses, coming from a variety of disciplines, has consequently increased greatly. Research in statistics education has developed accordingly over the last few years. One important issue is how statistics is best taught to, and learned by, non-specialist students. This issue is governed by several factors, such as the use of technology, the role of the English language (especially for those whose first language is not English), the effectiveness of statistics teachers and their approach to teaching statistics courses, students' motivation to learn statistics, and the relevance of statistics courses to the main subjects of non-specialist students. Studies focused on aspects of learning and teaching statistics have been conducted in many countries, particularly Western ones. The situation in Arab countries, especially Saudi Arabia, is different: there is very little research in this area, and what exists does not meet those countries' needs for developing the learning and teaching of statistics to non-specialist students. This research was instituted in order to develop the field of statistics education. The purpose of this mixed-methods study was to generate new insights into this subject by investigating how statistics courses are currently taught to non-specialist students in Saudi universities; it thereby contributes towards filling the knowledge gap that exists in Saudi Arabia. The study used multiple data collection approaches, including questionnaire surveys of 1053 non-specialist students who had completed at least one statistics course in different colleges of the universities in Saudi Arabia. These surveys were followed up with qualitative data collected via semi-structured interviews with 16 teachers of statistics from colleges within all six universities where statistics is taught to non-specialist students in Saudi Arabia's Eastern Region. The questionnaire data were of several types, so different analysis techniques were used. Descriptive statistics were used to identify the demographic characteristics of the participants, and the chi-square test was used to determine associations between variables. Based on the main issues raised in the literature review, the item scales were grouped into five key groups of questions: 1) effectiveness of teachers; 2) English language; 3) relevance of course; 4) student engagement; 5) using technology. Exploratory data analysis was used to explore these issues in more detail. Furthermore, given the clustering in the data (students within departments, within colleges, within universities), multilevel generalized linear models for dichotomous outcomes were used to account for the effects of clustering at those levels.
Factor analysis was conducted, confirming the dimension reduction of the item scales. The data from the teachers' interviews were analysed on an individual basis, with responses assigned to one of eight themes that emerged from the data: 1) the lack of students' motivation to learn statistics; 2) students' participation; 3) students' assessment; 4) the effective use of technology; 5) the level of previous mathematical and statistical skills of non-specialist students; 6) the English language ability of non-specialist students; 7) the need for extra time for teaching and learning statistics; and 8) the role of administrators. All the data from students and teachers indicated that the learning and teaching of statistics to non-specialist students in Saudi universities needs to be improved in order to meet those students' needs. The findings suggested a weakness in the use of statistical software in these courses: there is a lack of application of technology, such as statistical software programs, that would allow non-specialist students to consolidate their knowledge. The results also indicated that the English language is considered one of the main challenges in learning and teaching statistics, particularly in institutions where English is not the main language of instruction, and that students' weak mathematical skills are another major challenge. Additionally, the results indicated a need to tailor statistics courses to the needs of non-specialist students based on their main subjects, and that statistics teachers need to choose appropriate methods when teaching statistics courses.