990 results for Modelos log-linear


Relevance:

80.00%

Publisher:

Abstract:

Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, even though uncertainty quantification remains essential in the sciences, where the number of parameters to estimate often exceeds the sample size despite huge increases in the typical value of n. The tendency in some areas of industry to dispense with traditional statistical analysis on the grounds that "n = all" is thus of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and it is the primary motivation for the work presented here.

Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is the design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference beyond the standard problem of multivariate continuous data that has been the major focus of previous theoretical work in this area. In the second, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms and for characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.

One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
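
For orientation, the latent structure (PARAFAC) representation referred to here expresses the joint probability mass function as a finite mixture; in generic notation (mine, not necessarily the dissertation's):

    P(y_1 = c_1, \dots, y_p = c_p) \;=\; \sum_{h=1}^{k} \nu_h \prod_{j=1}^{p} \lambda^{(j)}_{h c_j},
    \qquad \nu_h \ge 0, \quad \sum_{h=1}^{k} \nu_h = 1,

where each \lambda^{(j)}_{h \cdot} is a probability vector over the levels of variable j within latent class h. The smallest k for which such a representation exists is the nonnegative rank of the probability tensor, which is the quantity the chapter relates to the support of a log-linear model.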

Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and we provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations, and on other common population structure inference problems, is assessed in simulations and a real data application.

In contingency table analysis, sparse data are frequently encountered even for modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed-form credible regions, complicating posterior inference. In Chapter 4, we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and we provide convergence rates and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically, in simulations and a real data application, that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
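
As a sketch of the object being approximated (notation mine): over the family \mathcal{G} of Gaussian distributions, the optimal Gaussian approximation q^* to the posterior \pi(\theta \mid y) in the direction of divergence studied here is

    q^* \;=\; \arg\min_{q \in \mathcal{G}} \, \mathrm{KL}\big( \pi(\cdot \mid y) \,\|\, q \big),

and for this direction of the Kullback-Leibler divergence the minimizer is the Gaussian whose mean and covariance match the posterior mean and covariance.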

Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that both the strength of tail dependence and its temporal structure are encoded in the waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence, obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, we propose a new definition of tail dependence as a function of the distribution of waiting times between threshold exceedances, and we construct an inferential framework for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
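
For reference, de Haan's spectral representation in the form used by Smith's model writes a max-stable process (this is the standard construction, not the thesis's extension) as

    Z(s) \;=\; \max_{i \ge 1} \; \xi_i \, f(s - u_i),

where \{(\xi_i, u_i)\} are the points of a Poisson process on (0, \infty) \times \mathbb{R}^d with intensity \xi^{-2} \, d\xi \, du, and f is a probability density (a Gaussian density in Smith's model). The construction described above endows the support points u_i with velocities and lifetimes to obtain the time-indexed velocity processes.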

The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel, but comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
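
To make the setting concrete, here is a minimal Python sketch (mine, not the dissertation's) of the kind of approximate transition kernel analyzed in Chapter 6: a random-walk Metropolis step whose acceptance ratio is computed on a random subset of the data.

    import numpy as np

    rng = np.random.default_rng(0)

    def subsampled_mh_step(theta, data, log_prior, log_lik_one, m, step=0.5):
        # One Metropolis step in which the full-data log-likelihood is
        # replaced by (n/m) times the log-likelihood of a random subsample.
        # This perturbs the exact kernel; Chapter 6-style results trade the
        # size of that perturbation against computational cost.
        n = len(data)
        idx = rng.choice(n, size=m, replace=False)
        prop = theta + step * rng.standard_normal()
        def approx_log_post(t):
            return log_prior(t) + (n / m) * sum(log_lik_one(t, data[i]) for i in idx)
        log_alpha = approx_log_post(prop) - approx_log_post(theta)
        return prop if np.log(rng.uniform()) < log_alpha else theta

    # Toy usage: posterior for a Gaussian mean under a standard normal prior.
    data = rng.standard_normal(10_000) + 2.0
    log_prior = lambda t: -0.5 * t ** 2
    log_lik_one = lambda t, x: -0.5 * (x - t) ** 2
    theta = 0.0
    for _ in range(1_000):
        theta = subsampled_mh_step(theta, data, log_prior, log_lik_one, m=500)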

Data augmentation Gibbs samplers are arguably the most popular class of algorithms for approximately sampling from the posterior distribution of the parameters of generalized linear models. The truncated normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size, up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence-chain Metropolis algorithm show good mixing on the same dataset.
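
As one concrete instance, here is a minimal sketch of the truncated normal (Albert-Chib) data augmentation sampler for probit regression, assuming a flat prior on beta; the implementation details are mine, not Chapter 7's.

    import numpy as np
    from scipy.stats import truncnorm

    def probit_da_sampler(X, y, n_iter=2000, seed=0):
        # Albert-Chib data augmentation for probit regression with a flat
        # prior: draw latent z | beta from truncated normals, then
        # beta | z from a Gaussian. With large n and few successes (rare
        # events), successive beta draws become highly autocorrelated,
        # which is the slow mixing analyzed in Chapter 7.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        Sigma = np.linalg.inv(X.T @ X)      # posterior covariance of beta | z
        chol = np.linalg.cholesky(Sigma)
        beta = np.zeros(p)
        draws = np.empty((n_iter, p))
        for it in range(n_iter):
            mu = X @ beta
            # z_i ~ N(mu_i, 1) truncated to (0, inf) if y_i = 1, else (-inf, 0)
            lo = np.where(y == 1, -mu, -np.inf)
            hi = np.where(y == 1, np.inf, -mu)
            z = mu + truncnorm.rvs(lo, hi, random_state=rng)
            beta = Sigma @ (X.T @ z) + chol @ rng.standard_normal(p)
            draws[it] = beta
        return draws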

Relevance:

80.00%

Publisher:

Abstract:

Parking is often underpriced, and expanding its capacity is expensive; universities need a better way of reducing congestion than building costly parking garages. Demand-based pricing mechanisms, such as auctions, offer a possible solution by promising to reduce parking at peak times. However, faculty, students, and staff at universities have systematically different parking needs, leading to different parking valuations. In this study, I determine the impact university affiliation has on predicting bid values cast in three Dutch auctions of on-campus parking permits sold at Chapman University in Fall 2010. Using clustering techniques cross-checked with university demographic information to detect affiliation groups, I ran a log-linear regression and found that university affiliation had a larger effect on bid amount than either lot location or the fraction of auction duration. Generally, faculty were predicted to place higher bids, whereas students were predicted to place lower bids.
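
A minimal sketch of the kind of log-linear (log-transformed outcome) regression described, with hypothetical names throughout (parking_bids.csv, bid, affiliation, lot, and auction_frac are all illustrative, not from the study):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: one row per bid, with the bidder's affiliation,
    # the lot being auctioned, and the fraction of the auction elapsed.
    bids = pd.read_csv("parking_bids.csv")
    bids["log_bid"] = np.log(bids["bid"])

    # Log-linear regression: coefficients are interpretable as approximate
    # percentage effects on the bid amount.
    fit = smf.ols("log_bid ~ C(affiliation) + C(lot) + auction_frac",
                  data=bids).fit()
    print(fit.summary())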

Relevance:

80.00%

Publisher:

Abstract:

It is crucial to understand the role that labor market positions may play in creating gender differences in work-life balance. One theoretical approach to understanding this relationship is spillover theory, which argues that an individual's life domains are integrated, meaning that well-being can be transmitted between life domains. Based on data collected in Hungary in 2014, this paper shows that work-to-family spillover does not affect both genders in the same way: the effect of work on family life tends to be more negative for women than for men. Two explanations are formulated to account for this gender inequality. According to the findings of the analysis, gender is conditionally independent of spillover once financial status and flexibility of work are also incorporated into the analysis. This means that the relative disadvantage of women in terms of spillover can be attributed to their lower financial status and their relatively poor access to flexible jobs. In other words, gender inequalities in work-to-family spillover are deeply affected by individual labor market positions. Examining the labor market's effect on work-life balance is especially important in Hungary, which has one of the least flexible labor arrangements in Europe. A marginal log-linear model, a method for categorical multivariate analysis, is applied in this analysis.

Relevance:

80.00%

Publisher:

Abstract:

Excess nutrient loads carried by streams and rivers are a great concern for environmental resource managers. In agricultural regions, excess loads are transported downstream to receiving water bodies, where they can cause algal blooms and, in turn, numerous ecological problems. To better understand nutrient load transport, and to develop appropriate water management plans, it is important to have accurate estimates of annual nutrient loads. This study used a Monte Carlo sub-sampling method and error-corrected statistical models to estimate annual nitrate-N loads from two watersheds in central Illinois. The performance of three load estimation methods (the seven-parameter log-linear model, the ratio estimator, and the flow-weighted averaging estimator) applied at one-, two-, four-, six-, and eight-week sampling frequencies was compared. Five error correction techniques (the existing composite method and four new techniques developed in this study) were applied to each combination of sampling frequency and load estimation method. On average, the most accurate error correction technique (proportional rectangular) produced load estimates 15% and 30% more accurate than those of the most accurate uncorrected load estimation method (the ratio estimator) for the two watersheds. Using error correction methods, it is possible to design more cost-effective monitoring plans by achieving the same load estimation accuracy with fewer observations. Finally, the optimum combinations of monitoring threshold and sampling frequency that minimize the number of samples required to achieve specified levels of accuracy in load estimation were determined. For one- to three-week sampling frequencies, combined threshold/fixed-interval monitoring approaches produced the best outcomes, while fixed-interval-only approaches produced the most accurate results for four- to eight-week sampling frequencies.
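
As an illustration of one of the methods compared, here is a minimal sketch of the simple ratio estimator of annual load (function name and notation mine; the Beale bias correction often applied in practice is omitted):

    import numpy as np

    def ratio_estimator_annual_load(sampled_conc, sampled_flow, daily_flow):
        # Load on a sampled day is proportional to concentration x flow
        # (the unit-conversion constant is omitted here). The ratio
        # estimator scales mean sampled load by total annual flow over
        # mean sampled flow.
        sampled_conc = np.asarray(sampled_conc, dtype=float)
        sampled_flow = np.asarray(sampled_flow, dtype=float)
        daily_flow = np.asarray(daily_flow, dtype=float)
        sampled_load = sampled_conc * sampled_flow
        load_per_unit_flow = sampled_load.mean() / sampled_flow.mean()
        return load_per_unit_flow * daily_flow.sum()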

Relevance:

80.00%

Publisher:

Abstract:

The long-term adverse effects of air pollution exposure on health can be estimated using either cohort or spatio-temporal ecological designs. In a cohort study, the health status of a cohort of people is assessed periodically over a number of years and then related to estimated ambient pollution concentrations in the cities in which they live. However, such cohort studies are expensive and time-consuming to implement, due to the long-term follow-up required for the cohort. Therefore, spatio-temporal ecological studies are also used to estimate the long-term health effects of air pollution, as they are easy to implement owing to the routine availability of the required data. Spatio-temporal ecological studies estimate the health impact of air pollution by utilising geographical and temporal contrasts in air pollution and disease risk across n contiguous small areas, such as census tracts or electoral wards, for multiple time periods. The disease data are counts of the numbers of disease cases occurring in each areal unit and time period, and thus Poisson log-linear models are typically used for the analysis. The linear predictor includes pollutant concentrations and known confounders such as socio-economic deprivation. However, as the disease data typically contain residual spatial or spatio-temporal autocorrelation after the covariate effects have been accounted for, these known covariates are augmented by a set of random effects. One key problem in these studies is obtaining spatially representative pollution concentrations in each areal unit; these are typically estimated by applying kriging to data from a sparse monitoring network, or by computing averages over modelled (grid-level) concentrations from an atmospheric dispersion model.

The aim of this thesis is to investigate the health effects of long-term exposure to nitrogen dioxide (NO2) and particulate matter (PM10) in mainland Scotland, UK. To give an initial impression of the air pollution health effects in mainland Scotland, Chapter 3 presents a standard epidemiological study using a benchmark method. The remaining main chapters (4, 5, and 6) address the methodological focus of this thesis, which is threefold: (i) how to better estimate pollution, by developing a multivariate spatio-temporal fusion model that relates monitored and modelled pollution data over space, time, and pollutant; (ii) how to simultaneously estimate the joint effects of multiple pollutants; and (iii) how to allow for the uncertainty in the estimated pollution concentrations when estimating their health effects. Specifically, Chapters 4 and 5 address (i), while Chapter 6 focuses on (ii) and (iii).

In Chapter 4, I propose an integrated model for estimating the long-term health effects of NO2 that fuses modelled and measured pollution data to provide improved predictions of areal-level pollution concentrations, and hence of health effects. The proposed fusion model is a Bayesian space-time linear regression model relating the measured concentrations to the modelled concentrations for a single pollutant, whilst allowing for additional covariate information such as site type (e.g. roadside, rural) and temperature. However, some pollutants may be correlated because they are generated by common processes or driven by similar factors such as meteorology, and this correlation can help to predict one pollutant by borrowing strength from the others. Therefore, in Chapter 5, I propose a multi-pollutant model: a multivariate spatio-temporal fusion model that extends the single-pollutant model of Chapter 4 by relating monitored and modelled pollution data over space, time, and pollutant to predict pollution across mainland Scotland.

Because the air we breathe contains a complex mixture of particle- and gas-phase pollutants, and we are therefore exposed to multiple pollutants simultaneously, Chapter 6 investigates the health effects of exposure to multiple pollutants, a natural extension of the single-pollutant health effects of Chapter 4. Given that NO2 and PM10 are highly correlated in my data (a multicollinearity issue), I first propose a temporally-varying linear model to regress one pollutant (e.g. NO2) against another (e.g. PM10), and then include the residuals in the disease model along with PM10, thus investigating the health effects of exposure to both pollutants simultaneously. Chapter 6 also allows for the uncertainty in the estimated pollution concentrations when estimating their health effects; in total, four approaches are developed to adjust for exposure uncertainty. Finally, Chapter 7 summarises the work contained within this thesis and discusses the implications for future research.
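
In generic notation (mine), the Poisson log-linear disease model described above takes the form

    Y_{it} \sim \text{Poisson}(E_{it}\, \theta_{it}), \qquad
    \log(\theta_{it}) = \mathbf{x}_{it}^{\top} \boldsymbol{\beta} + \beta_w w_{it} + \phi_{it},

where Y_{it} and E_{it} are the observed and expected disease counts in areal unit i and period t, x_{it} are the known confounders, w_{it} is the estimated pollutant concentration, and \phi_{it} are spatio-temporally autocorrelated random effects.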

Relevance:

80.00%

Publisher:

Abstract:

We propose weakly-constrained stream and block codes with tunable pattern-dependent statistics and demonstrate that the block code capacity at large block sizes is close to the prediction obtained from a simple Markov model published earlier. We demonstrate the feasibility of the code by presenting original encoding and decoding algorithms with complexity log-linear in the block size and with modest table memory requirements. We also show that when such codes are used for mitigation of patterning effects in optical fibre communications, a gain of about 0.5 dB is possible under realistic conditions, at the expense of a small redundancy (≈10%). © 2010 IEEE

Relevance:

50.00%

Publisher:

Abstract:

Effectiveness in sport refers to the impact achieved by an action carried out under usual conditions; it is present in the execution of any physical activity, refers to the capacity to produce the desired effect, and is related to efficacy, understood as the effect of an action carried out under the best possible conditions, whose objective is to achieve the goal or secure the win. The objective of this work was to identify the relationship between the court zone and the type of stroke from which the tennis player shows greater and lesser effectiveness in play. To this end, a tennis player was observed during 12 training sessions against an opponent of equivalent level, according to the ATP, during the 2012-2013 season, recording his position on the court and the type of stroke for every successful return, understood as winning the point or recovering the serve. Three categorical criteria were created, constituting an observation instrument for recording the player's play in the horizontal and vertical zones of the court, as well as the type of stroke played, in terms of drive, backhand, smash, and drop shot. Using the log-linear regression technique, the results indicate that the player shows lower effectiveness on strokes played from the left side, and greater effectiveness on the drive and backhand executed from mid-court or the baseline on the right side. The interpretation of the results provides information on the court locations and strokes associated with his greater and lesser effectiveness.

Relevance:

40.00%

Publisher:

Abstract:

We considered the fitting of nonlinear regression equations and the likelihood ratio test, with approximations via the chi-square and F statistics, to test hypotheses of equality of any subset of parameters and of identity of the models, for data with replications from an experiment in a randomized complete block design. We concluded that both approximations can be used, but the F-statistic approximation should be preferred, especially for small samples.
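
For reference, the two approximations mentioned take the following generic forms (notation mine). With SSE_0 and SSE the residual sums of squares of the reduced and full nonlinear models, n observations, p parameters in the full model, and \nu constrained parameters, under normal errors

    -2 \ln \Lambda = n \ln\!\left(\frac{SSE_0}{SSE}\right) \;\overset{\cdot}{\sim}\; \chi^2_{\nu},
    \qquad
    F = \frac{(SSE_0 - SSE)/\nu}{SSE/(n - p)} \;\overset{\cdot}{\sim}\; F_{\nu,\, n-p},

where the F form tends to be better calibrated in small samples, consistent with the conclusion above.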

Relevance:

40.00%

Publisher:

Abstract:

Dissertation for obtaining the degree of Master in Civil Engineering - Structures profile

Relevance:

40.00%

Publisher:

Abstract:

The objective of this work was to compare estimates of genetic parameters obtained from single-trait and two-trait Bayesian analyses, under linear and threshold animal models, considering categorical morphological traits of Nelore cattle. Data on muscling, physical structure, and conformation were collected between 2000 and 2005 on 3,864 animals from 13 farms participating in the Nelore Brasil Program. Single- and two-trait Bayesian analyses were carried out under threshold and linear models. In general, both the threshold and linear models were efficient in estimating genetic parameters for visual scores in single-trait Bayesian analyses. In the two-trait analyses, it was observed that, when continuous and categorical data were used, the threshold model yielded genetic correlation estimates of greater magnitude than those of the linear model, whereas with categorical data the heritability estimates were similar. The advantage of the linear model was the shorter time required to run the analyses. In the genetic evaluation of animals for visual scores, the use of the threshold or linear model did not affect the ranking of animals by predicted breeding values, indicating that both models can be used in breeding programs.

Relevance:

40.00%

Publisher:

Abstract:

Genetic correlations between visual scores and reproductive traits in Nelore cattle were estimated using Bayesian statistics under a linear-threshold animal model. The categorical morphological traits studied were visually assessed at eight, 15, and 22 months of age, together with the continuous traits of scrotal circumference standardized at 365 and 450 days of age and age at first calving. The genetic correlation estimates were favorable to selection and of moderate magnitude, suggesting that selecting animals for a desirable biotype can lead to animals with greater fertility and sexual precocity. The genetic correlation estimates of scrotal circumference standardized at 450 days, and of age at first calving, with the morphological traits assessed at 22 months of age were higher than those obtained with the visual score traits assessed at eight and 15 months of age. The use of visual scores as a selection criterion will also bring genetic progress for reproductive traits.

Relevance:

40.00%

Publisher:

Abstract:

The objective of this work was to develop nonlinear programming models for land grading that are applicable to areas with a regular shape and that minimize earth movement, using the GAMS software for the computation. These models were compared with the Generalized Least Squares Method developed by Scaloppi & Willardson (1986), with the volume of earth moved as the evaluation parameter. It was concluded that both nonlinear programming models developed in this research proved suitable for application to regular areas and yielded smaller earth movement volumes than the least squares method.

Relevance:

40.00%

Publisher:

Abstract:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)