979 resultados para Multivariate generalized t -distribution
Resumo:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size, despite huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
Latent class models for the joint distribution of multivariate categorical, such as the PARAFAC decomposition, data play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo. The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov Chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
Resumo:
Mixtures of Zellner's g-priors have been studied extensively in linear models and have been shown to have numerous desirable properties for Bayesian variable selection and model averaging. Several extensions of g-priors to Generalized Linear Models (GLMs) have been proposed in the literature; however, the choice of prior distribution of g and resulting properties for inference have received considerably less attention. In this paper, we extend mixtures of g-priors to GLMs by assigning the truncated Compound Confluent Hypergeometric (tCCH) distribution to 1/(1+g) and illustrate how this prior distribution encompasses several special cases of mixtures of g-priors in the literature, such as the Hyper-g, truncated Gamma, Beta-prime, and the Robust prior. Under an integrated Laplace approximation to the likelihood, the posterior distribution of 1/(1+g) is in turn a tCCH distribution, and approximate marginal likelihoods are thus available analytically. We discuss the local geometric properties of the g-prior in GLMs and show that specific choices of the hyper-parameters satisfy the various desiderata for model selection proposed by Bayarri et al, such as asymptotic model selection consistency, information consistency, intrinsic consistency, and measurement invariance. We also illustrate inference using these priors and contrast them to others in the literature via simulation and real examples.
Resumo:
Recent coccoliths from 52 surface sediment samples recovered from the south-eastern South Atlantic were examined qualitatively and quantitatively in order to assess the controlling mechanisms for their distribution patterns, such as ecological and preservational factors, and their role as carbonate producers. Total coccolith abundances range from 0.2 to 39.9 coccoliths*10**9/ g sediment. Four assemblages can be delineated by their coccolith content characterising the northern Benguela, the middle to southern Benguela, the Walvis Ridge and the deeper water. Distinctions are based on multivariate ordination techniques applied on the relative abundances of the most abundant taxa, Emiliania huxleyi, Calcidiscus leptoporus, Gephyrocapsa spp., Coccolithus pelagicus and subtropical to tropical species. The coccolith distribution seems to be temperature and nutrient controlled co-varying with the seaward extension of the upwelling filament zone in the Benguela. A preservation index (CEX') based on the differential dissolution behaviour of the delicate E. huxleyi and Gephyrocapsa ericsonii versus the robust C. leptoporus is applied in order to detect the position of the coccolith lysocline. Although some samples were recognised as dissolution-affected, the distribution of the coccoliths in the surface-sediments reflects the different oceanographic surface-water conditions. Mass estimations of the coccolith carbonate reveal coccoliths to be only minor contributors to the carbonate preserved in the surface sediments. The mean computed coccolith carbonate content is 17 wt.%, equivalent to a mean contribution of 23% to the bulk carbonate.
Resumo:
In this study we propose the use of the performance measure distribution rather than its punctual value to rank hedge funds. Generalized Sharpe Ratio and other similar measures that take into account the higher-order moments of portfolio return distributions are commonly used to evaluate hedge funds performance. The literature in this field has reported non-significant difference in ranking between performance measures that take, and those that do not take, into account higher moments of distribution. Our approach provides a much more powerful manner to differentiate between hedge funds performance. We use a non-semiparametric density based on Gram-Charlier expansions to forecast the conditional distribution of hedge fund returns and its corresponding performance measure distribution. Through a forecasting exercise we show the advantages of our technique in relation to using the more traditional punctual performance measures.
Resumo:
Thesis (Master's)--University of Washington, 2016-08
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-08
Resumo:
This dissertation proposes statistical methods to formulate, estimate and apply complex transportation models. Two main problems are part of the analyses conducted and presented in this dissertation. The first method solves an econometric problem and is concerned with the joint estimation of models that contain both discrete and continuous decision variables. The use of ordered models along with a regression is proposed and their effectiveness is evaluated with respect to unordered models. Procedure to calculate and optimize the log-likelihood functions of both discrete-continuous approaches are derived, and difficulties associated with the estimation of unordered models explained. Numerical approximation methods based on the Genz algortithm are implemented in order to solve the multidimensional integral associated with the unordered modeling structure. The problems deriving from the lack of smoothness of the probit model around the maximum of the log-likelihood function, which makes the optimization and the calculation of standard deviations very difficult, are carefully analyzed. A methodology to perform out-of-sample validation in the context of a joint model is proposed. Comprehensive numerical experiments have been conducted on both simulated and real data. In particular, the discrete-continuous models are estimated and applied to vehicle ownership and use models on data extracted from the 2009 National Household Travel Survey. The second part of this work offers a comprehensive statistical analysis of free-flow speed distribution; the method is applied to data collected on a sample of roads in Italy. A linear mixed model that includes speed quantiles in its predictors is estimated. Results show that there is no road effect in the analysis of free-flow speeds, which is particularly important for model transferability. A very general framework to predict random effects with few observations and incomplete access to model covariates is formulated and applied to predict the distribution of free-flow speed quantiles. The speed distribution of most road sections is successfully predicted; jack-knife estimates are calculated and used to explain why some sections are poorly predicted. Eventually, this work contributes to the literature in transportation modeling by proposing econometric model formulations for discrete-continuous variables, more efficient methods for the calculation of multivariate normal probabilities, and random effects models for free-flow speed estimation that takes into account the survey design. All methods are rigorously validated on both real and simulated data.
Resumo:
O prognóstico da perda dentária é um dos principais problemas na prática clínica de medicina dentária. Um dos principais fatores prognósticos é a quantidade de suporte ósseo do dente, definido pela área da superfície radicular dentária intraóssea. A estimação desta grandeza tem sido realizada por diferentes metodologias de investigação com resultados heterogéneos. Neste trabalho utilizamos o método da planimetria com microtomografia para calcular a área da superfície radicular (ASR) de uma amostra de cinco dentes segundos pré-molares inferiores obtida da população portuguesa, com o objetivo final de criar um modelo estatístico para estimar a área de superfície radicular intraóssea a partir de indicadores clínicos da perda óssea. Por fim propomos um método para aplicar os resultados na prática. Os dados referentes à área da superfície radicular, comprimento total do dente (CT) e dimensão mésio-distal máxima da coroa (MDeq) serviram para estabelecer as relações estatísticas entre variáveis e definir uma distribuição normal multivariada. Por fim foi criada uma amostra de 37 observações simuladas a partir da distribuição normal multivariada definida e estatisticamente idênticas aos dados da amostra de cinco dentes. Foram ajustados cinco modelos lineares generalizados aos dados simulados. O modelo estatístico foi selecionado segundo os critérios de ajustamento, preditibilidade, potência estatística, acurácia dos parâmetros e da perda de informação, e validado pela análise gráfica de resíduos. Apoiados nos resultados propomos um método em três fases para estimação área de superfície radicular perdida/remanescente. Na primeira fase usamos o modelo estatístico para estimar a área de superfície radicular, na segunda estimamos a proporção (decis) de raiz intraóssea usando uma régua de Schei adaptada e na terceira multiplicamos o valor obtido na primeira fase por um coeficiente que representa a proporção de raiz perdida (ASRp) ou da raiz remanescente (ASRr) para o decil estimado na segunda fase. O ponto forte deste estudo foi a aplicação de metodologia estatística validada para operacionalizar dados clínicos na estimação de suporte ósseo perdido. Como pontos fracos consideramos a aplicação destes resultados apenas aos segundos pré-molares mandibulares e a falta de validação clínica.
Resumo:
The objective of this study was to estimate the spatial distribution of work accident risk in the informal work market in the urban zone of an industrialized city in southeast Brazil and to examine concomitant effects of age, gender, and type of occupation after controlling for spatial risk variation. The basic methodology adopted was that of a population-based case-control study with particular interest focused on the spatial location of work. Cases were all casual workers in the city suffering work accidents during a one-year period; controls were selected from the source population of casual laborers by systematic random sampling of urban homes. The spatial distribution of work accidents was estimated via a semiparametric generalized additive model with a nonparametric bidimensional spline of the geographical coordinates of cases and controls as the nonlinear spatial component, and including age, gender, and occupation as linear predictive variables in the parametric component. We analyzed 1,918 cases and 2,245 controls between 1/11/2003 and 31/10/2004 in Piracicaba, Brazil. Areas of significantly high and low accident risk were identified in relation to mean risk in the study region (p < 0.01). Work accident risk for informal workers varied significantly in the study area. Significant age, gender, and occupational group effects on accident risk were identified after correcting for this spatial variation. A good understanding of high-risk groups and high-risk regions underpins the formulation of hypotheses concerning accident causality and the development of effective public accident prevention policies.
Resumo:
Multivariate normal distribution is commonly encountered in any field, a frequent issue is the missing values in practice. The purpose of this research was to estimate the parameters in three-dimensional covariance permutation-symmetric normal distribution with complete data and all possible patterns of incomplete data. In this study, MLE with missing data were derived, and the properties of the MLE as well as the sampling distributions were obtained. A Monte Carlo simulation study was used to evaluate the performance of the considered estimators for both cases when ρ was known and unknown. All results indicated that, compared to estimators in the case of omitting observations with missing data, the estimators derived in this article led to better performance. Furthermore, when ρ was unknown, using the estimate of ρ would lead to the same conclusion.
Resumo:
This thesis builds a framework for evaluating downside risk from multivariate data via a special class of risk measures (RM). The peculiarity of the analysis lies in getting rid of strong data distributional assumptions and in orientation towards the most critical data in risk management: those with asymmetries and heavy tails. At the same time, under typical assumptions, such as the ellipticity of the data probability distribution, the conformity with classical methods is shown. The constructed class of RM is a multivariate generalization of the coherent distortion RM, which possess valuable properties for a risk manager. The design of the framework is twofold. The first part contains new computational geometry methods for the high-dimensional data. The developed algorithms demonstrate computability of geometrical concepts used for constructing the RM. These concepts bring visuality and simplify interpretation of the RM. The second part develops models for applying the framework to actual problems. The spectrum of applications varies from robust portfolio selection up to broader spheres, such as stochastic conic optimization with risk constraints or supervised machine learning.
Resumo:
Taxonomic distinction to species level of deep water sharks is complex and often impossible to achieve during fisheries-related studies. The species of the genus Etmopterus are particularly difficult to identify, so they often appear without species assignation as Etmopetrus sp. or spp. in studies, even those focusing on elasmobranchs. During this work, the morphometric traits of two species of Etmopterus, E. spinax and E. pusillus were studied using 27 different morphological measurements, relatively easy to obtain even in the field. These measurements were processed with multivariate analysis in order to find out the most important ones likely to separate the two species. Sexual dimorphism was also assessed using the same techniques, and it was found that it does not occur in these species. The two Etmopterus species presented in this study share the same habitats in the overlapping ranges of distribution and are caught together on the outer shelves and slopes of the north-eastern Atlantic.
Resumo:
Supplementary feeding is a widespread game management practice in several red deer (Cervus elaphus) populations, with important potential consequences on the biology of this species. InMediterranean ecosystems food supplementation occurs in the rutting period, when it may change mating system characteristics. We studied the role of food supplementation relative to natural resources in the spatial distribution, aggregation, and mean harem size of females in Iberian red deer (Cervus elaphus hispanicus) during the rut. We studied 30 red deer populations of southwestern Spain, 63% of which experienced supplementary feeding. Using multivariate spatial analyses we found that food supplementation affected distribution of females in 95% of the populations in which it occurred. Green meadows present during the mating season acted as an important natural resource influencing female distribution. Additionally, the level of female aggregation and mean harem size were significantly higher in those populations in which food supplementation determined female distribution than in populations in which female distribution did not depend on supplementary feeding. Because female aggregation and mean harem size are key elements in sexual selection, supplementary feeding may constitute an important anthropogenic element with potential evolutionary implications for populations of Iberian red deer.
Resumo:
Although various abutment connections and materials have recently been introduced, insufficient data exist regarding the effect of stress distribution on their mechanical performance. The purpose of this study was to investigate the effect of different abutment materials and platform connections on stress distribution in single anterior implant-supported restorations with the finite element method. Nine experimental groups were modeled from the combination of 3 platform connections (external hexagon, internal hexagon, and Morse tapered) and 3 abutment materials (titanium, zirconia, and hybrid) as follows: external hexagon-titanium, external hexagon-zirconia, external hexagon-hybrid, internal hexagon-titanium, internal hexagon-zirconia, internal hexagon-hybrid, Morse tapered-titanium, Morse tapered-zirconia, and Morse tapered-hybrid. Finite element models consisted of a 4×13-mm implant, anatomic abutment, and lithium disilicate central incisor crown cemented over the abutment. The 49 N occlusal loading was applied in 6 steps to simulate the incisal guidance. Equivalent von Mises stress (σvM) was used for both the qualitative and quantitative evaluation of the implant and abutment in all the groups and the maximum (σmax) and minimum (σmin) principal stresses for the numerical comparison of the zirconia parts. The highest abutment σvM occurred in the Morse-tapered groups and the lowest in the external hexagon-hybrid, internal hexagon-titanium, and internal hexagon-hybrid groups. The σmax and σmin values were lower in the hybrid groups than in the zirconia groups. The stress distribution concentrated in the abutment-implant interface in all the groups, regardless of the platform connection or abutment material. The platform connection influenced the stress on abutments more than the abutment material. The stress values for implants were similar among different platform connections, but greater stress concentrations were observed in internal connections.