989 resultados para Statistical Theory
Resumo:
Recurrent event data are largely characterized by the rate function but smoothing techniques for estimating the rate function have never been rigorously developed or studied in statistical literature. This paper considers the moment and least squares methods for estimating the rate function from recurrent event data. With an independent censoring assumption on the recurrent event process, we study statistical properties of the proposed estimators and propose bootstrap procedures for the bandwidth selection and for the approximation of confidence intervals in the estimation of the occurrence rate function. It is identified that the moment method without resmoothing via a smaller bandwidth will produce curve with nicks occurring at the censoring times, whereas there is no such problem with the least squares method. Furthermore, the asymptotic variance of the least squares estimator is shown to be smaller under regularity conditions. However, in the implementation of the bootstrap procedures, the moment method is computationally more efficient than the least squares method because the former approach uses condensed bootstrap data. The performance of the proposed procedures is studied through Monte Carlo simulations and an epidemiological example on intravenous drug users.
Resumo:
In 2011, there will be an estimated 1,596,670 new cancer cases and 571,950 cancer-related deaths in the US. With the ever-increasing applications of cancer genetics in epidemiology, there is great potential to identify genetic risk factors that would help identify individuals with increased genetic susceptibility to cancer, which could be used to develop interventions or targeted therapies that could hopefully reduce cancer risk and mortality. In this dissertation, I propose to develop a new statistical method to evaluate the role of haplotypes in cancer susceptibility and development. This model will be flexible enough to handle not only haplotypes of any size, but also a variety of covariates. I will then apply this method to three cancer-related data sets (Hodgkin Disease, Glioma, and Lung Cancer). I hypothesize that there is substantial improvement in the estimation of association between haplotypes and disease, with the use of a Bayesian mathematical method to infer haplotypes that uses prior information from known genetics sources. Analysis based on haplotypes using information from publically available genetic sources generally show increased odds ratios and smaller p-values in both the Hodgkin, Glioma, and Lung data sets. For instance, the Bayesian Joint Logistic Model (BJLM) inferred haplotype TC had a substantially higher estimated effect size (OR=12.16, 95% CI = 2.47-90.1 vs. 9.24, 95% CI = 1.81-47.2) and more significant p-value (0.00044 vs. 0.008) for Hodgkin Disease compared to a traditional logistic regression approach. Also, the effect sizes of haplotypes modeled with recessive genetic effects were higher (and had more significant p-values) when analyzed with the BJLM. Full genetic models with haplotype information developed with the BJLM resulted in significantly higher discriminatory power and a significantly higher Net Reclassification Index compared to those developed with haplo.stats for lung cancer. Future analysis for this work could be to incorporate the 1000 Genomes project, which offers a larger selection of SNPs can be incorporated into the information from known genetic sources as well. Other future analysis include testing non-binary outcomes, like the levels of biomarkers that are present in lung cancer (NNK), and extending this analysis to full GWAS studies.
Resumo:
Neuronal morphology is a key feature in the study of brain circuits, as it is highly related to information processing and functional identification. Neuronal morphology affects the process of integration of inputs from other neurons and determines the neurons which receive the output of the neurons. Different parts of the neurons can operate semi-independently according to the spatial location of the synaptic connections. As a result, there is considerable interest in the analysis of the microanatomy of nervous cells since it constitutes an excellent tool for better understanding cortical function. However, the morphologies, molecular features and electrophysiological properties of neuronal cells are extremely variable. Except for some special cases, this variability makes it hard to find a set of features that unambiguously define a neuronal type. In addition, there are distinct types of neurons in particular regions of the brain. This morphological variability makes the analysis and modeling of neuronal morphology a challenge. Uncertainty is a key feature in many complex real-world problems. Probability theory provides a framework for modeling and reasoning with uncertainty. Probabilistic graphical models combine statistical theory and graph theory to provide a tool for managing domains with uncertainty. In particular, we focus on Bayesian networks, the most commonly used probabilistic graphical model. In this dissertation, we design new methods for learning Bayesian networks and apply them to the problem of modeling and analyzing morphological data from neurons. The morphology of a neuron can be quantified using a number of measurements, e.g., the length of the dendrites and the axon, the number of bifurcations, the direction of the dendrites and the axon, etc. These measurements can be modeled as discrete or continuous data. The continuous data can be linear (e.g., the length or the width of a dendrite) or directional (e.g., the direction of the axon). These data may follow complex probability distributions and may not fit any known parametric distribution. Modeling this kind of problems using hybrid Bayesian networks with discrete, linear and directional variables poses a number of challenges regarding learning from data, inference, etc. In this dissertation, we propose a method for modeling and simulating basal dendritic trees from pyramidal neurons using Bayesian networks to capture the interactions between the variables in the problem domain. A complete set of variables is measured from the dendrites, and a learning algorithm is applied to find the structure and estimate the parameters of the probability distributions included in the Bayesian networks. Then, a simulation algorithm is used to build the virtual dendrites by sampling values from the Bayesian networks, and a thorough evaluation is performed to show the model’s ability to generate realistic dendrites. In this first approach, the variables are discretized so that discrete Bayesian networks can be learned and simulated. Then, we address the problem of learning hybrid Bayesian networks with different kinds of variables. Mixtures of polynomials have been proposed as a way of representing probability densities in hybrid Bayesian networks. We present a method for learning mixtures of polynomials approximations of one-dimensional, multidimensional and conditional probability densities from data. The method is based on basis spline interpolation, where a density is approximated as a linear combination of basis splines. The proposed algorithms are evaluated using artificial datasets. We also use the proposed methods as a non-parametric density estimation technique in Bayesian network classifiers. Next, we address the problem of including directional data in Bayesian networks. These data have some special properties that rule out the use of classical statistics. Therefore, different distributions and statistics, such as the univariate von Mises and the multivariate von Mises–Fisher distributions, should be used to deal with this kind of information. In particular, we extend the naive Bayes classifier to the case where the conditional probability distributions of the predictive variables given the class follow either of these distributions. We consider the simple scenario, where only directional predictive variables are used, and the hybrid case, where discrete, Gaussian and directional distributions are mixed. The classifier decision functions and their decision surfaces are studied at length. Artificial examples are used to illustrate the behavior of the classifiers. The proposed classifiers are empirically evaluated over real datasets. We also study the problem of interneuron classification. An extensive group of experts is asked to classify a set of neurons according to their most prominent anatomical features. A web application is developed to retrieve the experts’ classifications. We compute agreement measures to analyze the consensus between the experts when classifying the neurons. Using Bayesian networks and clustering algorithms on the resulting data, we investigate the suitability of the anatomical terms and neuron types commonly used in the literature. Additionally, we apply supervised learning approaches to automatically classify interneurons using the values of their morphological measurements. Then, a methodology for building a model which captures the opinions of all the experts is presented. First, one Bayesian network is learned for each expert, and we propose an algorithm for clustering Bayesian networks corresponding to experts with similar behaviors. Then, a Bayesian network which represents the opinions of each group of experts is induced. Finally, a consensus Bayesian multinet which models the opinions of the whole group of experts is built. A thorough analysis of the consensus model identifies different behaviors between the experts when classifying the interneurons in the experiment. A set of characterizing morphological traits for the neuronal types can be defined by performing inference in the Bayesian multinet. These findings are used to validate the model and to gain some insights into neuron morphology. Finally, we study a classification problem where the true class label of the training instances is not known. Instead, a set of class labels is available for each instance. This is inspired by the neuron classification problem, where a group of experts is asked to individually provide a class label for each instance. We propose a novel approach for learning Bayesian networks using count vectors which represent the number of experts who selected each class label for each instance. These Bayesian networks are evaluated using artificial datasets from supervised learning problems. Resumen La morfología neuronal es una característica clave en el estudio de los circuitos cerebrales, ya que está altamente relacionada con el procesado de información y con los roles funcionales. La morfología neuronal afecta al proceso de integración de las señales de entrada y determina las neuronas que reciben las salidas de otras neuronas. Las diferentes partes de la neurona pueden operar de forma semi-independiente de acuerdo a la localización espacial de las conexiones sinápticas. Por tanto, existe un interés considerable en el análisis de la microanatomía de las células nerviosas, ya que constituye una excelente herramienta para comprender mejor el funcionamiento de la corteza cerebral. Sin embargo, las propiedades morfológicas, moleculares y electrofisiológicas de las células neuronales son extremadamente variables. Excepto en algunos casos especiales, esta variabilidad morfológica dificulta la definición de un conjunto de características que distingan claramente un tipo neuronal. Además, existen diferentes tipos de neuronas en regiones particulares del cerebro. La variabilidad neuronal hace que el análisis y el modelado de la morfología neuronal sean un importante reto científico. La incertidumbre es una propiedad clave en muchos problemas reales. La teoría de la probabilidad proporciona un marco para modelar y razonar bajo incertidumbre. Los modelos gráficos probabilísticos combinan la teoría estadística y la teoría de grafos con el objetivo de proporcionar una herramienta con la que trabajar bajo incertidumbre. En particular, nos centraremos en las redes bayesianas, el modelo más utilizado dentro de los modelos gráficos probabilísticos. En esta tesis hemos diseñado nuevos métodos para aprender redes bayesianas, inspirados por y aplicados al problema del modelado y análisis de datos morfológicos de neuronas. La morfología de una neurona puede ser cuantificada usando una serie de medidas, por ejemplo, la longitud de las dendritas y el axón, el número de bifurcaciones, la dirección de las dendritas y el axón, etc. Estas medidas pueden ser modeladas como datos continuos o discretos. A su vez, los datos continuos pueden ser lineales (por ejemplo, la longitud o la anchura de una dendrita) o direccionales (por ejemplo, la dirección del axón). Estos datos pueden llegar a seguir distribuciones de probabilidad muy complejas y pueden no ajustarse a ninguna distribución paramétrica conocida. El modelado de este tipo de problemas con redes bayesianas híbridas incluyendo variables discretas, lineales y direccionales presenta una serie de retos en relación al aprendizaje a partir de datos, la inferencia, etc. En esta tesis se propone un método para modelar y simular árboles dendríticos basales de neuronas piramidales usando redes bayesianas para capturar las interacciones entre las variables del problema. Para ello, se mide un amplio conjunto de variables de las dendritas y se aplica un algoritmo de aprendizaje con el que se aprende la estructura y se estiman los parámetros de las distribuciones de probabilidad que constituyen las redes bayesianas. Después, se usa un algoritmo de simulación para construir dendritas virtuales mediante el muestreo de valores de las redes bayesianas. Finalmente, se lleva a cabo una profunda evaluaci ón para verificar la capacidad del modelo a la hora de generar dendritas realistas. En esta primera aproximación, las variables fueron discretizadas para poder aprender y muestrear las redes bayesianas. A continuación, se aborda el problema del aprendizaje de redes bayesianas con diferentes tipos de variables. Las mixturas de polinomios constituyen un método para representar densidades de probabilidad en redes bayesianas híbridas. Presentamos un método para aprender aproximaciones de densidades unidimensionales, multidimensionales y condicionales a partir de datos utilizando mixturas de polinomios. El método se basa en interpolación con splines, que aproxima una densidad como una combinación lineal de splines. Los algoritmos propuestos se evalúan utilizando bases de datos artificiales. Además, las mixturas de polinomios son utilizadas como un método no paramétrico de estimación de densidades para clasificadores basados en redes bayesianas. Después, se estudia el problema de incluir información direccional en redes bayesianas. Este tipo de datos presenta una serie de características especiales que impiden el uso de las técnicas estadísticas clásicas. Por ello, para manejar este tipo de información se deben usar estadísticos y distribuciones de probabilidad específicos, como la distribución univariante von Mises y la distribución multivariante von Mises–Fisher. En concreto, en esta tesis extendemos el clasificador naive Bayes al caso en el que las distribuciones de probabilidad condicionada de las variables predictoras dada la clase siguen alguna de estas distribuciones. Se estudia el caso base, en el que sólo se utilizan variables direccionales, y el caso híbrido, en el que variables discretas, lineales y direccionales aparecen mezcladas. También se estudian los clasificadores desde un punto de vista teórico, derivando sus funciones de decisión y las superficies de decisión asociadas. El comportamiento de los clasificadores se ilustra utilizando bases de datos artificiales. Además, los clasificadores son evaluados empíricamente utilizando bases de datos reales. También se estudia el problema de la clasificación de interneuronas. Desarrollamos una aplicación web que permite a un grupo de expertos clasificar un conjunto de neuronas de acuerdo a sus características morfológicas más destacadas. Se utilizan medidas de concordancia para analizar el consenso entre los expertos a la hora de clasificar las neuronas. Se investiga la idoneidad de los términos anatómicos y de los tipos neuronales utilizados frecuentemente en la literatura a través del análisis de redes bayesianas y la aplicación de algoritmos de clustering. Además, se aplican técnicas de aprendizaje supervisado con el objetivo de clasificar de forma automática las interneuronas a partir de sus valores morfológicos. A continuación, se presenta una metodología para construir un modelo que captura las opiniones de todos los expertos. Primero, se genera una red bayesiana para cada experto y se propone un algoritmo para agrupar las redes bayesianas que se corresponden con expertos con comportamientos similares. Después, se induce una red bayesiana que modela la opinión de cada grupo de expertos. Por último, se construye una multired bayesiana que modela las opiniones del conjunto completo de expertos. El análisis del modelo consensuado permite identificar diferentes comportamientos entre los expertos a la hora de clasificar las neuronas. Además, permite extraer un conjunto de características morfológicas relevantes para cada uno de los tipos neuronales mediante inferencia con la multired bayesiana. Estos descubrimientos se utilizan para validar el modelo y constituyen información relevante acerca de la morfología neuronal. Por último, se estudia un problema de clasificación en el que la etiqueta de clase de los datos de entrenamiento es incierta. En cambio, disponemos de un conjunto de etiquetas para cada instancia. Este problema está inspirado en el problema de la clasificación de neuronas, en el que un grupo de expertos proporciona una etiqueta de clase para cada instancia de manera individual. Se propone un método para aprender redes bayesianas utilizando vectores de cuentas, que representan el número de expertos que seleccionan cada etiqueta de clase para cada instancia. Estas redes bayesianas se evalúan utilizando bases de datos artificiales de problemas de aprendizaje supervisado.
Resumo:
Tranformed-rule up and down psychophysical methods have gained great popularity, mainly because they combine criterion-free responses with an adaptive procedure allowing rapid determination of an average stimulus threshold at various criterion levels of correct responses. The statistical theory underlying the methods now in routine use is based on sets of consecutive responses with assumed constant probabilities of occurrence. The response rules requiring consecutive responses prevent the possibility of using the most desirable response criterion, that of 75% correct responses. The earliest transformed-rule up and down method, whose rules included nonconsecutive responses, did not contain this limitation but failed to become generally accepted, lacking a published theoretical foundation. Such a foundation is provided in this article and is validated empirically with the help of experiments on human subjects and a computer simulation. In addition to allowing the criterion of 75% correct responses, the method is more efficient than the methods excluding nonconsecutive responses in their rules.
Resumo:
We present a novel method, called the transform likelihood ratio (TLR) method, for estimation of rare event probabilities with heavy-tailed distributions. Via a simple transformation ( change of variables) technique the TLR method reduces the original rare event probability estimation with heavy tail distributions to an equivalent one with light tail distributions. Once this transformation has been established we estimate the rare event probability via importance sampling, using the classical exponential change of measure or the standard likelihood ratio change of measure. In the latter case the importance sampling distribution is chosen from the same parametric family as the transformed distribution. We estimate the optimal parameter vector of the importance sampling distribution using the cross-entropy method. We prove the polynomial complexity of the TLR method for certain heavy-tailed models and demonstrate numerically its high efficiency for various heavy-tailed models previously thought to be intractable. We also show that the TLR method can be viewed as a universal tool in the sense that not only it provides a unified view for heavy-tailed simulation but also can be efficiently used in simulation with light-tailed distributions. We present extensive simulation results which support the efficiency of the TLR method.
Resumo:
It is shown that variance-balanced designs can be obtained from Type I orthogonal arrays for many general models with two kinds of treatment effects, including ones for interference, with general dependence structures. These designs can be used to obtain optimal and efficient designs. Some examples and design comparisons are given. (C) 2002 Elsevier B.V. All rights reserved.
Resumo:
In this article we investigate the asymptotic and finite-sample properties of predictors of regression models with autocorrelated errors. We prove new theorems associated with the predictive efficiency of generalized least squares (GLS) and incorrectly structured GLS predictors. We also establish the form associated with their predictive mean squared errors as well as the magnitude of these errors relative to each other and to those generated from the ordinary least squares (OLS) predictor. A large simulation study is used to evaluate the finite-sample performance of forecasts generated from models using different corrections for the serial correlation.
Resumo:
In this paper we propose a new identification method based on the residual white noise autoregressive criterion (Pukkila et al. , 1990) to select the order of VARMA structures. Results from extensive simulation experiments based on different model structures with varying number of observations and number of component series are used to demonstrate the performance of this new procedure. We also use economic and business data to compare the model structures selected by this order selection method with those identified in other published studies.
Resumo:
In this paper we investigate a Bayesian procedure for the estimation of a flexible generalised distribution, notably the MacGillivray adaptation of the g-and-κ distribution. This distribution, described through its inverse cdf or quantile function, generalises the standard normal through extra parameters which together describe skewness and kurtosis. The standard quantile-based methods for estimating the parameters of generalised distributions are often arbitrary and do not rely on computation of the likelihood. MCMC, however, provides a simulation-based alternative for obtaining the maximum likelihood estimates of parameters of these distributions or for deriving posterior estimates of the parameters through a Bayesian framework. In this paper we adopt the latter approach, The proposed methodology is illustrated through an application in which the parameter of interest is slightly skewed.
Resumo:
There are at least two reasons for a symmetric, unimodal, diffuse tailed hyperbolic secant distribution to be interesting in real-life applications. It displays one of the common types of non normality in natural data and is closely related to the logistic and Cauchy distributions that often arise in practice. To test the difference in location between two hyperbolic secant distributions, we develop a simple linear rank test with trigonometric scores. We investigate the small-sample and asymptotic properties of the test statistic and provide tables of the exact null distribution for small sample sizes. We compare the test to the Wilcoxon two-sample test and show that, although the asymptotic powers of the tests are comparable, the present test has certain practical advantages over the Wilcoxon test.
Resumo:
The cross-entropy (CE) method is a new generic approach to combinatorial and multi-extremal optimization and rare event simulation. The purpose of this tutorial is to give a gentle introduction to the CE method. We present the CE methodology, the basic algorithm and its modifications, and discuss applications in combinatorial optimization and machine learning. combinatorial optimization
Resumo:
Consider a network of unreliable links, modelling for example a communication network. Estimating the reliability of the network-expressed as the probability that certain nodes in the network are connected-is a computationally difficult task. In this paper we study how the Cross-Entropy method can be used to obtain more efficient network reliability estimation procedures. Three techniques of estimation are considered: Crude Monte Carlo and the more sophisticated Permutation Monte Carlo and Merge Process. We show that the Cross-Entropy method yields a speed-up over all three techniques.
Resumo:
The buffer allocation problem (BAP) is a well-known difficult problem in the design of production lines. We present a stochastic algorithm for solving the BAP, based on the cross-entropy method, a new paradigm for stochastic optimization. The algorithm involves the following iterative steps: (a) the generation of buffer allocations according to a certain random mechanism, followed by (b) the modification of this mechanism on the basis of cross-entropy minimization. Through various numerical experiments we demonstrate the efficiency of the proposed algorithm and show that the method can quickly generate (near-)optimal buffer allocations for fairly large production lines.
Resumo:
We consider the problem of estimating P(Yi + (...) + Y-n > x) by importance sampling when the Yi are i.i.d. and heavy-tailed. The idea is to exploit the cross-entropy method as a toot for choosing good parameters in the importance sampling distribution; in doing so, we use the asymptotic description that given P(Y-1 + (...) + Y-n > x), n - 1 of the Yi have distribution F and one the conditional distribution of Y given Y > x. We show in some specific parametric examples (Pareto and Weibull) how this leads to precise answers which, as demonstrated numerically, are close to being variance minimal within the parametric class under consideration. Related problems for M/G/l and GI/G/l queues are also discussed.
Resumo:
The generalized secant hyperbolic distribution (GSHD) proposed in Vaughan (2002) includes a wide range of unimodal symmetric distributions, with the Cauchy and uniform distributions being the limiting cases, and the logistic and hyperbolic secant distributions being special cases. The current article derives an asymptotically efficient rank estimator of the location parameter of the GSHD and suggests the corresponding one- and two-sample optimal rank tests. The rank estimator derived is compared to the modified MLE of location proposed in Vaughan (2002). By combining these two estimators, a computationally attractive method for constructing an exact confidence interval of the location parameter is developed. The statistical procedures introduced in the current article are illustrated by examples.