23 resultados para tilastotiede
Resumo:
Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or base the inference only on the likelihood function, may be a fundamental question in theory, but in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where parametric assumptions in standard statistical models would be too rigid, more flexible model formulation, combined with fully probabilistic inference, can be achieved using hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling under various problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two of them hierarchical Bayesian modeling. Because maximum likelihood may be presented as a special case of Bayesian inference, but not the other way round, in the introductory part of this work we present a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the toolbox equipped herein, to show that they are also justifiable under this more general framework. Here the assumption of exchangeability and de Finetti's representation theorem are applied repeatedly for justifying the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that this same reasoning can be applied also under sampling from a finite population. The main emphasis here is in probability-based inference under incomplete observation due to study design. This is illustrated using a generic two-phase cohort sampling design as an example. The alternative approaches presented for analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied for a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates and a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is feasible also in this case.
Resumo:
Advancements in the analysis techniques have led to a rapid accumulation of biological data in databases. Such data often are in the form of sequences of observations, examples including DNA sequences and amino acid sequences of proteins. The scale and quality of the data give promises of answering various biologically relevant questions in more detail than what has been possible before. For example, one may wish to identify areas in an amino acid sequence, which are important for the function of the corresponding protein, or investigate how characteristics on the level of DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with the understanding of the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet with the challenge of deriving meaning from the increasing amounts of data. Our main concern is on modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which is used to describe the structure of data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets provide also a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. The problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters. The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
Resumo:
Genetics, the science of heredity and variation in living organisms, has a central role in medicine, in breeding crops and livestock, and in studying fundamental topics of biological sciences such as evolution and cell functioning. Currently the field of genetics is under a rapid development because of the recent advances in technologies by which molecular data can be obtained from living organisms. In order that most information from such data can be extracted, the analyses need to be carried out using statistical models that are tailored to take account of the particular genetic processes. In this thesis we formulate and analyze Bayesian models for genetic marker data of contemporary individuals. The major focus is on the modeling of the unobserved recent ancestry of the sampled individuals (say, for tens of generations or so), which is carried out by using explicit probabilistic reconstructions of the pedigree structures accompanied by the gene flows at the marker loci. For such a recent history, the recombination process is the major genetic force that shapes the genomes of the individuals, and it is included in the model by assuming that the recombination fractions between the adjacent markers are known. The posterior distribution of the unobserved history of the individuals is studied conditionally on the observed marker data by using a Markov chain Monte Carlo algorithm (MCMC). The example analyses consider estimation of the population structure, relatedness structure (both at the level of whole genomes as well as at each marker separately), and haplotype configurations. For situations where the pedigree structure is partially known, an algorithm to create an initial state for the MCMC algorithm is given. Furthermore, the thesis includes an extension of the model for the recent genetic history to situations where also a quantitative phenotype has been measured from the contemporary individuals. In that case the goal is to identify positions on the genome that affect the observed phenotypic values. This task is carried out within the Bayesian framework, where the number and the relative effects of the quantitative trait loci are treated as random variables whose posterior distribution is studied conditionally on the observed genetic and phenotypic data. In addition, the thesis contains an extension of a widely-used haplotyping method, the PHASE algorithm, to settings where genetic material from several individuals has been pooled together, and the allele frequencies of each pool are determined in a single genotyping.
Resumo:
Microarrays are high throughput biological assays that allow the screening of thousands of genes for their expression. The main idea behind microarrays is to compute for each gene a unique signal that is directly proportional to the quantity of mRNA that was hybridized on the chip. A large number of steps and errors associated with each step make the generated expression signal noisy. As a result, microarray data need to be carefully pre-processed before their analysis can be assumed to lead to reliable and biologically relevant conclusions. This thesis focuses on developing methods for improving gene signal and further utilizing this improved signal for higher level analysis. To achieve this, first, approaches for designing microarray experiments using various optimality criteria, considering both biological and technical replicates, are described. A carefully designed experiment leads to signal with low noise, as the effect of unwanted variations is minimized and the precision of the estimates of the parameters of interest are maximized. Second, a system for improving the gene signal by using three scans at varying scanner sensitivities is developed. A novel Bayesian latent intensity model is then applied on these three sets of expression values, corresponding to the three scans, to estimate the suitably calibrated true signal of genes. Third, a novel image segmentation approach that segregates the fluorescent signal from the undesired noise is developed using an additional dye, SYBR green RNA II. This technique helped in identifying signal only with respect to the hybridized DNA, and signal corresponding to dust, scratch, spilling of dye, and other noises, are avoided. Fourth, an integrated statistical model is developed, where signal correction, systematic array effects, dye effects, and differential expression, are modelled jointly as opposed to a sequential application of several methods of analysis. The methods described in here have been tested only for cDNA microarrays, but can also, with some modifications, be applied to other high-throughput technologies. Keywords: High-throughput technology, microarray, cDNA, multiple scans, Bayesian hierarchical models, image analysis, experimental design, MCMC, WinBUGS.
Resumo:
Bacteria play an important role in many ecological systems. The molecular characterization of bacteria using either cultivation-dependent or cultivation-independent methods reveals the large scale of bacterial diversity in natural communities, and the vastness of subpopulations within a species or genus. Understanding how bacterial diversity varies across different environments and also within populations should provide insights into many important questions of bacterial evolution and population dynamics. This thesis presents novel statistical methods for analyzing bacterial diversity using widely employed molecular fingerprinting techniques. The first objective of this thesis was to develop Bayesian clustering models to identify bacterial population structures. Bacterial isolates were identified using multilous sequence typing (MLST), and Bayesian clustering models were used to explore the evolutionary relationships among isolates. Our method involves the inference of genetic population structures via an unsupervised clustering framework where the dependence between loci is represented using graphical models. The population dynamics that generate such a population stratification were investigated using a stochastic model, in which homologous recombination between subpopulations can be quantified within a gene flow network. The second part of the thesis focuses on cluster analysis of community compositional data produced by two different cultivation-independent analyses: terminal restriction fragment length polymorphism (T-RFLP) analysis, and fatty acid methyl ester (FAME) analysis. The cluster analysis aims to group bacterial communities that are similar in composition, which is an important step for understanding the overall influences of environmental and ecological perturbations on bacterial diversity. A common feature of T-RFLP and FAME data is zero-inflation, which indicates that the observation of a zero value is much more frequent than would be expected, for example, from a Poisson distribution in the discrete case, or a Gaussian distribution in the continuous case. We provided two strategies for modeling zero-inflation in the clustering framework, which were validated by both synthetic and empirical complex data sets. We show in the thesis that our model that takes into account dependencies between loci in MLST data can produce better clustering results than those methods which assume independent loci. Furthermore, computer algorithms that are efficient in analyzing large scale data were adopted for meeting the increasing computational need. Our method that detects homologous recombination in subpopulations may provide a theoretical criterion for defining bacterial species. The clustering of bacterial community data include T-RFLP and FAME provides an initial effort for discovering the evolutionary dynamics that structure and maintain bacterial diversity in the natural environment.
Resumo:
This thesis addresses modeling of financial time series, especially stock market returns and daily price ranges. Modeling data of this kind can be approached with so-called multiplicative error models (MEM). These models nest several well known time series models such as GARCH, ACD and CARR models. They are able to capture many well established features of financial time series including volatility clustering and leptokurtosis. In contrast to these phenomena, different kinds of asymmetries have received relatively little attention in the existing literature. In this thesis asymmetries arise from various sources. They are observed in both conditional and unconditional distributions, for variables with non-negative values and for variables that have values on the real line. In the multivariate context asymmetries can be observed in the marginal distributions as well as in the relationships of the variables modeled. New methods for all these cases are proposed. Chapter 2 considers GARCH models and modeling of returns of two stock market indices. The chapter introduces the so-called generalized hyperbolic (GH) GARCH model to account for asymmetries in both conditional and unconditional distribution. In particular, two special cases of the GARCH-GH model which describe the data most accurately are proposed. They are found to improve the fit of the model when compared to symmetric GARCH models. The advantages of accounting for asymmetries are also observed through Value-at-Risk applications. Both theoretical and empirical contributions are provided in Chapter 3 of the thesis. In this chapter the so-called mixture conditional autoregressive range (MCARR) model is introduced, examined and applied to daily price ranges of the Hang Seng Index. The conditions for the strict and weak stationarity of the model as well as an expression for the autocorrelation function are obtained by writing the MCARR model as a first order autoregressive process with random coefficients. The chapter also introduces inverse gamma (IG) distribution to CARR models. The advantages of CARR-IG and MCARR-IG specifications over conventional CARR models are found in the empirical application both in- and out-of-sample. Chapter 4 discusses the simultaneous modeling of absolute returns and daily price ranges. In this part of the thesis a vector multiplicative error model (VMEM) with asymmetric Gumbel copula is found to provide substantial benefits over the existing VMEM models based on elliptical copulas. The proposed specification is able to capture the highly asymmetric dependence of the modeled variables thereby improving the performance of the model considerably. The economic significance of the results obtained is established when the information content of the volatility forecasts derived is examined.
Resumo:
In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data with an application to mobile device positioning. In the second part of the Thesis, we discuss so called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
Resumo:
This thesis studies quantile residuals and uses different methodologies to develop test statistics that are applicable in evaluating linear and nonlinear time series models based on continuous distributions. Models based on mixtures of distributions are of special interest because it turns out that for those models traditional residuals, often referred to as Pearson's residuals, are not appropriate. As such models have become more and more popular in practice, especially with financial time series data there is a need for reliable diagnostic tools that can be used to evaluate them. The aim of the thesis is to show how such diagnostic tools can be obtained and used in model evaluation. The quantile residuals considered here are defined in such a way that, when the model is correctly specified and its parameters are consistently estimated, they are approximately independent with standard normal distribution. All the tests derived in the thesis are pure significance type tests and are theoretically sound in that they properly take the uncertainty caused by parameter estimation into account. -- In Chapter 2 a general framework based on the likelihood function and smooth functions of univariate quantile residuals is derived that can be used to obtain misspecification tests for various purposes. Three easy-to-use tests aimed at detecting non-normality, autocorrelation, and conditional heteroscedasticity in quantile residuals are formulated. It also turns out that these tests can be interpreted as Lagrange Multiplier or score tests so that they are asymptotically optimal against local alternatives. Chapter 3 extends the concept of quantile residuals to multivariate models. The framework of Chapter 2 is generalized and tests aimed at detecting non-normality, serial correlation, and conditional heteroscedasticity in multivariate quantile residuals are derived based on it. Score test interpretations are obtained for the serial correlation and conditional heteroscedasticity tests and in a rather restricted special case for the normality test. In Chapter 4 the tests are constructed using the empirical distribution function of quantile residuals. So-called Khmaladze s martingale transformation is applied in order to eliminate the uncertainty caused by parameter estimation. Various test statistics are considered so that critical bounds for histogram type plots as well as Quantile-Quantile and Probability-Probability type plots of quantile residuals are obtained. Chapters 2, 3, and 4 contain simulations and empirical examples which illustrate the finite sample size and power properties of the derived tests and also how the tests and related graphical tools based on residuals are applied in practice.
Resumo:
This study examines the properties of Generalised Regression (GREG) estimators for domain class frequencies and proportions. The family of GREG estimators forms the class of design-based model-assisted estimators. All GREG estimators utilise auxiliary information via modelling. The classic GREG estimator with a linear fixed effects assisting model (GREG-lin) is one example. But when estimating class frequencies, the study variable is binary or polytomous. Therefore logistic-type assisting models (e.g. logistic or probit model) should be preferred over the linear one. However, other GREG estimators than GREG-lin are rarely used, and knowledge about their properties is limited. This study examines the properties of L-GREG estimators, which are GREG estimators with fixed-effects logistic-type models. Three research questions are addressed. First, I study whether and when L-GREG estimators are more accurate than GREG-lin. Theoretical results and Monte Carlo experiments which cover both equal and unequal probability sampling designs and a wide variety of model formulations show that in standard situations, the difference between L-GREG and GREG-lin is small. But in the case of a strong assisting model, two interesting situations arise: if the domain sample size is reasonably large, L-GREG is more accurate than GREG-lin, and if the domain sample size is very small, estimation of assisting model parameters may be inaccurate, resulting in bias for L-GREG. Second, I study variance estimation for the L-GREG estimators. The standard variance estimator (S) for all GREG estimators resembles the Sen-Yates-Grundy variance estimator, but it is a double sum of prediction errors, not of the observed values of the study variable. Monte Carlo experiments show that S underestimates the variance of L-GREG especially if the domain sample size is minor, or if the assisting model is strong. Third, since the standard variance estimator S often fails for the L-GREG estimators, I propose a new augmented variance estimator (A). The difference between S and the new estimator A is that the latter takes into account the difference between the sample fit model and the census fit model. In Monte Carlo experiments, the new estimator A outperformed the standard estimator S in terms of bias, root mean square error and coverage rate. Thus the new estimator provides a good alternative to the standard estimator.
Resumo:
The resources of health systems are limited. There is a need for information concerning the performance of the health system for the purposes of decision-making. This study is about utilization of administrative registers in the context of health system performance evaluation. In order to address this issue, a multidisciplinary methodological framework for register-based data analysis is defined. Because the fixed structure of register-based data indirectly determines constraints on the theoretical constructs, it is essential to elaborate the whole analytic process with respect to the data. The fundamental methodological concepts and theories are synthesized into a data sensitive approach which helps to understand and overcome the problems that are likely to be encountered during a register-based data analyzing process. A pragmatically useful health system performance monitoring should produce valid information about the volume of the problems, about the use of services and about the effectiveness of provided services. A conceptual model for hip fracture performance assessment is constructed and the validity of Finnish registers as a data source for the purposes of performance assessment of hip fracture treatment is confirmed. Solutions to several pragmatic problems related to the development of a register-based hip fracture incidence surveillance system are proposed. The monitoring of effectiveness of treatment is shown to be possible in terms of care episodes. Finally, an example on the justification of a more detailed performance indicator to be used in the profiling of providers is given. In conclusion, it is possible to produce useful and valid information on health system performance by using Finnish register-based data. However, that seems to be far more complicated than is typically assumed. The perspectives given in this study introduce a necessary basis for further work and help in the routine implementation of a hip fracture monitoring system in Finland.
Resumo:
This thesis studies binary time series models and their applications in empirical macroeconomics and finance. In addition to previously suggested models, new dynamic extensions are proposed to the static probit model commonly used in the previous literature. In particular, we are interested in probit models with an autoregressive model structure. In Chapter 2, the main objective is to compare the predictive performance of the static and dynamic probit models in forecasting the U.S. and German business cycle recession periods. Financial variables, such as interest rates and stock market returns, are used as predictive variables. The empirical results suggest that the recession periods are predictable and dynamic probit models, especially models with the autoregressive structure, outperform the static model. Chapter 3 proposes a Lagrange Multiplier (LM) test for the usefulness of the autoregressive structure of the probit model. The finite sample properties of the LM test are considered with simulation experiments. Results indicate that the two alternative LM test statistics have reasonable size and power in large samples. In small samples, a parametric bootstrap method is suggested to obtain approximately correct size. In Chapter 4, the predictive power of dynamic probit models in predicting the direction of stock market returns are examined. The novel idea is to use recession forecast (see Chapter 2) as a predictor of the stock return sign. The evidence suggests that the signs of the U.S. excess stock returns over the risk-free return are predictable both in and out of sample. The new "error correction" probit model yields the best forecasts and it also outperforms other predictive models, such as ARMAX models, in terms of statistical and economic goodness-of-fit measures. Chapter 5 generalizes the analysis of univariate models considered in Chapters 2 4 to the case of a bivariate model. A new bivariate autoregressive probit model is applied to predict the current state of the U.S. business cycle and growth rate cycle periods. Evidence of predictability of both cycle indicators is obtained and the bivariate model is found to outperform the univariate models in terms of predictive power.
Resumo:
Markov random fields (MRF) are popular in image processing applications to describe spatial dependencies between image units. Here, we take a look at the theory and the models of MRFs with an application to improve forest inventory estimates. Typically, autocorrelation between study units is a nuisance in statistical inference, but we take an advantage of the dependencies to smooth noisy measurements by borrowing information from the neighbouring units. We build a stochastic spatial model, which we estimate with a Markov chain Monte Carlo simulation method. The smooth values are validated against another data set increasing our confidence that the estimates are more accurate than the originals.
Resumo:
The aim of this dissertation is to model economic variables by a mixture autoregressive (MAR) model. The MAR model is a generalization of linear autoregressive (AR) model. The MAR -model consists of K linear autoregressive components. At any given point of time one of these autoregressive components is randomly selected to generate a new observation for the time series. The mixture probability can be constant over time or a direct function of a some observable variable. Many economic time series contain properties which cannot be described by linear and stationary time series models. A nonlinear autoregressive model such as MAR model can a plausible alternative in the case of these time series. In this dissertation the MAR model is used to model stock market bubbles and a relationship between inflation and the interest rate. In the case of the inflation rate we arrived at the MAR model where inflation process is less mean reverting in the case of high inflation than in the case of normal inflation. The interest rate move one-for-one with expected inflation. We use the data from the Livingston survey as a proxy for inflation expectations. We have found that survey inflation expectations are not perfectly rational. According to our results information stickiness play an important role in the expectation formation. We also found that survey participants have a tendency to underestimate inflation. A MAR model has also used to model stock market bubbles and crashes. This model has two regimes: the bubble regime and the error correction regime. In the error correction regime price depends on a fundamental factor, the price-dividend ratio, and in the bubble regime, price is independent of fundamentals. In this model a stock market crash is usually caused by a regime switch from a bubble regime to an error-correction regime. According to our empirical results bubbles are related to a low inflation. Our model also imply that bubbles have influences investment return distribution in both short and long run.
Resumo:
Opinnäytetyössä perehdytään erilaisiin tilastollisiin menetelmiin, joilla voidaan analysoida lääkityksen vaikutusta skitsofreniapotilaiden kognitiiviseen suoriutumiseen. Analysoitava aineisto on osa laajaa perheaineistoa, joka kerättiin alun perin Terveyden ja hyvinvoinnin laitoksen tutkimusprojektia varten. Projektin tarkoituksena on selvittäävakavien mielenterveyshäiriöiden geneettistä epidemiologiaa. Keskeiset työssä käsiteltävät menetelmät ovat lineaarinen regressioanalyysi, faktorianalyysi ja rakenneyhtälömallinnus. Potilaiden kognitiivista suoriutumista on mitattu neuropsykologisella testipatteristolla, joka koostuu useasta eri testistä. Lääkityksen ja kognition välisiä yhteyksiä tutkitaan aluksi lineaaristen regressiomallien avulla, joissa lääkityksen vaikutusta jokaiseen kognitiotestiin arvioidaan erikseen. Testit kuitenkin korreloivat keskenään kohtalaisen voimakkaasti muodostaen erilaisia alaryhmiä. Analyyseissa sovelletaan täten myös rakenneyhtälömallia, jossa yksittäisten testimuuttujien sijaan tarkastellaan eräänlaisia laajempia kognitiota edustavia ulottuvuuksia. Toisaalta aineistossa voidaan ajatella olevan riippuvuutta myös havaintojen tasolla. Tutkimusaineisto on kerätty hyödyntäen perhetason otantaa, joten otoksessa saattaa olla useampi samaan perheeseen kuuluva henkilö. Tällaista monitasoista aineistoa ei suoraviivaisesti voida analysoida kaikkein yleisimmin käytetyillä tilastollisilla menetelmillä, jotka yleensä on tarkoitettu satunnaisotannalla kerätyn riippumattoman aineiston analyysiin. Monitasoisuus tullaan huomioimaan analyyseissa käyttäen ns. satunnaistekijä- ja marginaalimallinnusta. Tarkastelujen tavoitteena on ennen kaikkea kokeilla erilaisten menetelmien sovellettavuutta tässä aineistossa. Huomionarvoiset seikat liittyvät toisaalta yksittäisten regressiomallien ja rakenneyhtälömallin välisiin eroihin ja toisaalta siihen, mitä merkitystä aineiston monitasoisuuden huomioimisella on. Aluksi mallinnukset suoritetaan siten, että perherakennetta ei lainkaan huomioida. Työn myöhemmässä vaiheessa samoja menetelmiä käytetään uudelleen, tällä kertaa kuitenkin olettamatta havaintoja riippumattomiksi. Otanta-asetelman huomioiminen estimoinnissa ja toisaalta erilaiset monimuuttujamenetelmät ovat tunnettuja ja yleisesti sovellettuja. Kuitenkin menetelmät, jotka yhdistävät nämä kaksi aluetta, ovat vasta melko hiljattain vakiinnuttamassa asemaansa tutkimuksessa. Työn loppuosassa perehdytään jo melko monimutkaiseen analyysitapaan, kun sovelletaan monitasoista rakenneyhtälömallia. Eri menetelmillä saadut tulokset ovat hyvin samankaltaisia, eikä monitasoisuuden huomioiminen merkittävästi muuta analyysien tuloksia ja tulkintoja tässä aineistossa. Kokeilut antavat kuitenkin hyvän ja perusteellisen kuvan lääkityksen ja kognition välisistä suhteista ja auttavat ymmärtämään eri menetelmien välisiä suhteita.
Resumo:
The Thesis presents a state-space model for a basketball league and a Kalman filter algorithm for the estimation of the state of the league. In the state-space model, each of the basketball teams is associated with a rating that represents its strength compared to the other teams. The ratings are assumed to evolve in time following a stochastic process with independent Gaussian increments. The estimation of the team ratings is based on the observed game scores that are assumed to depend linearly on the true strengths of the teams and independent Gaussian noise. The team ratings are estimated using a recursive Kalman filter algorithm that produces least squares optimal estimates for the team strengths and predictions for the scores of the future games. Additionally, if the Gaussianity assumption holds, the predictions given by the Kalman filter maximize the likelihood of the observed scores. The team ratings allow probabilistic inference about the ranking of the teams and their relative strengths as well as about the teams’ winning probabilities in future games. The predictions about the winners of the games are correct 65-70% of the time. The team ratings explain 16% of the random variation observed in the game scores. Furthermore, the winning probabilities given by the model are concurrent with the observed scores. The state-space model includes four independent parameters that involve the variances of noise terms and the home court advantage observed in the scores. The Thesis presents the estimation of these parameters using the maximum likelihood method as well as using other techniques. The Thesis also gives various example analyses related to the American professional basketball league, i.e., National Basketball Association (NBA), and regular seasons played in year 2005 through 2010. Additionally, the season 2009-2010 is discussed in full detail, including the playoffs.