11 results for Causal inference
in Helda - Digital Repository of University of Helsinki
Abstract:
Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or base the inference only on the likelihood function, may be a fundamental question in theory, but in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where parametric assumptions in standard statistical models would be too rigid, more flexible model formulation, combined with fully probabilistic inference, can be achieved using hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling to various problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two of them hierarchical Bayesian modeling. Because maximum likelihood may be presented as a special case of Bayesian inference, but not the other way round, in the introductory part of this work we present a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the toolbox developed herein, to show that they are also justifiable under this more general framework. Here the assumption of exchangeability and de Finetti's representation theorem are applied repeatedly to justify the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that this same reasoning also applies under sampling from a finite population. The main emphasis here is on probability-based inference under incomplete observation due to study design. This is illustrated using a generic two-phase cohort sampling design as an example.
The alternative approaches presented for analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied to a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates and a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is also feasible in this case.
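The exchangeability argument and the posterior predictive distribution discussed in this abstract can be illustrated with a minimal conjugate sketch (a toy example, not one of the thesis's models): for exchangeable binary data, de Finetti-style reasoning licenses a Bernoulli likelihood with a prior on the success probability, and with a conjugate Beta prior the posterior predictive probability has a closed form.

```python
# Minimal sketch: posterior predictive inference for exchangeable
# binary data under a Beta-Binomial model (illustrative only; the
# thesis's models are far more general).

def posterior_predictive(successes, n, a=1.0, b=1.0):
    """P(next observation = 1 | data) under a Beta(a, b) prior.

    By de Finetti-style exchangeability, the binary observations are
    conditionally i.i.d. Bernoulli(theta); the Beta prior is conjugate,
    so the posterior is Beta(a + successes, b + n - successes) and the
    posterior predictive probability is its mean.
    """
    return (a + successes) / (a + b + n)

# With a flat Beta(1, 1) prior, 7 successes in 10 trials give
# predictive probability (1 + 7) / (2 + 10) = 2/3.
p = posterior_predictive(7, 10)
```

Under the flat Beta(1, 1) prior this reduces to Laplace's rule of succession, (s + 1) / (n + 2).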
Abstract:
Spring barley is the most important crop in Finland based on cultivated land area. Net blotch, a disease caused by Pyrenophora teres Drechs., is the most damaging disease of barley in Finland. The pressure to improve the economics and efficiency of agriculture has increased the need for more efficient plant protection methods. Development of durable host-plant resistance to net blotch is a promising possibility. However, deployment of disease-resistant crops could initiate selection pressure on the pathogen (P. teres) population. The aim of this study was to understand the population biology of P. teres and to estimate the evolutionary potential of P. teres under the selective pressure following deployment of resistance genes and application of fungicides. The study included mainly Finnish P. teres isolates. Population samples from Russia and Australia were also included. Using AFLP markers, substantial genotypic variation in P. teres populations was identified. Differences among isolates were smallest within Finnish fields and significantly higher in Krasnodar, Russia. Genetic differentiation was identified among populations from northern Europe and from Australia, and between the two forms P. teres f. teres (PTT, net form of net blotch) and P. teres f. maculata (PTM, spot form of net blotch) in Australia. Differentiation among populations was also identified based on virulence between the Finnish and Russian populations, and based on prochloraz (fungicide) tolerance in the Häme region in Finland. Surprisingly, only PTT was recovered from Finland and Russia, although both forms were earlier equally common in Finland. The reason for the shift in the occurrence of the forms in Finland remained uncertain. Both forms were found within several fields in Australia. Sexual reproduction of P. teres was supported by the recovery of both mating types in equal ratios in those areas, although the prevalence of sexual mating seems to be lower in Finland than in Australia.
The population from Krasnodar was an exception, since only one mating type was found there. Based on its substantially high genotypic variation, the Krasnodar population was suggested to represent an old P. teres population, whereas the Australian samples were suggested to represent newer populations. In conclusion, P. teres populations are differentiated at several levels. Human assistance in the dispersal of P. teres on infected barley seed is obvious and decreases the differentiation among populations. This can increase the plant protection problems caused by this pathogen. P. teres is capable of sexual reproduction in several areas, but the prevalence varies. Based on these findings, it is apparent that P. teres has the potential to pose more serious problems in barley cultivation if plant protection is neglected. Therefore, good agricultural practices, including crop rotation and the use of healthy seed, are recommended.
Abstract:
Genetics, the science of heredity and variation in living organisms, has a central role in medicine, in breeding crops and livestock, and in studying fundamental topics of the biological sciences such as evolution and cell functioning. Currently, the field of genetics is undergoing rapid development because of recent advances in the technologies by which molecular data can be obtained from living organisms. In order to extract the most information from such data, the analyses need to be carried out using statistical models that are tailored to take account of the particular genetic processes. In this thesis we formulate and analyze Bayesian models for genetic marker data of contemporary individuals. The major focus is on the modeling of the unobserved recent ancestry of the sampled individuals (say, for tens of generations or so), which is carried out by using explicit probabilistic reconstructions of the pedigree structures accompanied by the gene flows at the marker loci. For such a recent history, the recombination process is the major genetic force that shapes the genomes of the individuals, and it is included in the model by assuming that the recombination fractions between the adjacent markers are known. The posterior distribution of the unobserved history of the individuals is studied conditionally on the observed marker data by using a Markov chain Monte Carlo (MCMC) algorithm. The example analyses consider estimation of the population structure, the relatedness structure (both at the level of whole genomes and at each marker separately), and haplotype configurations. For situations where the pedigree structure is partially known, an algorithm to create an initial state for the MCMC algorithm is given. Furthermore, the thesis includes an extension of the model for the recent genetic history to situations where a quantitative phenotype has also been measured from the contemporary individuals.
In that case the goal is to identify positions on the genome that affect the observed phenotypic values. This task is carried out within the Bayesian framework, where the number and the relative effects of the quantitative trait loci are treated as random variables whose posterior distribution is studied conditionally on the observed genetic and phenotypic data. In addition, the thesis contains an extension of a widely-used haplotyping method, the PHASE algorithm, to settings where genetic material from several individuals has been pooled together, and the allele frequencies of each pool are determined in a single genotyping.
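The posterior explorations described in this abstract rely on MCMC; as a minimal illustration of the idea (a toy random-walk Metropolis sampler for a single parameter, not the thesis's pedigree or QTL samplers; all names are illustrative), one can write:

```python
import numpy as np

def metropolis(log_post, init, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis: a minimal MCMC sampler.

    Proposes theta' ~ Normal(theta, step^2) and accepts with
    probability min(1, post(theta') / post(theta)).
    """
    rng = np.random.default_rng(seed)
    theta = init
    samples = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()
        # Accept/reject on the log scale for numerical stability.
        if np.log(rng.random()) < log_post(prop) - log_post(theta):
            theta = prop
        samples[i] = theta
    return samples

# Toy target: posterior of a Normal mean with known unit variance and a
# flat prior, given simulated data centred at 2.0.
rng = np.random.default_rng(1)
data = 2.0 + rng.standard_normal(50)
log_post = lambda m: -0.5 * np.sum((data - m) ** 2)
draws = metropolis(log_post, init=0.0)
```

Discarding the first half of the chain as burn-in, the remaining draws approximate the posterior, whose mean here equals the sample mean of the data.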
Abstract:
This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important, as the direct measurement of haplotypes is difficult, whereas the genotypes are easier to quantify. Haplotypes are the key players when studying, for example, the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem, referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes, and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability.
BACH is the most accurate method presented in this thesis and has performance comparable to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences are due to the fact that the sequences are related, or arise just by chance. Similarity of sequences is measured by their best local alignment score, and from that a p-value is computed. This p-value is the probability of picking two sequences from the null model that have an equally good or better best local alignment score. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike the previous methods, the presented framework is not affected by so-called edge effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
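The best local alignment score whose significance the framework above bounds is the Smith-Waterman score; a minimal scorer with a linear gap penalty (illustrative parameter values, not those used in the thesis) can be sketched as:

```python
def local_alignment_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman, linear gaps).

    H[i][j] holds the best score of an alignment ending at s[i-1] and
    t[j-1]; the zero floor is what makes the alignment local.
    """
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # substitution
                          H[i - 1][j] + gap,       # deletion
                          H[i][j - 1] + gap)       # insertion
            best = max(best, H[i][j])
    return best

# Identical sequences align end to end: score = 2 * length.
score = local_alignment_score("ACGT", "ACGT")  # 8
```

A p-value framework then asks how often two unrelated sequences drawn from a null model reach a score at least this high.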
Abstract:
The goal of this study was to examine the role of organizational causal attribution in understanding the relation of work stressors (work-role overload, excessive role responsibility, and unpleasant physical environment) and personal resources (social support and cognitive coping) to such organizational-attitudinal outcomes as work engagement, turnover intention, and organizational identification. In some analyses, cognitive coping was also treated as an organizational outcome. Causal attribution was conceptualized in terms of four dimensions: internality-externality (attributing the cause of one's successes and failures to oneself, as opposed to external factors), stability (thinking that the cause of one's successes and failures is stable over time), globality (perceiving the cause to be operative in many areas of one's life), and controllability (believing that one can control the causes of one's successes and failures). Several hypotheses were derived from Karasek's (1989) Job Demands–Control (JD-C) model and from the Job Demands–Resources (JD-R) model (Demerouti, Bakker, Nachreiner & Schaufeli, 2001). Based on the JD-C model, a number of moderation effects were predicted, stating that the strength of the association of work stressors with the outcome variables (e.g. turnover intention) varies as a function of the causal attribution; for example, an unpleasant work environment is more strongly associated with turnover intention among those with an external locus of causality than among those with an internal locus of causality. From the JD-R model, a number of hypotheses on the mediation model were derived.
They were based on two processes posited by the model: an energy-draining process in which work stressors, along with a mediating effect of causal attribution for failures, deplete the nurses' energy, leading to turnover intention, and a motivational process in which personal resources, along with a mediating effect of causal attribution for successes, foster the nurses' engagement in their work, leading to higher organizational identification and to decreased intention to leave the nursing job. For instance, it was expected that the relationship between work stressors and turnover intention could be explained (mediated) by a tendency to attribute one's work failures to stable causes. The data were collected from among Finnish hospital nurses using e-questionnaires. Overall, 934 nurses responded to the questionnaires. Work stressors and personal resources were measured by five scales derived from the Occupational Stress Inventory-Revised (Osipow, 1998). Causal attribution was measured using the Occupational Attributional Style Questionnaire (Furnham, 2004). Work engagement was assessed through the Utrecht Work Engagement Scale (Schaufeli et al., 2002), turnover intention by the Van Veldhoven & Meijman (1994) scale, and organizational identification by the Mael & Ashforth (1992) measure. The results provided support for the function of causal attribution in the overall work stress process. Findings related to the moderation model can be divided into three main findings. First, external locus of causality, along with job level, moderated the relationship between work overload and cognitive coping; this interaction was evidenced only among nurses in non-supervisory positions. Second, external locus of causality and job level together moderated the relationship between physical environment and turnover intention.
An opposite pattern was found for this interaction: among nurses, externality exacerbated the effect of perceived unpleasantness of the physical environment on turnover intention, whereas among supervisors internality produced the same effect. Third, job level also revealed a moderation effect of controllability attribution on the relationship between physical environment and cognitive coping. Findings related to the mediation model for the energetic process indicated that the partial model, in which work stressors also have a direct effect on turnover intention, fitted the data better. In the mediation model for the motivational process, an intermediate mediation effect, in which the effects of personal resources on turnover intention went through two mediators (i.e., causal dimensions and organizational identification), fitted the data better. All dimensions of causal attribution appeared to follow a somewhat unique pattern of mediation effects, not only for the energetic but also for the motivational process. Overall, the findings on the mediation models partly supported the two simultaneous underlying processes proposed by the JD-R model. While in the energetic process the dimension of externality partially mediated the relationship between stressors and turnover, all the dimensions of causal attribution appeared to entail significant mediator effects in the motivational process. The general findings supported the moderation effect and the mediation effect of causal attribution in the work stress process. The study contributes to several research traditions, including the interaction approach and the JD-C and JD-R models. However, many potential functions of organizational causal attribution are yet to be evaluated by relevant academic and organizational research.
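Statistically, the moderation effects in this abstract correspond to interaction terms in a regression model; a minimal sketch with simulated data (all variable names and effect sizes are illustrative, not the study's) is:

```python
import numpy as np

# Sketch: moderation as a stressor-by-attribution interaction in OLS.
# Simulated data only; variable names are illustrative stand-ins.
rng = np.random.default_rng(0)
n = 1000
stressor = rng.standard_normal(n)      # e.g. perceived work overload
externality = rng.standard_normal(n)   # e.g. external locus of causality
# True model: both main effects plus their product (the moderation).
outcome = (0.4 * stressor + 0.3 * externality
           + 0.5 * stressor * externality + rng.standard_normal(n))

# Design matrix: intercept, main effects, interaction term.
X = np.column_stack([np.ones(n), stressor, externality,
                     stressor * externality])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
# beta[3] estimates the interaction (moderation) coefficient, near 0.5.
```

A significant interaction coefficient means the strength of the stressor-outcome association varies with the attribution dimension, which is exactly the moderation hypothesis described above.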
Keywords: organizational causal attribution, optimistic/pessimistic attributional style, work stressors, organizational stress process, stressors in the nursing profession, hospital nursing, JD-R model, personal resources, turnover intention, work engagement, organizational identification.
Abstract:
In the thesis we consider inference for cointegration in vector autoregressive (VAR) models. The thesis consists of an introduction and four papers. The first paper proposes a new test for cointegration in VAR models that is directly based on the eigenvalues of the least squares (LS) estimate of the autoregressive matrix. In the second paper we compare a small-sample correction for the likelihood ratio (LR) test of cointegrating rank with the bootstrap. The simulation experiments show that the bootstrap works very well in practice and dominates the correction factor. The tests are applied to international stock price data, and the finite sample performance of the tests is investigated by simulating the data. The third paper studies the demand for money in Sweden in 1970–2000 using the I(2) model. In the fourth paper we re-examine the evidence of cointegration between international stock prices. The paper shows that some of the previous empirical results can be explained by the small-sample bias and size distortion of Johansen's LR tests for cointegration. In all papers we work with two data sets. The first data set is a Swedish money demand data set with observations on the money stock, the consumer price index, gross domestic product (GDP), the short-term interest rate and the long-term interest rate. The data are quarterly and the sample period is 1970(1)–2000(1). The second data set consists of month-end stock market index observations for Finland, France, Germany, Sweden, the United Kingdom and the United States from 1980(1) to 1997(2). Both data sets are typical of the sample sizes encountered in economic data, and the applications illustrate the usefulness of the models and tests discussed in the thesis.
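The eigenvalue-based idea behind the first paper's test can be sketched with a toy simulation (illustrative only; this is not the paper's test statistic and involves no critical values): in a cointegrated VAR(1), the LS estimate of the autoregressive matrix should have one root near unity (the common stochastic trend) and the remaining roots well inside the unit circle.

```python
import numpy as np

# Toy sketch of the eigenvalue idea behind testing for cointegration in
# a VAR(1): two series share one random-walk trend, so the LS estimate
# of the autoregressive matrix has one root near 1 and one small root.
# (Illustrative only; not the thesis's test statistic.)
rng = np.random.default_rng(0)
T = 500
trend = np.cumsum(rng.standard_normal(T))          # common I(1) trend
y = np.column_stack([trend + rng.standard_normal(T),
                     0.5 * trend + rng.standard_normal(T)])

# LS estimate of A in y_t = A @ y_{t-1} + e_t.
coef, *_ = np.linalg.lstsq(y[:-1], y[1:], rcond=None)
A = coef.T
roots = np.sort(np.abs(np.linalg.eigvals(A)))
# roots[-1] lies close to 1 (the unit root); roots[0] is well below 1,
# because the cointegrating combination is mean-reverting.
```

A formal test must of course account for the sampling distribution of these estimated eigenvalues, which is what the paper provides.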
Abstract:
Modern sample surveys started to spread after statisticians at the U.S. Bureau of the Census in the 1940s had developed a sampling design for the Current Population Survey (CPS). A significant factor was also that digital computers became available to statisticians. In the beginning of the 1950s, the theory was documented in textbooks on survey sampling. This thesis is about the development of statistical inference for sample surveys. The idea of statistical inference was first enunciated by the French scientist P. S. Laplace. In 1781, he published a plan for a partial investigation in which he determined the sample size needed to reach the desired accuracy in estimation. The plan was based on Laplace's Principle of Inverse Probability and on his derivation of the Central Limit Theorem. They were published in a memoir in 1774, which is one of the origins of statistical inference. Laplace's inference model was based on Bernoulli trials and binomial probabilities. He assumed that populations were changing constantly, which he depicted by assuming a priori distributions for the parameters. Laplace's inference model dominated statistical thinking for a century. Sample selection in Laplace's investigations was purposive. In 1894, at the International Statistical Institute meeting, the Norwegian Anders Kiaer presented the idea of the Representative Method for drawing samples. Its idea, still prevailing, was that the sample should be a miniature of the population. The virtues of random sampling were known, but practical problems of sample selection and data collection hindered its use. Arthur Bowley realized the potential of Kiaer's method and, in the beginning of the 20th century, carried out several surveys in the UK. He also developed the theory of statistical inference for finite populations, based on Laplace's inference model. R. A. Fisher's contributions in the 1920s constitute a watershed in statistical science: he revolutionized the theory of statistics.
In addition, he introduced a new statistical inference model which is still the prevailing paradigm. The essential ideas are to draw samples repeatedly from the same population and to assume that population parameters are constants. Fisher's theory did not include a priori probabilities. Jerzy Neyman adopted Fisher's inference model and applied it to finite populations, with the difference that Neyman's inference model does not include any assumptions about the distributions of the study variables. Applying Fisher's fiducial argument, he developed the theory of confidence intervals. Neyman's last contribution to survey sampling presented a theory for double sampling. This gave statisticians at the U.S. Census Bureau the central idea for developing the complex survey design of the CPS. An important criterion was to have a method in which the costs of data collection were acceptable and which provided approximately equal interviewer workloads, besides sufficient accuracy in estimation.
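Neyman's design-based confidence intervals for a finite-population mean under simple random sampling can be sketched as follows (a textbook formula, far simpler than the CPS design discussed above):

```python
from math import sqrt

def srs_mean_ci(sample, N, z=1.96):
    """Approximate 95% confidence interval for a finite-population mean
    under simple random sampling without replacement.

    Uses the finite population correction (1 - n/N), in the spirit of
    Neyman's design-based inference: the population values are fixed,
    and the randomness comes from repeated sampling.
    """
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    se = sqrt((1 - n / N) * s2 / n)
    return mean - z * se, mean + z * se

# Toy data: a sample of n = 5 from a population of N = 1000.
lo, hi = srs_mean_ci([4, 8, 6, 5, 7], N=1000)
```

As the sampling fraction n/N approaches 1, the correction shrinks the interval toward the point where the whole population has been observed and no sampling uncertainty remains.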