916 resultados para approximated inference
Resumo:
The increasing amount of information that is annotated against standardised semantic resources offers opportunities to incorporate sophisticated levels of reasoning, or inference, into the retrieval process. In this position paper, we reflect on the need to incorporate semantic inference into retrieval (in particular for medical information retrieval) as well as previous attempts that have been made so far with mixed success. Medical information retrieval is a fertile ground for testing inference mechanisms to augment retrieval. The medical domain offers a plethora of carefully curated, structured, semantic resources, along with well established entity extraction and linking tools, and search topics that intuitively require a number of different inferential processes (e.g., conceptual similarity, conceptual implication, etc.). We argue that integrating semantic inference in information retrieval has the potential to uncover a large amount of information that otherwise would be inaccessible; but inference is also risky and, if not used cautiously, can harm retrieval.
Resumo:
A recurring question for cognitive science is whether functional neuroimaging data can provide evidence for or against psychological theories. As posed, the question reflects an adherence to a popular scientific method known as 'strong inference'. The method entails constructing multiple hypotheses (Hs) and designing experiments so that alternative possible outcomes will refute at least one (i.e., 'falsify' it). In this article, after first delineating some well-documented limitations of strong inference, I provide examples of functional neuroimaging data being used to test Hs from rival modular information-processing models of spoken word production. 'Strong inference' for neuroimaging involves first establishing a systematic mapping of 'processes to processors' for a common modular architecture. Alternate Hs are then constructed from psychological theories that attribute the outcome of manipulating an experimental factor to two or more distinct processing stages within this architecture. Hs are then refutable by a finding of activity differentiated spatially and chronometrically by experimental condition. When employed in this manner, the data offered by functional neuroimaging may be more useful for adjudicating between accounts of processing loci than behavioural measures.
Resumo:
Recent axiomatic derivations of the maximum entropy principle from consistency conditions are critically examined. We show that proper application of consistency conditions alone allows a wider class of functionals, essentially of the form ∝ dx p(x)[p(x)/g(x)] s , for some real numbers, to be used for inductive inference and the commonly used form − ∝ dx p(x)ln[p(x)/g(x)] is only a particular case. The role of the prior densityg(x) is clarified. It is possible to regard it as a geometric factor, describing the coordinate system used and it does not represent information of the same kind as obtained by measurements on the system in the form of expectation values.
Inference of the genetic architecture underlying BMI and height with the use of 20,240 sibling pairs
Resumo:
Evidence that complex traits are highly polygenic has been presented by population-based genome-wide association studies (GWASs) through the identification of many significant variants, as well as by family-based de novo sequencing studies indicating that several traits have a large mutational target size. Here, using a third study design, we show results consistent with extreme polygenicity for body mass index (BMI) and height. On a sample of 20,240 siblings (from 9,570 nuclear families), we used a within-family method to obtain narrow-sense heritability estimates of 0.42 (SE = 0.17, p = 0.01) and 0.69 (SE = 0.14, p = 6 x 10(-)(7)) for BMI and height, respectively, after adjusting for covariates. The genomic inflation factors from locus-specific linkage analysis were 1.69 (SE = 0.21, p = 0.04) for BMI and 2.18 (SE = 0.21, p = 2 x 10(-10)) for height. This inflation is free of confounding and congruent with polygenicity, consistent with observations of ever-increasing genomic-inflation factors from GWASs with large sample sizes, implying that those signals are due to true genetic signals across the genome rather than population stratification. We also demonstrate that the distribution of the observed test statistics is consistent with both rare and common variants underlying a polygenic architecture and that previous reports of linkage signals in complex traits are probably a consequence of polygenic architecture rather than the segregation of variants with large effects. The convergent empirical evidence from GWASs, de novo studies, and within-family segregation implies that family-based sequencing studies for complex traits require very large sample sizes because the effects of causal variants are small on average.
Resumo:
The family of location and scale mixtures of Gaussians has the ability to generate a number of flexible distributional forms. The family nests as particular cases several important asymmetric distributions like the Generalized Hyperbolic distribution. The Generalized Hyperbolic distribution in turn nests many other well known distributions such as the Normal Inverse Gaussian. In a multivariate setting, an extension of the standard location and scale mixture concept is proposed into a so called multiple scaled framework which has the advantage of allowing different tail and skewness behaviours in each dimension with arbitrary correlation between dimensions. Estimation of the parameters is provided via an EM algorithm and extended to cover the case of mixtures of such multiple scaled distributions for application to clustering. Assessments on simulated and real data confirm the gain in degrees of freedom and flexibility in modelling data of varying tail behaviour and directional shape.
Resumo:
Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or base the inference only on the likelihood function, may be a fundamental question in theory, but in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where parametric assumptions in standard statistical models would be too rigid, more flexible model formulation, combined with fully probabilistic inference, can be achieved using hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling under various problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two of them hierarchical Bayesian modeling. Because maximum likelihood may be presented as a special case of Bayesian inference, but not the other way round, in the introductory part of this work we present a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the toolbox equipped herein, to show that they are also justifiable under this more general framework. Here the assumption of exchangeability and de Finetti's representation theorem are applied repeatedly for justifying the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that this same reasoning can be applied also under sampling from a finite population. The main emphasis here is in probability-based inference under incomplete observation due to study design. This is illustrated using a generic two-phase cohort sampling design as an example. The alternative approaches presented for analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied for a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates and a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is feasible also in this case.
Resumo:
Genetics, the science of heredity and variation in living organisms, has a central role in medicine, in breeding crops and livestock, and in studying fundamental topics of biological sciences such as evolution and cell functioning. Currently the field of genetics is under a rapid development because of the recent advances in technologies by which molecular data can be obtained from living organisms. In order that most information from such data can be extracted, the analyses need to be carried out using statistical models that are tailored to take account of the particular genetic processes. In this thesis we formulate and analyze Bayesian models for genetic marker data of contemporary individuals. The major focus is on the modeling of the unobserved recent ancestry of the sampled individuals (say, for tens of generations or so), which is carried out by using explicit probabilistic reconstructions of the pedigree structures accompanied by the gene flows at the marker loci. For such a recent history, the recombination process is the major genetic force that shapes the genomes of the individuals, and it is included in the model by assuming that the recombination fractions between the adjacent markers are known. The posterior distribution of the unobserved history of the individuals is studied conditionally on the observed marker data by using a Markov chain Monte Carlo algorithm (MCMC). The example analyses consider estimation of the population structure, relatedness structure (both at the level of whole genomes as well as at each marker separately), and haplotype configurations. For situations where the pedigree structure is partially known, an algorithm to create an initial state for the MCMC algorithm is given. Furthermore, the thesis includes an extension of the model for the recent genetic history to situations where also a quantitative phenotype has been measured from the contemporary individuals. In that case the goal is to identify positions on the genome that affect the observed phenotypic values. This task is carried out within the Bayesian framework, where the number and the relative effects of the quantitative trait loci are treated as random variables whose posterior distribution is studied conditionally on the observed genetic and phenotypic data. In addition, the thesis contains an extension of a widely-used haplotyping method, the PHASE algorithm, to settings where genetic material from several individuals has been pooled together, and the allele frequencies of each pool are determined in a single genotyping.
Resumo:
This thesis which consists of an introduction and four peer-reviewed original publications studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important as the direct measurement of haplotypes is difficult, whereas the genotypes are easier to quantify. Haplotypes are the key-players when studying for example the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point-mutations in a biologically plausible way. In this mosaic model, the current population is assumed to be evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point--mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point-mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability. BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences are due to the fact that the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score and from that, a p-value is computed. This p-value is the probability of picking two sequences from the null model that have as good or better best local alignment score. Local alignment significance is used routinely for example in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike the previous methods, the presented framework is not affeced by so-called edge-effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
Resumo:
In the thesis we consider inference for cointegration in vector autoregressive (VAR) models. The thesis consists of an introduction and four papers. The first paper proposes a new test for cointegration in VAR models that is directly based on the eigenvalues of the least squares (LS) estimate of the autoregressive matrix. In the second paper we compare a small sample correction for the likelihood ratio (LR) test of cointegrating rank and the bootstrap. The simulation experiments show that the bootstrap works very well in practice and dominates the correction factor. The tests are applied to international stock prices data, and the .nite sample performance of the tests are investigated by simulating the data. The third paper studies the demand for money in Sweden 1970—2000 using the I(2) model. In the fourth paper we re-examine the evidence of cointegration between international stock prices. The paper shows that some of the previous empirical results can be explained by the small-sample bias and size distortion of Johansen’s LR tests for cointegration. In all papers we work with two data sets. The first data set is a Swedish money demand data set with observations on the money stock, the consumer price index, gross domestic product (GDP), the short-term interest rate and the long-term interest rate. The data are quarterly and the sample period is 1970(1)—2000(1). The second data set consists of month-end stock market index observations for Finland, France, Germany, Sweden, the United Kingdom and the United States from 1980(1) to 1997(2). Both data sets are typical of the sample sizes encountered in economic data, and the applications illustrate the usefulness of the models and tests discussed in the thesis.
Resumo:
Modern sample surveys started to spread after statistician at the U.S. Bureau of the Census in the 1940s had developed a sampling design for the Current Population Survey (CPS). A significant factor was also that digital computers became available for statisticians. In the beginning of 1950s, the theory was documented in textbooks on survey sampling. This thesis is about the development of the statistical inference for sample surveys. For the first time the idea of statistical inference was enunciated by a French scientist, P. S. Laplace. In 1781, he published a plan for a partial investigation in which he determined the sample size needed to reach the desired accuracy in estimation. The plan was based on Laplace s Principle of Inverse Probability and on his derivation of the Central Limit Theorem. They were published in a memoir in 1774 which is one of the origins of statistical inference. Laplace s inference model was based on Bernoulli trials and binominal probabilities. He assumed that populations were changing constantly. It was depicted by assuming a priori distributions for parameters. Laplace s inference model dominated statistical thinking for a century. Sample selection in Laplace s investigations was purposive. In 1894 in the International Statistical Institute meeting, Norwegian Anders Kiaer presented the idea of the Representative Method to draw samples. Its idea was that the sample would be a miniature of the population. It is still prevailing. The virtues of random sampling were known but practical problems of sample selection and data collection hindered its use. Arhtur Bowley realized the potentials of Kiaer s method and in the beginning of the 20th century carried out several surveys in the UK. He also developed the theory of statistical inference for finite populations. It was based on Laplace s inference model. R. A. Fisher contributions in the 1920 s constitute a watershed in the statistical science He revolutionized the theory of statistics. In addition, he introduced a new statistical inference model which is still the prevailing paradigm. The essential idea is to draw repeatedly samples from the same population and the assumption that population parameters are constants. Fisher s theory did not include a priori probabilities. Jerzy Neyman adopted Fisher s inference model and applied it to finite populations with the difference that Neyman s inference model does not include any assumptions of the distributions of the study variables. Applying Fisher s fiducial argument he developed the theory for confidence intervals. Neyman s last contribution to survey sampling presented a theory for double sampling. This gave the central idea for statisticians at the U.S. Census Bureau to develop the complex survey design for the CPS. Important criterion was to have a method in which the costs of data collection were acceptable, and which provided approximately equal interviewer workloads, besides sufficient accuracy in estimation.
Resumo:
Our ability to infer the protein quaternary structure automatically from atom and lattice information is inadequate, especially for weak complexes, and heteromeric quaternary structures. Several approaches exist, but they have limited performance. Here, we present a new scheme to infer protein quaternary structure from lattice and protein information, with all-around coverage for strong, weak and very weak affinity homomeric and heteromeric complexes. The scheme combines naive Bayes classifier and point group symmetry under Boolean framework to detect quaternary structures in crystal lattice. It consistently produces >= 90% coverage across diverse benchmarking data sets, including a notably superior 95% coverage for recognition heteromeric complexes, compared with 53% on the same data set by current state-of-the-art method. The detailed study of a limited number of prediction-failed cases offers interesting insights into the intriguing nature of protein contacts in lattice. The findings have implications for accurate inference of quaternary states of proteins, especially weak affinity complexes.
Resumo:
Satisfiability algorithms for propositional logic have improved enormously in recently years. This improvement increases the attractiveness of satisfiability methods for first-order logic that reduce the problem to a series of ground-level satisfiability problems. R. Jeroslow introduced a partial instantiation method of this kind that differs radically from the standard resolution-based methods. This paper lays the theoretical groundwork for an extension of his method that is general enough and efficient enough for general logic programming with indefinite clauses. In particular we improve Jeroslow's approach by (1) extending it to logic with functions, (2) accelerating it through the use of satisfiers, as introduced by Gallo and Rago, and (3) simplifying it to obtain further speedup. We provide a similar development for a "dual" partial instantiation approach defined by Hooker and suggest a primal-dual strategy. We prove correctness of the primal and dual algorithms for full first-order logic with functions, as well as termination on unsatisfiable formulas. We also report some preliminary computational results.
Resumo:
Prediction of variable bit rate compressed video traffic is critical to dynamic allocation of resources in a network. In this paper, we propose a technique for preprocessing the dataset used for training a video traffic predictor. The technique involves identifying the noisy instances in the data using a fuzzy inference system. We focus on three prediction techniques, namely, linear regression, neural network and support vector regression and analyze their performance on H.264 video traces. Our experimental results reveal that data preprocessing greatly improves the performance of linear regression and neural network, but is not effective on support vector regression.
Resumo:
Null dereferences are a bane of programming in languages such as Java. In this paper we propose a sound, demand-driven, inter-procedurally context-sensitive dataflow analysis technique to verify a given dereference as safe or potentially unsafe. Our analysis uses an abstract lattice of formulas to find a pre-condition at the entry of the program such that a null-dereference can occur only if the initial state of the program satisfies this pre-condition. We use a simplified domain of formulas, abstracting out integer arithmetic, as well as unbounded access paths due to recursive data structures. For the sake of precision we model aliasing relationships explicitly in our abstract lattice, enable strong updates, and use a limited notion of path sensitivity. For the sake of scalability we prune formulas continually as they get propagated, reducing to true conjuncts that are less likely to be useful in validating or invalidating the formula. We have implemented our approach, and present an evaluation of it on a set of ten real Java programs. Our results show that the set of design features we have incorporated enable the analysis to (a) explore long, inter-procedural paths to verify each dereference, with (b) reasonable accuracy, and (c) very quick response time per dereference, making it suitable for use in desktop development environments.