940 results for VARIABLE LENGTH MARKOV CHAINS
Abstract:
This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important because the direct measurement of haplotypes is difficult, whereas genotypes are easier to quantify. Haplotypes are the key players when studying, for example, the genetic causes of diseases. In this thesis, three methods for the haplotype inference problem are presented, referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point-mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point-mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes, and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point-mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability.
BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences arise because the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score, and from that a p-value is computed. This p-value is the probability of picking two sequences from the null model whose best local alignment score is at least as good. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike the previous methods, the presented framework is not affected by so-called edge effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
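As a rough illustration of the "best local alignment score" that the abstract above refers to, the following sketch computes it with the standard Smith-Waterman recurrence. The scoring values (match, mismatch, gap) and the function name are assumptions chosen for the example; the abstract does not state which scoring scheme the thesis uses, and this is not the thesis's p-value framework.

```python
def best_local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (Smith-Waterman).

    Assumed scoring: +2 per match, -1 per mismatch, -2 per gap symbol
    (linear gap penalty). These values are illustrative only.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0: an alignment may start anywhere
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

A p-value in the sense of the abstract would then be the probability, under a null model for the two sequences, that this score is at least as large as the observed one.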
Abstract:
The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon, and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry out this approach, we compare texts from European and Brazilian Portuguese. These texts are first encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion, which is a constant-free procedure for model selection. As a by-product, this provides a solution for the problem of optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also carry out a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample assembled for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.
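To make the notion of a variable length Markov chain in the abstract above concrete, here is a minimal sketch of the two ingredients it mentions: looking up the context of a past, and scoring a candidate set of contexts with a BIC-style penalised log-likelihood. The function names, the use of the empty string as a fallback root context, the way parameters are counted, and the exact penalty form are assumptions made for this example; the abstract does not give the article's formulas, and the smallest maximizer criterion itself is not implemented here.

```python
import math
from collections import defaultdict

def context_of(past, contexts):
    """Return the longest suffix of `past` that belongs to `contexts`.

    Assumption: the empty string "" is included as a fallback root context,
    so that short pasts at the start of a sequence always match something.
    """
    for start in range(len(past) + 1):
        suffix = past[start:]  # suffixes from longest to shortest
        if suffix in contexts:
            return suffix
    raise ValueError("no context matches; include '' as a root context")

def penalised_log_likelihood(seq, contexts, alphabet, c=1.0):
    """Maximum log-likelihood of `seq` minus c * df * log(len(seq)).

    Assumption: df = len(contexts) * (len(alphabet) - 1), counting the
    fallback root as a context. `c` plays the role of the penalty constant.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(seq):
        counts[context_of(seq[:i], contexts)][symbol] += 1
    loglik = 0.0
    for ctx_counts in counts.values():
        total = sum(ctx_counts.values())
        for n in ctx_counts.values():
            loglik += n * math.log(n / total)  # empirical transition probs
    df = len(contexts) * (len(alphabet) - 1)
    return loglik - c * df * math.log(len(seq))
```

Comparing this score across candidate context sets for different values of `c` shows why the choice of penalty constant matters: larger values favour smaller trees.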
Abstract:
We show how to construct a topological Markov map of the interval whose invariant probability measure is the stationary law of a given stochastic chain of infinite order. In particular, we characterize the maps corresponding to stochastic chains with memory of variable length. The problem treated here is the converse of the classical construction of the Gibbs formalism for Markov expanding maps of the interval.
Abstract:
The uniformization method (also known as randomization) is a numerically stable algorithm for computing transient distributions of a continuous time Markov chain. When the solution is needed after a long run or when the convergence is slow, the uniformization method involves a large number of matrix-vector products. Despite this, the method remains very popular due to its ease of implementation and its reliability in many practical circumstances. Because calculating the matrix-vector product is the most time-consuming part of the method, overall efficiency in solving large-scale problems can be significantly enhanced if the matrix-vector product is made more economical. In this paper, we incorporate a new relaxation strategy into the uniformization method to compute the matrix-vector products only approximately. We analyze the error introduced by these inexact matrix-vector products and discuss strategies for refining the accuracy of the relaxation while reducing the execution cost. Numerical experiments drawn from computer systems and biological systems are given to show that significant computational savings are achieved in practical applications.
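For readers unfamiliar with the method, the following is a minimal sketch of plain uniformization with exact matrix-vector products, written in pure Python. It does not implement the relaxation strategy proposed in the paper; the function name, the truncation rule, and the parameters `tol` and `max_terms` are assumptions made for this example.

```python
import math

def uniformization(Q, p0, t, tol=1e-12, max_terms=100000):
    """Transient distribution p(t) = p0 * exp(Q t) of a CTMC.

    Q is the generator (rows sum to zero), p0 the initial distribution.
    The Poisson series is truncated once the accumulated Poisson mass is
    within `tol` of 1 (an assumed stopping rule). For very large rate * t
    the starting weight exp(-rate * t) underflows; this sketch ignores that.
    """
    n = len(Q)
    rate = max(-Q[i][i] for i in range(n))  # uniformization rate
    if rate == 0.0 or t == 0.0:
        return list(p0)
    # Discrete-time transition matrix P = I + Q / rate
    P = [[Q[i][j] / rate + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    lam = rate * t
    weight = math.exp(-lam)  # Poisson probability of k = 0 jumps
    v = list(p0)             # v holds p0 * P^k
    result = [weight * x for x in v]
    accumulated = weight
    k = 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        # Exact vector-matrix product: the dominant cost per term
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam / k    # Poisson probability of k jumps
        accumulated += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result
```

The product inside the loop is the step that an inexact variant would replace with a cheaper approximation.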
Abstract:
This paper develops maximum likelihood (ML) estimation schemes for finite-state semi-Markov chains in white Gaussian noise. We assume that the semi-Markov chain is characterised by transition probabilities of known parametric form with unknown parameters. We reformulate this hidden semi-Markov model (HSM) problem in the scalar case as a two-vector homogeneous hidden Markov model (HMM) problem in which the state consists of the signal augmented by the time to last transition. With this reformulation, we apply the expectation-maximisation (EM) algorithm to obtain ML estimates of the transition probability parameters, Markov state levels and noise variance. To demonstrate our proposed schemes, motivated by neuro-biological applications, we use a damped sinusoidal parameterised function for the transition probabilities.
Abstract:
The ergodic or long-run average cost control problem for a partially observed finite-state Markov chain is studied via the associated fully observed separated control problem for the nonlinear filter. Dynamic programming equations for the latter are derived, leading to existence and characterization of optimal stationary policies.
Abstract:
Milito and Cruz have introduced a novel adaptive control scheme for finite Markov chains when a finite parametrized family of possible transition matrices is available. The scheme involves the minimization of a composite functional of the observed history of the process incorporating both control and estimation aspects. We prove the a.s. optimality of a similar scheme when the state space is countable and the parameter space is a compact subset of R.
Abstract:
We study risk-sensitive control of continuous-time Markov chains taking values in a discrete state space. We study both finite and infinite horizon problems. In the finite horizon problem, we characterize the value function via the Hamilton-Jacobi-Bellman equation and obtain an optimal Markov control. We do the same for the infinite horizon discounted cost case. In the infinite horizon average cost case, we establish the existence of an optimal stationary control under a certain Lyapunov condition. We also develop a policy iteration algorithm for finding an optimal control.
Abstract:
We develop a general theory of Markov chains realizable as random walks on R-trivial monoids. It provides explicit and simple formulas for the eigenvalues of the transition matrix, for multiplicities of the eigenvalues via Möbius inversion along a lattice, a condition for diagonalizability of the transition matrix, and some techniques for bounding the mixing time. In addition, we discuss several examples, such as Toom-Tsetlin models, an exchange walk for finite Coxeter groups, as well as examples previously studied by the authors, such as nonabelian sandpile models and the promotion Markov chain on posets. Many of these examples can be viewed as random walks on quotients of free tree monoids, a new class of monoids whose combinatorics we develop.