11 results for Markov chains
in Helda - Digital Repository of University of Helsinki
Abstract:
Genetics, the science of heredity and variation in living organisms, has a central role in medicine, in breeding crops and livestock, and in studying fundamental topics of biological sciences such as evolution and cell functioning. The field of genetics is currently developing rapidly because of recent advances in the technologies by which molecular data can be obtained from living organisms. To extract the most information from such data, the analyses need to be carried out using statistical models that are tailored to take account of the particular genetic processes. In this thesis we formulate and analyze Bayesian models for genetic marker data of contemporary individuals. The major focus is on the modeling of the unobserved recent ancestry of the sampled individuals (say, for tens of generations or so), which is carried out by using explicit probabilistic reconstructions of the pedigree structures accompanied by the gene flows at the marker loci. For such a recent history, the recombination process is the major genetic force that shapes the genomes of the individuals, and it is included in the model by assuming that the recombination fractions between the adjacent markers are known. The posterior distribution of the unobserved history of the individuals is studied conditionally on the observed marker data by using a Markov chain Monte Carlo (MCMC) algorithm. The example analyses consider estimation of the population structure, relatedness structure (both at the level of whole genomes and at each marker separately), and haplotype configurations. For situations where the pedigree structure is partially known, an algorithm to create an initial state for the MCMC algorithm is given. Furthermore, the thesis includes an extension of the model for the recent genetic history to situations where a quantitative phenotype has also been measured from the contemporary individuals.
In that case the goal is to identify positions on the genome that affect the observed phenotypic values. This task is carried out within the Bayesian framework, where the number and the relative effects of the quantitative trait loci are treated as random variables whose posterior distribution is studied conditionally on the observed genetic and phenotypic data. In addition, the thesis contains an extension of a widely-used haplotyping method, the PHASE algorithm, to settings where genetic material from several individuals has been pooled together, and the allele frequencies of each pool are determined in a single genotyping.
Abstract:
Markov random fields (MRF) are popular in image processing applications to describe spatial dependencies between image units. Here, we take a look at the theory and the models of MRFs with an application to improve forest inventory estimates. Typically, autocorrelation between study units is a nuisance in statistical inference, but we take advantage of the dependencies to smooth noisy measurements by borrowing information from the neighbouring units. We build a stochastic spatial model, which we estimate with a Markov chain Monte Carlo simulation method. The smoothed values are validated against another data set, increasing our confidence that the estimates are more accurate than the originals.
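As a rough illustration of the smoothing idea in this abstract, the Python sketch below runs a Gibbs sampler (one kind of Markov chain Monte Carlo) on a Gaussian Markov random field over a small grid. The model, the parameter names (`sigma2`, `tau`), and the synthetic data are assumptions made for this example; they are not taken from the thesis.

```python
import random


def make_demo_data(rows=10, cols=10, noise_sd=1.0, seed=1):
    """Build a synthetic smooth field and a noisy copy of it (illustrative data only)."""
    rng = random.Random(seed)
    truth = [[0.2 * (r + c) for c in range(cols)] for r in range(rows)]
    noisy = [[value + rng.gauss(0.0, noise_sd) for value in row] for row in truth]
    return truth, noisy


def gibbs_smooth(y, sigma2=1.0, tau=2.0, sweeps=400, burn_in=100, seed=0):
    """Posterior-mean smoothing under an assumed Gaussian MRF.

    Assumed model: y[r][c] = x[r][c] + noise with variance sigma2, and a
    pairwise prior penalty (tau / 2) * (x_i - x_j)^2 for 4-neighbour pairs.
    """
    rng = random.Random(seed)
    rows, cols = len(y), len(y[0])
    x = [row[:] for row in y]                     # start the chain at the data
    totals = [[0.0] * cols for _ in range(rows)]
    kept = 0
    for sweep in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                neighbours = [
                    x[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < rows and 0 <= cc < cols
                ]
                # Full conditional of one site is Gaussian with these moments.
                precision = 1.0 / sigma2 + tau * len(neighbours)
                mean = (y[r][c] / sigma2 + tau * sum(neighbours)) / precision
                x[r][c] = rng.gauss(mean, precision ** -0.5)
        if sweep >= burn_in:                      # average post-burn-in draws
            kept += 1
            for r in range(rows):
                for c in range(cols):
                    totals[r][c] += x[r][c]
    return [[total / kept for total in row] for row in totals]


def mean_squared_error(a, b):
    """Average squared difference between two grids of equal shape."""
    cells = [(u - v) ** 2 for row_a, row_b in zip(a, b) for u, v in zip(row_a, row_b)]
    return sum(cells) / len(cells)
```

Calling `gibbs_smooth(noisy)` on the output of `make_demo_data()` gives a grid that can be compared with the synthetic truth through `mean_squared_error`, which loosely mirrors the validation against a second data set mentioned in the abstract.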
Abstract:
In this thesis the use of the Bayesian approach to statistical inference in fisheries stock assessment is studied. The work was conducted in collaboration with the Finnish Game and Fisheries Research Institute, using the problem of monitoring and predicting the juvenile salmon population in the River Tornionjoki as an example application. The River Tornionjoki is the largest salmon river flowing into the Baltic Sea. This thesis tackles the issues of model formulation and model checking as well as computational problems related to Bayesian modelling in the context of fisheries stock assessment. Each article of the thesis provides a novel method either for extracting information from data obtained via a particular type of sampling system or for integrating the information about the fish stock from multiple sources in terms of a population dynamics model. Mark-recapture and removal sampling schemes and a random catch sampling method are covered for the estimation of the population size. In addition, a method for estimating the stock composition of a salmon catch based on DNA samples is presented. For most of the articles, Markov chain Monte Carlo (MCMC) simulation has been used as a tool to approximate the posterior distribution. Problems arising from the sampling method are also briefly discussed and potential solutions for these problems are proposed. Special emphasis in the discussion is given to the philosophical foundation of the Bayesian approach in the context of fisheries stock assessment. It is argued that the role of subjective prior knowledge needed in practically all parts of a Bayesian model should be recognized and consequently fully utilised in the process of model formulation.
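To make the combination of mark-recapture data and MCMC more concrete, here is a minimal Python sketch of a random-walk Metropolis sampler for the population size in a single mark-recapture experiment. It assumes a hypergeometric likelihood and a flat prior up to a hypothetical bound `max_size`; the function names and the example counts are invented for illustration and do not reproduce the models of the thesis.

```python
import math
import random


def log_binomial(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_likelihood(pop_size, marked, caught, recaptured):
    """Hypergeometric log-likelihood of the recapture count.

    Terms that do not depend on pop_size are dropped, since they cancel
    in the Metropolis acceptance ratio.
    """
    return (log_binomial(pop_size - marked, caught - recaptured)
            - log_binomial(pop_size, caught))


def sample_population_size(marked, caught, recaptured, max_size=5000,
                           iterations=20000, burn_in=2000, step=150, seed=0):
    """Draw posterior samples of the population size under a flat prior."""
    rng = random.Random(seed)
    # The population must contain at least every distinct animal seen.
    min_size = marked + caught - recaptured
    start = marked * caught // max(recaptured, 1)
    current = min(max_size, max(min_size, start))
    current_ll = log_likelihood(current, marked, caught, recaptured)
    samples = []
    for iteration in range(iterations):
        # Symmetric integer random-walk proposal.
        proposal = current + rng.randint(-step, step)
        if min_size <= proposal <= max_size:
            proposal_ll = log_likelihood(proposal, marked, caught, recaptured)
            accept_prob = math.exp(min(0.0, proposal_ll - current_ll))
            if rng.random() < accept_prob:
                current, current_ll = proposal, proposal_ll
        # Proposals outside the prior support are rejected (chain stays put).
        if iteration >= burn_in:
            samples.append(current)
    return samples
```

For example, `sample_population_size(marked=100, caught=100, recaptured=20)` returns a list of draws whose mean and quantiles summarise the posterior. An informative prior, as the abstract argues for, would enter as an extra log-prior term in the acceptance ratio.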
Abstract:
Frictions are factors that hinder trading of securities in financial markets. Typical frictions include limited market depth, transaction costs, lack of infinite divisibility of securities, and taxes. Conventional models used in mathematical finance often gloss over these issues, which affect almost all financial markets, by arguing that the impact of frictions is negligible and, consequently, the frictionless models are valid approximations. This dissertation consists of three research papers, which are related to the study of the validity of such approximations in two distinct modeling problems. Models of price dynamics that are based on diffusion processes, i.e., continuous strong Markov processes, are widely used in the frictionless scenario. The first paper establishes that diffusion models can indeed be understood as approximations of price dynamics in markets with frictions. This is achieved by introducing an agent-based model of a financial market where finitely many agents trade a financial security, the price of which evolves according to price impacts generated by trades. It is shown that, if the number of agents is large, then under certain assumptions the price process of the security, which is a pure-jump process, can be approximated by a one-dimensional diffusion process. In a slightly extended model, in which agents may exhibit herd behavior, the approximating diffusion model turns out to be a stochastic volatility model. Finally, it is shown that when agents' tendency to herd is strong, logarithmic returns in the approximating stochastic volatility model are heavy-tailed. The remaining papers are related to no-arbitrage criteria and superhedging in continuous-time option pricing models under small-transaction-cost asymptotics.
Guasoni, Rásonyi, and Schachermayer have recently shown that, in such a setting, any financial security admits no arbitrage opportunities and there exist no feasible superhedging strategies for European call and put options written on it, as long as its price process is continuous and has the so-called conditional full support (CFS) property. Motivated by this result, CFS is established for certain stochastic integrals and a subclass of Brownian semistationary processes in the two papers. As a consequence, a wide range of possibly non-Markovian local and stochastic volatility models have the CFS property.
Abstract:
Elucidating the mechanisms responsible for the patterns of species abundance, diversity, and distribution within and across ecological systems is a fundamental research focus in ecology. Species abundance patterns are shaped in a convoluted way by interplays between inter-/intra-specific interactions, environmental forcing, demographic stochasticity, and dispersal. Comprehensive models and suitable inferential and computational tools for teasing out these different factors are quite limited, even though such tools are critically needed to guide the implementation of management and conservation strategies, the efficacy of which rests on a realistic evaluation of the underlying mechanisms. This is even more so in the prevailing context of concerns over climate change progress and its potential impacts on ecosystems. This thesis utilized the flexible hierarchical Bayesian modelling framework in combination with the computer intensive methods known as Markov chain Monte Carlo, to develop methodologies for identifying and evaluating the factors that control the structure and dynamics of ecological communities. These methodologies were used to analyze data from a range of taxa: macro-moths (Lepidoptera), fish, crustaceans, birds, and rodents. Environmental stochasticity emerged as the most important driver of community dynamics, followed by density dependent regulation; the influence of inter-specific interactions on community-level variances was broadly minor. This thesis contributes to the understanding of the mechanisms underlying the structure and dynamics of ecological communities, by showing directly that environmental fluctuations rather than inter-specific competition dominate the dynamics of several systems. This finding emphasizes the need to better understand how species are affected by the environment and acknowledge species differences in their responses to environmental heterogeneity, if we are to effectively model and predict their dynamics (e.g. for management and conservation purposes). The thesis also proposes a model-based approach to integrating the niche and neutral perspectives on community structure and dynamics, making it possible for the relative importance of each category of factors to be evaluated in light of field data.
Abstract:
What can the statistical structure of natural images teach us about the human brain? Even though the visual cortex is one of the most studied parts of the brain, surprisingly little is known about how exactly images are processed to leave us with a coherent percept of the world around us, so we can recognize a friend or drive on a crowded street without any effort. By constructing probabilistic models of natural images, the goal of this thesis is to understand the structure of the stimulus that is the raison d'être for the visual system. Following the hypothesis that the optimal processing has to be matched to the structure of that stimulus, we attempt to derive computational principles, features that the visual system should compute, and properties that cells in the visual system should have. Starting from machine learning techniques such as principal component analysis and independent component analysis we construct a variety of statistical models to discover structure in natural images that can be linked to receptive field properties of neurons in primary visual cortex such as simple and complex cells. We show that by representing images with phase invariant, complex cell-like units, a better statistical description of the visual environment is obtained than with linear simple cell units, and that complex cell pooling can be learned by estimating both layers of a two-layer model of natural images. We investigate how a simplified model of the processing in the retina, where adaptation and contrast normalization take place, is connected to the natural stimulus statistics. Analyzing the effect that retinal gain control has on later cortical processing, we propose a novel method to perform gain control in a data-driven way. Finally we show how models like those presented here can be extended to capture whole visual scenes rather than just small image patches.
By using a Markov random field approach we can model images of arbitrary size, while still being able to estimate the model parameters from the data.
Abstract:
This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important as the direct measurement of haplotypes is difficult, whereas the genotypes are easier to quantify. Haplotypes are key players when studying, for example, the genetic causes of diseases. In this thesis, three methods, referred to as HaploParser, HIT, and BACH, are presented for the haplotype inference problem. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point-mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point-mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point-mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability.
BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences are due to the fact that the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score and from that, a p-value is computed. This p-value is the probability of picking two sequences from the null model whose best local alignment score is as good or better. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike the previous methods, the presented framework is not affected by so-called edge-effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
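The p-value definition given in this abstract can be illustrated with a brute-force Monte Carlo estimate. The Python sketch below is deliberately the sampling-based approach that the thesis framework avoids, not the upper-bound method itself. The scoring parameters, the uniform i.i.d. DNA null model, and all function names are assumptions chosen for the example.

```python
import random


def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (Smith-Waterman)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


def empirical_p_value(observed_score, length_a, length_b, trials=200, seed=0):
    """Estimate P(best local score >= observed_score) under the null model.

    Assumed null model: both sequences are i.i.d. uniform over A, C, G, T.
    The +1 terms keep the estimate strictly positive for a finite sample.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seq_a = "".join(rng.choice("ACGT") for _ in range(length_a))
        seq_b = "".join(rng.choice("ACGT") for _ in range(length_b))
        if smith_waterman(seq_a, seq_b) >= observed_score:
            hits += 1
    return (hits + 1) / (trials + 1)
```

An estimate of this kind cannot resolve p-values much below one over the number of trials, which is one reason an analytic upper bound of the sort sketched in the thesis is attractive for homology searches.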
Abstract:
The increased accuracy of cosmological observations, especially of the measurements of the cosmic microwave background (CMB), allows us to study the primordial perturbations in greater detail. In this thesis, we allow for the possibility of correlated isocurvature perturbations alongside the usual adiabatic perturbations. Thus far the simplest six-parameter $\Lambda$CDM model has been able to accommodate all the observational data rather well. However, we find that the 3-year WMAP data and the 2006 Boomerang data favour a nonzero nonadiabatic contribution to the CMB angular power spectrum. This is a primordial isocurvature perturbation that is positively correlated with the primordial curvature perturbation. Compared with the adiabatic $\Lambda$CDM model we have four additional parameters describing the increased complexity of the primordial perturbations. Our best-fit model has a 4% nonadiabatic contribution to the CMB temperature variance and the fit is improved by $\Delta\chi^2 = 9.7$. We can attribute this preference for isocurvature to a feature in the peak structure of the angular power spectrum, namely, the widths of the second and third acoustic peaks. Along the way, we have improved our analysis methods by identifying some issues with the parametrisation of the primordial perturbation spectra and suggesting ways to handle these. Due to the improvements, the convergence of our Markov chains is improved. The change of parametrisation has an effect on the MCMC analysis because of the change in priors. We have checked our results against this and find only marginal differences between our parametrisations.
Abstract:
Aerosols impact the planet and our daily lives through various effects, perhaps most notably those related to their climatic and health-related consequences. While there are several primary particle sources, secondary new particle formation from precursor vapors is also known to be a frequent, global phenomenon. Nevertheless, the formation mechanism of new particles, as well as the vapors participating in the process, remain a mystery. This thesis consists of studies on new particle formation specifically from the point of view of numerical modeling. A dependence of the formation rate of 3 nm particles on the sulphuric acid concentration to the power of 1-2 has been observed. This suggests that the nucleation mechanism is of first or second order with respect to the sulphuric acid concentration, in other words, a mechanism based on activation or kinetic collision of clusters. However, model studies have had difficulties in replicating the small exponents observed in nature. The work done in this thesis indicates that the exponents may be lowered by the participation of a co-condensing (and potentially nucleating) low-volatility organic vapor, or by increasing the assumed size of the critical clusters. On the other hand, the presented new and more accurate method for determining the exponent indicates high diurnal variability. Additionally, these studies included several semi-empirical nucleation rate parameterizations as well as a detailed investigation of the analysis used to determine the apparent particle formation rate. Because they cover a large proportion of the Earth's surface area, oceans could potentially prove to be climatically significant sources of secondary particles. In the absence of marine observation data, new particle formation events in a coastal region were parameterized and studied. Since the formation mechanism is believed to be similar, the new parameterization was applied in a marine scenario.
The work showed that marine CCN production is feasible in the presence of additional vapors contributing to particle growth. Finally, a new method to estimate concentrations of condensing organics was developed. The algorithm utilizes a Markov chain Monte Carlo method to determine the required combination of vapor concentrations by comparing a measured particle size distribution with one from an aerosol dynamics process model. The evaluation indicated excellent agreement against model data, and initial results with field data appear sound as well.