32 resultados para STOCHASTIC MODELING

em Helda - Digital Repository of University of Helsinki


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Advancements in the analysis techniques have led to a rapid accumulation of biological data in databases. Such data often are in the form of sequences of observations, examples including DNA sequences and amino acid sequences of proteins. The scale and quality of the data give promises of answering various biologically relevant questions in more detail than what has been possible before. For example, one may wish to identify areas in an amino acid sequence, which are important for the function of the corresponding protein, or investigate how characteristics on the level of DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with the understanding of the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet with the challenge of deriving meaning from the increasing amounts of data. Our main concern is on modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which is used to describe the structure of data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets provide also a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. The problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters. The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Frictions are factors that hinder trading of securities in financial markets. Typical frictions include limited market depth, transaction costs, lack of infinite divisibility of securities, and taxes. Conventional models used in mathematical finance often gloss over these issues, which affect almost all financial markets, by arguing that the impact of frictions is negligible and, consequently, the frictionless models are valid approximations. This dissertation consists of three research papers, which are related to the study of the validity of such approximations in two distinct modeling problems. Models of price dynamics that are based on diffusion processes, i.e., continuous strong Markov processes, are widely used in the frictionless scenario. The first paper establishes that diffusion models can indeed be understood as approximations of price dynamics in markets with frictions. This is achieved by introducing an agent-based model of a financial market where finitely many agents trade a financial security, the price of which evolves according to price impacts generated by trades. It is shown that, if the number of agents is large, then under certain assumptions the price process of security, which is a pure-jump process, can be approximated by a one-dimensional diffusion process. In a slightly extended model, in which agents may exhibit herd behavior, the approximating diffusion model turns out to be a stochastic volatility model. Finally, it is shown that when agents' tendency to herd is strong, logarithmic returns in the approximating stochastic volatility model are heavy-tailed. The remaining papers are related to no-arbitrage criteria and superhedging in continuous-time option pricing models under small-transaction-cost asymptotics. Guasoni, Rásonyi, and Schachermayer have recently shown that, in such a setting, any financial security admits no arbitrage opportunities and there exist no feasible superhedging strategies for European call and put options written on it, as long as its price process is continuous and has the so-called conditional full support (CFS) property. Motivated by this result, CFS is established for certain stochastic integrals and a subclass of Brownian semistationary processes in the two papers. As a consequence, a wide range of possibly non-Markovian local and stochastic volatility models have the CFS property.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The future use of genetically modified (GM) plants in food, feed and biomass production requires a careful consideration of possible risks related to the unintended spread of trangenes into new habitats. This may occur via introgression of the transgene to conventional genotypes, due to cross-pollination, and via the invasion of GM plants to new habitats. Assessment of possible environmental impacts of GM plants requires estimation of the level of gene flow from a GM population. Furthermore, management measures for reducing gene flow from GM populations are needed in order to prevent possible unwanted effects of transgenes on ecosystems. This work develops modeling tools for estimating gene flow from GM plant populations in boreal environments and for investigating the mechanisms of the gene flow process. To describe spatial dimensions of the gene flow, dispersal models are developed for the local and regional scale spread of pollen grains and seeds, with special emphasis on wind dispersal. This study provides tools for describing cross-pollination between GM and conventional populations and for estimating the levels of transgenic contamination of the conventional crops. For perennial populations, a modeling framework describing the dynamics of plants and genotypes is developed, in order to estimate the gene flow process over a sequence of years. The dispersal of airborne pollen and seeds cannot be easily controlled, and small amounts of these particles are likely to disperse over long distances. Wind dispersal processes are highly stochastic due to variation in atmospheric conditions, so that there may be considerable variation between individual dispersal patterns. This, in turn, is reflected to the large amount of variation in annual levels of cross-pollination between GM and conventional populations. Even though land-use practices have effects on the average levels of cross-pollination between GM and conventional fields, the level of transgenic contamination of a conventional crop remains highly stochastic. The demographic effects of a transgene have impacts on the establishment of trangenic plants amongst conventional genotypes of the same species. If the transgene gives a plant a considerable fitness advantage in comparison to conventional genotypes, the spread of transgenes to conventional population can be strongly increased. In such cases, dominance of the transgene considerably increases gene flow from GM to conventional populations, due to the enhanced fitness of heterozygous hybrids. The fitness of GM plants in conventional populations can be reduced by linking the selectively favoured primary transgene to a disfavoured mitigation transgene. Recombination between these transgenes is a major risk related to this technique, especially because it tends to take place amongst the conventional genotypes and thus promotes the establishment of invasive transgenic plants in conventional populations.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Bacteria play an important role in many ecological systems. The molecular characterization of bacteria using either cultivation-dependent or cultivation-independent methods reveals the large scale of bacterial diversity in natural communities, and the vastness of subpopulations within a species or genus. Understanding how bacterial diversity varies across different environments and also within populations should provide insights into many important questions of bacterial evolution and population dynamics. This thesis presents novel statistical methods for analyzing bacterial diversity using widely employed molecular fingerprinting techniques. The first objective of this thesis was to develop Bayesian clustering models to identify bacterial population structures. Bacterial isolates were identified using multilous sequence typing (MLST), and Bayesian clustering models were used to explore the evolutionary relationships among isolates. Our method involves the inference of genetic population structures via an unsupervised clustering framework where the dependence between loci is represented using graphical models. The population dynamics that generate such a population stratification were investigated using a stochastic model, in which homologous recombination between subpopulations can be quantified within a gene flow network. The second part of the thesis focuses on cluster analysis of community compositional data produced by two different cultivation-independent analyses: terminal restriction fragment length polymorphism (T-RFLP) analysis, and fatty acid methyl ester (FAME) analysis. The cluster analysis aims to group bacterial communities that are similar in composition, which is an important step for understanding the overall influences of environmental and ecological perturbations on bacterial diversity. A common feature of T-RFLP and FAME data is zero-inflation, which indicates that the observation of a zero value is much more frequent than would be expected, for example, from a Poisson distribution in the discrete case, or a Gaussian distribution in the continuous case. We provided two strategies for modeling zero-inflation in the clustering framework, which were validated by both synthetic and empirical complex data sets. We show in the thesis that our model that takes into account dependencies between loci in MLST data can produce better clustering results than those methods which assume independent loci. Furthermore, computer algorithms that are efficient in analyzing large scale data were adopted for meeting the increasing computational need. Our method that detects homologous recombination in subpopulations may provide a theoretical criterion for defining bacterial species. The clustering of bacterial community data include T-RFLP and FAME provides an initial effort for discovering the evolutionary dynamics that structure and maintain bacterial diversity in the natural environment.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This thesis addresses modeling of financial time series, especially stock market returns and daily price ranges. Modeling data of this kind can be approached with so-called multiplicative error models (MEM). These models nest several well known time series models such as GARCH, ACD and CARR models. They are able to capture many well established features of financial time series including volatility clustering and leptokurtosis. In contrast to these phenomena, different kinds of asymmetries have received relatively little attention in the existing literature. In this thesis asymmetries arise from various sources. They are observed in both conditional and unconditional distributions, for variables with non-negative values and for variables that have values on the real line. In the multivariate context asymmetries can be observed in the marginal distributions as well as in the relationships of the variables modeled. New methods for all these cases are proposed. Chapter 2 considers GARCH models and modeling of returns of two stock market indices. The chapter introduces the so-called generalized hyperbolic (GH) GARCH model to account for asymmetries in both conditional and unconditional distribution. In particular, two special cases of the GARCH-GH model which describe the data most accurately are proposed. They are found to improve the fit of the model when compared to symmetric GARCH models. The advantages of accounting for asymmetries are also observed through Value-at-Risk applications. Both theoretical and empirical contributions are provided in Chapter 3 of the thesis. In this chapter the so-called mixture conditional autoregressive range (MCARR) model is introduced, examined and applied to daily price ranges of the Hang Seng Index. The conditions for the strict and weak stationarity of the model as well as an expression for the autocorrelation function are obtained by writing the MCARR model as a first order autoregressive process with random coefficients. The chapter also introduces inverse gamma (IG) distribution to CARR models. The advantages of CARR-IG and MCARR-IG specifications over conventional CARR models are found in the empirical application both in- and out-of-sample. Chapter 4 discusses the simultaneous modeling of absolute returns and daily price ranges. In this part of the thesis a vector multiplicative error model (VMEM) with asymmetric Gumbel copula is found to provide substantial benefits over the existing VMEM models based on elliptical copulas. The proposed specification is able to capture the highly asymmetric dependence of the modeled variables thereby improving the performance of the model considerably. The economic significance of the results obtained is established when the information content of the volatility forecasts derived is examined.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The stochastic filtering has been in general an estimation of indirectly observed states given observed data. This means that one is discussing conditional expected values as being one of the most accurate estimation, given the observations in the context of probability space. In my thesis, I have presented the theory of filtering using two different kind of observation process: the first one is a diffusion process which is discussed in the first chapter, while the third chapter introduces the latter which is a counting process. The majority of the fundamental results of the stochastic filtering is stated in form of interesting equations, such the unnormalized Zakai equation that leads to the Kushner-Stratonovich equation. The latter one which is known also by the normalized Zakai equation or equally by Fujisaki-Kallianpur-Kunita (FKK) equation, shows the divergence between the estimate using a diffusion process and a counting process. I have also introduced an example for the linear gaussian case, which is mainly the concept to build the so-called Kalman-Bucy filter. As the unnormalized and the normalized Zakai equations are in terms of the conditional distribution, a density of these distributions will be developed through these equations and stated by Kushner Theorem. However, Kushner Theorem has a form of a stochastic partial differential equation that needs to be verify in the sense of the existence and uniqueness of its solution, which is covered in the second chapter.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Minimum Description Length (MDL) is an information-theoretic principle that can be used for model selection and other statistical inference tasks. There are various ways to use the principle in practice. One theoretically valid way is to use the normalized maximum likelihood (NML) criterion. Due to computational difficulties, this approach has not been used very often. This thesis presents efficient floating-point algorithms that make it possible to compute the NML for multinomial, Naive Bayes and Bayesian forest models. None of the presented algorithms rely on asymptotic analysis and with the first two model classes we also discuss how to compute exact rational number solutions.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantation is based on the so-called Normalized Maximum Likelihood (NML) distribution which has been shown to possess several important theoretical properties. However, the applications of this modern version of the MDL have been quite rare because of computational complexity problems, i.e., for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In visual object detection and recognition, classifiers have two interesting characteristics: accuracy and speed. Accuracy depends on the complexity of the image features and classifier decision surfaces. Speed depends on the hardware and the computational effort required to use the features and decision surfaces. When attempts to increase accuracy lead to increases in complexity and effort, it is necessary to ask how much are we willing to pay for increased accuracy. For example, if increased computational effort implies quickly diminishing returns in accuracy, then those designing inexpensive surveillance applications cannot aim for maximum accuracy at any cost. It becomes necessary to find trade-offs between accuracy and effort. We study efficient classification of images depicting real-world objects and scenes. Classification is efficient when a classifier can be controlled so that the desired trade-off between accuracy and effort (speed) is achieved and unnecessary computations are avoided on a per input basis. A framework is proposed for understanding and modeling efficient classification of images. Classification is modeled as a tree-like process. In designing the framework, it is important to recognize what is essential and to avoid structures that are narrow in applicability. Earlier frameworks are lacking in this regard. The overall contribution is two-fold. First, the framework is presented, subjected to experiments, and shown to be satisfactory. Second, certain unconventional approaches are experimented with. This allows the separation of the essential from the conventional. To determine if the framework is satisfactory, three categories of questions are identified: trade-off optimization, classifier tree organization, and rules for delegation and confidence modeling. Questions and problems related to each category are addressed and empirical results are presented. For example, related to trade-off optimization, we address the problem of computational bottlenecks that limit the range of trade-offs. We also ask if accuracy versus effort trade-offs can be controlled after training. For another example, regarding classifier tree organization, we first consider the task of organizing a tree in a problem-specific manner. We then ask if problem-specific organization is necessary.