11 resultados para state-space methods

em Duke University


Relevância:

100.00% 100.00%

Publicador:

Resumo:

We describe a strategy for Markov chain Monte Carlo analysis of non-linear, non-Gaussian state-space models involving batch analysis for inference on dynamic, latent state variables and fixed model parameters. The key innovation is a Metropolis-Hastings method for the time series of state variables based on sequential approximation of filtering and smoothing densities using normal mixtures. These mixtures are propagated through the non-linearities using an accurate, local mixture approximation method, and we use a regenerating procedure to deal with potential degeneracy of mixture components. This provides accurate, direct approximations to sequential filtering and retrospective smoothing distributions, and hence a useful construction of global Metropolis proposal distributions for simulation of posteriors for the set of states. This analysis is embedded within a Gibbs sampler to include uncertain fixed parameters. We give an example motivated by an application in systems biology. Supplemental materials provide an example based on a stochastic volatility model as well as MATLAB code.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.

We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.

We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.

Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.

This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of protein-DNA interaction landscape through data integration.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Multi-output Gaussian processes provide a convenient framework for multi-task problems. An illustrative and motivating example of a multi-task problem is multi-region electrophysiological time-series data, where experimentalists are interested in both power and phase coherence between channels. Recently, the spectral mixture (SM) kernel was proposed to model the spectral density of a single task in a Gaussian process framework. This work develops a novel covariance kernel for multiple outputs, called the cross-spectral mixture (CSM) kernel. This new, flexible kernel represents both the power and phase relationship between multiple observation channels. The expressive capabilities of the CSM kernel are demonstrated through implementation of 1) a Bayesian hidden Markov model, where the emission distribution is a multi-output Gaussian process with a CSM covariance kernel, and 2) a Gaussian process factor analysis model, where factor scores represent the utilization of cross-spectral neural circuits. Results are presented for measured multi-region electrophysiological data.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Free energy calculations are a computational method for determining thermodynamic quantities, such as free energies of binding, via simulation.

Currently, due to computational and algorithmic limitations, free energy calculations are limited in scope.

In this work, we propose two methods for improving the efficiency of free energy calculations.

First, we expand the state space of alchemical intermediates, and show that this expansion enables us to calculate free energies along lower variance paths.

We use Q-learning, a reinforcement learning technique, to discover and optimize paths at low computational cost.

Second, we reduce the cost of sampling along a given path by using sequential Monte Carlo samplers.

We develop a new free energy estimator, pCrooks (pairwise Crooks), a variant on the Crooks fluctuation theorem (CFT), which enables decomposition of the variance of the free energy estimate for discrete paths, while retaining beneficial characteristics of CFT.

Combining these two advancements, we show that for some test models, optimal expanded-space paths have a nearly 80% reduction in variance relative to the standard path.

Additionally, our free energy estimator converges at a more consistent rate and on average 1.8 times faster when we enable path searching, even when the cost of path discovery and refinement is considered.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This paper focuses on the nature of jamming, as seen in two-dimensional frictional granular systems consisting of photoelastic particles. The photoelastic technique is unique at this time, in its capability to provide detailed particle-scale information on forces and kinematic quantities such as particle displacements and rotations. These experiments first explore isotropic stress states near point J through measurements of the mean contact number per particle, Z, and the pressure, P as functions of the packing fraction, . In this case, the experiments show some but not all aspects of jamming, as expected on the basis of simulations and models that typically assume conservative, hence frictionless, forces between particles. Specifically, there is a rapid growth in Z, at a reasonable which we identify with as c. It is possible to fit Z and P, to power law expressions in - c above c, and to obtain exponents that are in agreement with simulations and models. However, the experiments differ from theory on several points, as typified by the rounding that is observed in Z and P near c. The application of shear to these same 2D granular systems leads to phenomena that are qualitatively different from the standard picture of jamming. In particular, there is a range of packing fractions below c, where the application of shear strain at constant leads to jammed stress-anisotropic states, i.e. they have a non-zero shear stress, τ. The application of shear strain to an initially isotropically compressed (hence jammed) state, does not lead to an unjammed state per se. Rather, shear strain at constant first leads to an increase of both τ and P. Additional strain leads to a succession of jammed states interspersed with relatively localized failures of the force network leading to other stress-anisotropic states that are jammed at typically somewhat lower stress. The locus of jammed states requires a state space that involves not only and τ, but also P. P, τ, and Z are all hysteretic functions of shear strain for fixed . However, we find that both P and τ are roughly linear functions of Z for strains large enough to jam the system. This implies that these shear-jammed states satisfy a Coulomb like-relation, τ = μP. © 2010 The Royal Society of Chemistry.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Numerical approximation of the long time behavior of a stochastic di.erential equation (SDE) is considered. Error estimates for time-averaging estimators are obtained and then used to show that the stationary behavior of the numerical method converges to that of the SDE. The error analysis is based on using an associated Poisson equation for the underlying SDE. The main advantages of this approach are its simplicity and universality. It works equally well for a range of explicit and implicit schemes, including those with simple simulation of random variables, and for hypoelliptic SDEs. To simplify the exposition, we consider only the case where the state space of the SDE is a torus, and we study only smooth test functions. However, we anticipate that the approach can be applied more widely. An analogy between our approach and Stein's method is indicated. Some practical implications of the results are discussed. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

A class of multi-process models is developed for collections of time indexed count data. Autocorrelation in counts is achieved with dynamic models for the natural parameter of the binomial distribution. In addition to modeling binomial time series, the framework includes dynamic models for multinomial and Poisson time series. Markov chain Monte Carlo (MCMC) and Po ́lya-Gamma data augmentation (Polson et al., 2013) are critical for fitting multi-process models of counts. To facilitate computation when the counts are high, a Gaussian approximation to the P ́olya- Gamma random variable is developed.

Three applied analyses are presented to explore the utility and versatility of the framework. The first analysis develops a model for complex dynamic behavior of themes in collections of text documents. Documents are modeled as a “bag of words”, and the multinomial distribution is used to characterize uncertainty in the vocabulary terms appearing in each document. State-space models for the natural parameters of the multinomial distribution induce autocorrelation in themes and their proportional representation in the corpus over time.

The second analysis develops a dynamic mixed membership model for Poisson counts. The model is applied to a collection of time series which record neuron level firing patterns in rhesus monkeys. The monkey is exposed to two sounds simultaneously, and Gaussian processes are used to smoothly model the time-varying rate at which the neuron’s firing pattern fluctuates between features associated with each sound in isolation.

The third analysis presents a switching dynamic generalized linear model for the time-varying home run totals of professional baseball players. The model endows each player with an age specific latent natural ability class and a performance enhancing drug (PED) use indicator. As players age, they randomly transition through a sequence of ability classes in a manner consistent with traditional aging patterns. When the performance of the player significantly deviates from the expected aging pattern, he is identified as a player whose performance is consistent with PED use.

All three models provide a mechanism for sharing information across related series locally in time. The models are fit with variations on the P ́olya-Gamma Gibbs sampler, MCMC convergence diagnostics are developed, and reproducible inference is emphasized throughout the dissertation.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The advances in three related areas of state-space modeling, sequential Bayesian learning, and decision analysis are addressed, with the statistical challenges of scalability and associated dynamic sparsity. The key theme that ties the three areas is Bayesian model emulation: solving challenging analysis/computational problems using creative model emulators. This idea defines theoretical and applied advances in non-linear, non-Gaussian state-space modeling, dynamic sparsity, decision analysis and statistical computation, across linked contexts of multivariate time series and dynamic networks studies. Examples and applications in financial time series and portfolio analysis, macroeconomics and internet studies from computational advertising demonstrate the utility of the core methodological innovations.

Chapter 1 summarizes the three areas/problems and the key idea of emulating in those areas. Chapter 2 discusses the sequential analysis of latent threshold models with use of emulating models that allows for analytical filtering to enhance the efficiency of posterior sampling. Chapter 3 examines the emulator model in decision analysis, or the synthetic model, that is equivalent to the loss function in the original minimization problem, and shows its performance in the context of sequential portfolio optimization. Chapter 4 describes the method for modeling the steaming data of counts observed on a large network that relies on emulating the whole, dependent network model by independent, conjugate sub-models customized to each set of flow. Chapter 5 reviews those advances and makes the concluding remarks.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This dissertation examined the response to termination of CO2 enrichment of a forest ecosystem exposed to long-term elevated atmospheric CO2 condition, and aimed at investigating responses and their underlying mechanisms of two important factors of carbon cycle in the ecosystem, stomatal conductance and soil respiration. Because the contribution of understory vegetation to the entire ecosystem grew with time, we first investigated the effect of elevated CO2 on understory vegetation. Potential growth enhancing effect of elevated CO2 were not observed, and light seemed to be a limiting factor. Secondly, we examined the importance of aerodynamic conductance to determine canopy conductance, and found that its effect can be negligible. Responses of stomatal conductance and soil respiration were assessed using Bayesian state space model. In two years after the termination of CO2 enrichment, stomatal conductance in formerly elevated CO2 returned to ambient level, while soil respiration became smaller than ambient level and did not recovered to ambient in two years.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

BACKGROUND: Dropouts and missing data are nearly-ubiquitous in obesity randomized controlled trails, threatening validity and generalizability of conclusions. Herein, we meta-analytically evaluate the extent of missing data, the frequency with which various analytic methods are employed to accommodate dropouts, and the performance of multiple statistical methods. METHODOLOGY/PRINCIPAL FINDINGS: We searched PubMed and Cochrane databases (2000-2006) for articles published in English and manually searched bibliographic references. Articles of pharmaceutical randomized controlled trials with weight loss or weight gain prevention as major endpoints were included. Two authors independently reviewed each publication for inclusion. 121 articles met the inclusion criteria. Two authors independently extracted treatment, sample size, drop-out rates, study duration, and statistical method used to handle missing data from all articles and resolved disagreements by consensus. In the meta-analysis, drop-out rates were substantial with the survival (non-dropout) rates being approximated by an exponential decay curve (e(-lambdat)) where lambda was estimated to be .0088 (95% bootstrap confidence interval: .0076 to .0100) and t represents time in weeks. The estimated drop-out rate at 1 year was 37%. Most studies used last observation carried forward as the primary analytic method to handle missing data. We also obtained 12 raw obesity randomized controlled trial datasets for empirical analyses. Analyses of raw randomized controlled trial data suggested that both mixed models and multiple imputation performed well, but that multiple imputation may be more robust when missing data are extensive. CONCLUSION/SIGNIFICANCE: Our analysis offers an equation for predictions of dropout rates useful for future study planning. Our raw data analyses suggests that multiple imputation is better than other methods for handling missing data in obesity randomized controlled trials, followed closely by mixed models. We suggest these methods supplant last observation carried forward as the primary method of analysis.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Recent research into resting-state functional magnetic resonance imaging (fMRI) has shown that the brain is very active during rest. This thesis work utilizes blood oxygenation level dependent (BOLD) signals to investigate the spatial and temporal functional network information found within resting-state data, and aims to investigate the feasibility of extracting functional connectivity networks using different methods as well as the dynamic variability within some of the methods. Furthermore, this work looks into producing valid networks using a sparsely-sampled sub-set of the original data.

In this work we utilize four main methods: independent component analysis (ICA), principal component analysis (PCA), correlation, and a point-processing technique. Each method comes with unique assumptions, as well as strengths and limitations into exploring how the resting state components interact in space and time.

Correlation is perhaps the simplest technique. Using this technique, resting-state patterns can be identified based on how similar the time profile is to a seed region’s time profile. However, this method requires a seed region and can only identify one resting state network at a time. This simple correlation technique is able to reproduce the resting state network using subject data from one subject’s scan session as well as with 16 subjects.

Independent component analysis, the second technique, has established software programs that can be used to implement this technique. ICA can extract multiple components from a data set in a single analysis. The disadvantage is that the resting state networks it produces are all independent of each other, making the assumption that the spatial pattern of functional connectivity is the same across all the time points. ICA is successfully able to reproduce resting state connectivity patterns for both one subject and a 16 subject concatenated data set.

Using principal component analysis, the dimensionality of the data is compressed to find the directions in which the variance of the data is most significant. This method utilizes the same basic matrix math as ICA with a few important differences that will be outlined later in this text. Using this method, sometimes different functional connectivity patterns are identifiable but with a large amount of noise and variability.

To begin to investigate the dynamics of the functional connectivity, the correlation technique is used to compare the first and second halves of a scan session. Minor differences are discernable between the correlation results of the scan session halves. Further, a sliding window technique is implemented to study the correlation coefficients through different sizes of correlation windows throughout time. From this technique it is apparent that the correlation level with the seed region is not static throughout the scan length.

The last method introduced, a point processing method, is one of the more novel techniques because it does not require analysis of the continuous time points. Here, network information is extracted based on brief occurrences of high or low amplitude signals within a seed region. Because point processing utilizes less time points from the data, the statistical power of the results is lower. There are also larger variations in DMN patterns between subjects. In addition to boosted computational efficiency, the benefit of using a point-process method is that the patterns produced for different seed regions do not have to be independent of one another.

This work compares four unique methods of identifying functional connectivity patterns. ICA is a technique that is currently used by many scientists studying functional connectivity patterns. The PCA technique is not optimal for the level of noise and the distribution of the data sets. The correlation technique is simple and obtains good results, however a seed region is needed and the method assumes that the DMN regions is correlated throughout the entire scan. Looking at the more dynamic aspects of correlation changing patterns of correlation were evident. The last point-processing method produces a promising results of identifying functional connectivity networks using only low and high amplitude BOLD signals.