7 results for Bayesian Normal Mixture Model, Data Binning, Data Analysis

at Duke University


Relevance: 100.00%

Abstract:

This dissertation addresses advances in three related areas: state-space modeling, sequential Bayesian learning, and decision analysis, with attention to the statistical challenges of scalability and associated dynamic sparsity. The key theme tying the three areas together is Bayesian model emulation: solving challenging analytical and computational problems using creative model emulators. This idea drives theoretical and applied advances in non-linear, non-Gaussian state-space modeling, dynamic sparsity, decision analysis, and statistical computation, across the linked contexts of multivariate time series and dynamic network studies. Examples and applications in financial time series and portfolio analysis, macroeconomics, and internet studies from computational advertising demonstrate the utility of the core methodological innovations.

Chapter 1 summarizes the three areas and the key idea of emulation in each. Chapter 2 discusses the sequential analysis of latent threshold models, using emulating models that allow analytical filtering to enhance the efficiency of posterior sampling. Chapter 3 examines the emulator model in decision analysis, or the synthetic model, which is equivalent to the loss function in the original minimization problem, and shows its performance in the context of sequential portfolio optimization. Chapter 4 describes a method for modeling streaming count data observed on a large network that relies on emulating the whole, dependent network model by independent, conjugate sub-models customized to each set of flows. Chapter 5 reviews these advances and offers concluding remarks.
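The Chapter 4 idea of emulating a dependent network model by independent, conjugate sub-models can be illustrated with the standard gamma-Poisson update applied per flow. The sketch below is a minimal illustration, not the dissertation's construction; the `FlowEmulator` class and its discount factor for tracking drifting rates are assumptions made for the example.

```python
class FlowEmulator:
    """Independent gamma-Poisson sub-model for one network flow.

    Conjugacy: if rate ~ Gamma(a, b) and count y ~ Poisson(rate),
    the posterior is Gamma(a + y, b + 1), so each update is
    closed-form. The discount factor that inflates prior
    uncertainty between steps is an illustrative device for
    tracking drifting rates, not the dissertation's exact scheme.
    """

    def __init__(self, a=1.0, b=1.0, discount=0.95):
        self.a, self.b, self.discount = a, b, discount

    def update(self, y):
        # Discount the prior, then apply the conjugate Poisson update.
        self.a = self.discount * self.a + y
        self.b = self.discount * self.b + 1.0

    def rate_mean(self):
        # Posterior mean of the flow's Poisson rate.
        return self.a / self.b

# The dependent network model is emulated by running one independent,
# conjugate sub-model per observed flow.
flow = FlowEmulator()
for y in [3, 5, 4, 7]:          # streaming counts on one flow
    flow.update(y)
print(flow.rate_mean())
```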

Relevance: 100.00%

Abstract:

Multi-output Gaussian processes provide a convenient framework for multi-task problems. An illustrative and motivating example of a multi-task problem is multi-region electrophysiological time-series data, where experimentalists are interested in both power and phase coherence between channels. Recently, the spectral mixture (SM) kernel was proposed to model the spectral density of a single task in a Gaussian process framework. This work develops a novel covariance kernel for multiple outputs, called the cross-spectral mixture (CSM) kernel. This new, flexible kernel represents both the power and phase relationships between multiple observation channels. The expressive capabilities of the CSM kernel are demonstrated through the implementation of (1) a Bayesian hidden Markov model, where the emission distribution is a multi-output Gaussian process with a CSM covariance kernel, and (2) a Gaussian process factor analysis model, where factor scores represent the utilization of cross-spectral neural circuits. Results are presented for measured multi-region electrophysiological data.
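For reference, the single-output SM kernel that the CSM kernel generalizes has a simple closed form: a weighted sum of Gaussian-windowed cosines, one per spectral peak. A minimal sketch of that baseline follows; the cross-spectral extension with explicit phase parameters is the dissertation's contribution and is not reproduced here.

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, means, variances):
    """Single-output spectral mixture (SM) kernel k(tau).

    k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q),
    where each component is a Gaussian in the spectral domain with
    mean mu_q (peak frequency) and variance v_q (bandwidth).
    """
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return k

# Example: two spectral peaks, e.g. theta- and gamma-band power.
lags = np.linspace(0.0, 1.0, 200)
k = spectral_mixture_kernel(lags, weights=[1.0, 0.5],
                            means=[6.0, 40.0], variances=[1.0, 4.0])
```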

Relevance: 100.00%

Abstract:

Bayesian nonparametric models, such as the Gaussian process and the Dirichlet process, have been extensively applied to target kinematics modeling in applications including environmental monitoring, traffic planning, endangered species tracking, dynamic scene analysis, autonomous robot navigation, and human motion modeling. As these successful applications show, Bayesian nonparametric models can adapt their complexity to the data as necessary and are resistant to overfitting and underfitting. However, most existing work assumes that the sensor measurements used to learn the Bayesian nonparametric target kinematics models are obtained a priori, or that the target kinematics can be measured by the sensor at any given time throughout the task. Little work has been done on controlling a sensor with a bounded field of view to obtain the measurements of mobile targets that are most informative for reducing the uncertainty of the Bayesian nonparametric models. To present a systematic sensor planning approach to learning Bayesian nonparametric models, the Gaussian process target kinematics model is introduced first; it is capable of describing time-invariant spatial phenomena such as ocean currents, temperature distributions, and wind velocity fields. The Dirichlet process-Gaussian process target kinematics model is subsequently discussed for modeling mixtures of mobile targets, such as pedestrian motion patterns.

Novel information theoretic functions are developed for these Bayesian nonparametric target kinematics models to represent the expected utility of measurements as a function of sensor control inputs and random environmental variables. A Gaussian process expected Kullback-Leibler (KL) divergence is developed as the expectation, with respect to future measurements, of the KL divergence between the current (prior) and posterior Gaussian process target kinematics models. This approach is then extended to a new information value function for estimating target kinematics described by a Dirichlet process-Gaussian process mixture model. A theorem is proved showing that the novel information theoretic functions are bounded. Based on this theorem, efficient estimators of the functions are designed and proved to be unbiased, with the variance of the resulting approximation error decreasing linearly as the number of samples increases. Optimizing the novel information theoretic functions under sensor dynamics constraints is shown to be NP-hard, and a cumulative lower bound is proposed that reduces the computational complexity to polynomial time.
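On any finite set of evaluation points, the GP prior and posterior reduce to multivariate Gaussians, so the KL divergence inside this expectation has a closed form; the expectation over future measurements is then what the sampling-based estimators approximate. A minimal sketch of the closed-form Gaussian KL term:

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) ).

    KL = 0.5 * [ tr(cov1^-1 cov0) + (mu1 - mu0)^T cov1^-1 (mu1 - mu0)
                 - k + ln(det cov1 / det cov0) ]
    """
    k = mu0.shape[0]
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    diff = mu1 - mu0
    maha = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha - k + logdet1 - logdet0)

# Example: GP marginals on two evaluation points, before and after data.
mu0, cov0 = np.zeros(2), np.eye(2)
mu1, cov1 = np.array([0.5, -0.2]), 0.5 * np.eye(2)
print(gaussian_kl(mu0, cov0, mu1, cov1))
```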

Three sensor planning algorithms are developed according to the assumptions on the target kinematics and the sensor dynamics. For problems where the control space of the sensor is discrete, a greedy algorithm is proposed; its efficiency is demonstrated in a numerical experiment with ocean current data obtained from moored buoys. A sweep line algorithm is developed for applications where the sensor control space is continuous and unconstrained; synthetic simulations as well as physical experiments with ground robots and a surveillance camera are conducted to evaluate its performance. Moreover, a lexicographic algorithm is designed, based on the cumulative lower bound of the novel information theoretic functions, for the scenario where the sensor dynamics are constrained; it is examined in numerical experiments with real data collected from indoor pedestrians by a commercial pan-tilt camera. Results from both the numerical simulations and the physical experiments show that the three sensor planning algorithms proposed in this dissertation, based on the novel information theoretic functions, are superior at learning the target kinematics with little or no prior knowledge.
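For the discrete control space case, the greedy strategy scores each admissible control by its estimated information value and commits to the best one at every step. A minimal sketch, where `info_value` and `apply_control` are hypothetical interfaces standing in for the dissertation's estimators and the measurement/model update:

```python
def greedy_plan(controls, model, horizon, info_value, apply_control):
    """Greedy sensor planning over a discrete control space.

    At each step, every admissible control input is scored by its
    estimated information value and the best one is committed to.
    `info_value(u, model)` and `apply_control(u, model)` are
    hypothetical interfaces for the information theoretic estimator
    and for taking a measurement and updating the kinematics model.
    """
    plan = []
    for _ in range(horizon):
        best = max(controls, key=lambda u: info_value(u, model))
        model = apply_control(best, model)   # measure, then update model
        plan.append(best)
    return plan
```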

Relevance: 100.00%

Abstract:

Continuous variables are among the major data types collected by survey organizations. They can be incomplete, requiring data collectors to fill in the missing values, or they can contain sensitive information that needs protection from re-identification. One approach to protecting continuous microdata is to aggregate the values into cell totals defined by combinations of features. In this thesis, I present novel methods of multiple imputation (MI) that can be applied to impute missing values and to synthesize confidential values for continuous and magnitude data.

The first method is for limiting the disclosure risk of the continuous microdata whose marginal sums are fixed. The motivation for developing such a method comes from the magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution so that the marginals of the synthetic data are guaranteed to sum to the original totals. At the same time, I present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. The illustration on a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable.
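A standard fact underlying total-preserving synthesis of this kind: independent Poisson counts conditioned on their sum are multinomial with probabilities proportional to the rates, so synthetic cells can be drawn to match a fixed marginal total exactly. A minimal sketch (the Poisson-mixture and disclosure-risk machinery of the thesis is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_with_fixed_total(rates, total):
    """Draw non-negative integer cells summing exactly to `total`.

    If y_i ~ Poisson(rate_i) independently, then conditional on
    sum(y) = total, y is Multinomial(total, rates / sum(rates)).
    """
    p = np.asarray(rates, dtype=float)
    return rng.multinomial(total, p / p.sum())

# Example: synthesize 5 establishment cells preserving a total of 120.
print(synthesize_with_fixed_total([10.0, 30.0, 5.0, 50.0, 25.0], 120))
```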

The second method is for releasing synthetic continuous microdata by a nonstandard MI method. Traditionally, MI fits a model on the confidential values and then generates multiple synthetic datasets from this model; the disclosure risk of this approach tends to be high, especially when the original data contain extreme values. I present a nonstandard MI approach conditioned on protective intervals, whose basic idea is to estimate the model parameters from these intervals rather than from the confidential values themselves. The encouraging results of simple simulation studies suggest the potential of this new approach for limiting the posterior disclosure risk.
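One standard way to estimate parameters from intervals alone is an interval-censored likelihood, in which each confidential value contributes only the probability mass of its protective interval; whether this matches the thesis's exact construction is an assumption. A minimal sketch for a normal model:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_from_intervals(lo, hi):
    """Fit a normal model using only protective intervals [lo, hi].

    Interval-censored log-likelihood: each confidential value
    contributes log( Phi((hi-mu)/sd) - Phi((lo-mu)/sd) ), so the
    parameters are estimated without touching the values themselves.
    """
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)

    def nll(theta):
        mu, log_sd = theta
        sd = np.exp(log_sd)
        p = norm.cdf(hi, mu, sd) - norm.cdf(lo, mu, sd)
        return -np.sum(np.log(np.clip(p, 1e-300, None)))

    res = minimize(nll, x0=[np.mean((lo + hi) / 2.0), 0.0])
    return res.x[0], np.exp(res.x[1])   # (mu_hat, sd_hat)
```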

The third method is for imputing missing values in continuous and categorical variables. It extends a hierarchically coupled mixture model with local dependence, but separates the variables into non-focused (e.g., almost fully observed) and focused (e.g., largely missing) groups. The sub-model structure for the focused variables is more complex than that for the non-focused ones, the two groups' cluster indicators are linked by tensor factorization, and the focused continuous variables depend locally on non-focused values. The model's properties suggest that moving strongly associated non-focused variables to the focused side can help improve estimation accuracy, which is examined in several simulation studies. The method is applied to data from the American Community Survey.

Relevance: 100.00%

Abstract:

Thermodynamic stability measurements on proteins and protein-ligand complexes can offer insights not only into the fundamental properties of protein folding reactions and protein functions, but also into the development of protein-directed therapeutic agents to combat disease. Conventional calorimetric or spectroscopic approaches for measuring protein stability typically require large amounts of purified protein, a requirement that has precluded their use in proteomic applications. Stability of Proteins from Rates of Oxidation (SPROX) is a recently developed mass spectrometry-based approach for proteome-wide thermodynamic stability analysis. Since the proteomic coverage of SPROX is fundamentally limited by the detection of methionine-containing peptides, this dissertation investigated the use of tryptophan-containing peptides. A new SPROX-like protocol was developed that measures protein folding free energies using the denaturant dependence of the rate at which globally protected tryptophan and methionine residues are modified with dimethyl(2-hydroxy-5-nitrobenzyl)sulfonium bromide and hydrogen peroxide, respectively. This so-called Hybrid protocol was applied to proteins in yeast and MCF-7 cell lysates and achieved a ~50% increase in proteomic coverage compared to probing only methionine-containing peptides.

Subsequently, the Hybrid protocol was successfully utilized to identify and quantify both known and novel protein-ligand interactions in cell lysates. The ligands under study included the well-known Hsp90 inhibitor geldanamycin and the less well-understood omeprazole sulfide that inhibits liver-stage malaria. In addition to protein-small molecule interactions, protein-protein interactions involving Puf6 were investigated using the SPROX technique in comparative thermodynamic analyses performed on wild-type and Puf6-deletion yeast strains. A total of 39 proteins were detected as Puf6 targets, 36 of which were previously not known to interact with Puf6.

Finally, to facilitate the SPROX/Hybrid data analysis process and minimize human error, a Bayesian algorithm was developed for transition midpoint assignment. In summary, the work in this dissertation expanded the scope of SPROX and evaluated the use of SPROX/Hybrid protocols for characterizing protein-ligand interactions in complex biological mixtures.
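The transition midpoint assignment step can be illustrated with a simple grid posterior over the midpoint of a sigmoidal denaturation curve; the logistic transition shape, Gaussian noise model, and flat prior below are illustrative assumptions, not the dissertation's algorithm.

```python
import numpy as np

def midpoint_posterior(denaturant, signal, midpoints, slope=2.0, sigma=0.1):
    """Grid posterior over a sigmoidal transition midpoint.

    Model (illustrative): folded fraction = logistic((C_mid - [D]) * slope),
    observed signal = folded fraction + Gaussian noise (sd = sigma),
    with a flat prior over the midpoint grid.
    """
    d = np.asarray(denaturant, float)[None, :]   # (1, n_points)
    m = np.asarray(midpoints, float)[:, None]    # (n_grid, 1)
    pred = 1.0 / (1.0 + np.exp(-(m - d) * slope))
    loglik = -0.5 * np.sum((np.asarray(signal) - pred) ** 2, axis=1) / sigma**2
    post = np.exp(loglik - loglik.max())         # stabilize before normalizing
    return post / post.sum()

# Example: noisy unfolding curve with true midpoint near 2.0 M denaturant.
rng = np.random.default_rng(1)
d = np.linspace(0.0, 4.0, 9)
y = 1.0 / (1.0 + np.exp(-(2.0 - d) * 2.0)) + 0.05 * rng.standard_normal(9)
grid = np.linspace(0.5, 3.5, 61)
post = midpoint_posterior(d, y, grid)
print(grid[np.argmax(post)])                     # posterior-mode midpoint
```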

Relevance: 100.00%

Abstract:

An abstract of a thesis devoted to using helix-coil models to study unfolded states.

Research on polypeptide unfolded states has received much more attention in the last decade or so than in the past. Unfolded states are thought to be implicated in various misfolding diseases and likely play crucial roles in protein folding equilibria and folding rates. Structural characterization of unfolded states has proven to be much more difficult than the now well-established practice of determining the structures of folded proteins, largely because many core assumptions underlying folded-structure determination methods are invalid for unfolded states. This has led to a dearth of knowledge concerning the nature of unfolded state conformational distributions. While many aspects of unfolded state structure are not well known, there exists a significant body of work stretching back half a century focused on the structural characterization of marginally stable polypeptide systems. This body of work represents an extensive collection of experimental data and biophysical models associated with describing helix-coil equilibria in polypeptide systems. Much of the work on unfolded states in the last decade has not been devoted specifically to improving our understanding of helix-coil equilibria, which is arguably the best characterized of the various conformational equilibria that likely contribute to unfolded state conformational distributions. This thesis seeks to provide a deeper investigation of helix-coil equilibria using modern statistical data analysis and biophysical modeling techniques. The studies contained within seek to provide deeper insights and new perspectives on what we presumably know very well about protein unfolded states.

Chapter 1 gives an overview of recent and historical work on protein unfolded states. The study of helix-coil equilibria is placed in the context of the general field of unfolded state research, and the basics of helix-coil models are introduced.

Chapter 2 introduces the newest incarnation of a sophisticated helix-coil model. State-of-the-art statistical techniques are employed to estimate the energies of the various physical interactions that influence helix-coil equilibria. A new Bayesian model selection approach is used to test many long-standing hypotheses concerning the physical nature of the helix-coil transition. Some assumptions made in previous models are shown to be invalid, and the new model exhibits greatly improved predictive performance relative to its predecessor.
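For orientation, the classic Zimm-Bragg helix-coil model computes the configurational partition function with a 2x2 transfer matrix over helix/coil residue states; the thesis's model is considerably more sophisticated, so this is only the textbook baseline (sigma is the nucleation parameter, s the propagation parameter):

```python
import numpy as np

def zb_partition(n, s, sigma):
    """Zimm-Bragg partition function via the 2x2 transfer matrix.

    Per-residue weights: s for a helical residue following a helix,
    sigma * s for a helical residue following a coil (nucleation),
    and 1 for a coil residue.
    """
    M = np.array([[s, 1.0],
                  [sigma * s, 1.0]])
    v = np.array([0.0, 1.0])            # chain begins "after a coil"
    return v @ np.linalg.matrix_power(M, n) @ np.ones(2)

def mean_helicity(n, s, sigma, ds=1e-6):
    # Fractional helicity theta = (1/n) d ln Z / d ln s,
    # evaluated here by central finite difference.
    dlnZ = (np.log(zb_partition(n, s + ds, sigma))
            - np.log(zb_partition(n, s - ds, sigma)))
    return s * dlnZ / (2.0 * ds) / n

# Example: a 20-residue peptide near the transition (s = 1) with a
# typical nucleation penalty sigma = 1e-3.
print(mean_helicity(20, s=1.0, sigma=1e-3))
```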

Chapter 3 introduces a new statistical model for interpreting amide exchange measurements. Because amide exchange can serve as a probe of residue-specific properties of helix-coil ensembles, the new model provides a novel and robust method for using such measurements to characterize helix-coil ensembles experimentally and to test the position-specific predictions of helix-coil models. The statistical model is shown to perform considerably better than the most commonly used method for interpreting amide exchange data, and the estimates obtained from amide exchange measurements on an example helical peptide show remarkable consistency with the predictions of the helix-coil model.

Chapter 4 studies helix-coil ensembles through the enumeration of helix-coil configurations. Aside from providing new insights into helix-coil ensembles, this chapter introduces a new method by which helix-coil models can be extended to calculate new types of observables. Future work on this approach could allow helix-coil models to move into application domains that were previously inaccessible and reserved for the other types of unfolded state models introduced in Chapter 1.

Relevance: 100.00%

Abstract:

The six-layered neuronal structure of the cerebral cortex is the foundation for human mental abilities. In the developing cerebral cortex, neural stem cells proliferate and differentiate into intermediate progenitors and neurons, a process known as embryonic neurogenesis. Disrupted embryonic neurogenesis is the root cause of a wide range of neurodevelopmental disorders, including microcephaly and intellectual disabilities. Multiple layers of regulatory networks have been identified and extensively studied over the past decades to understand this complex but crucial process of brain development. In recent years, post-transcriptional RNA regulation through RNA binding proteins has emerged as a critical regulatory nexus in embryonic neurogenesis.

The exon junction complex (EJC) is a highly conserved RNA binding complex composed of four core proteins: Magoh, Rbm8a, Eif4a3, and Casc3. The EJC plays a major role in regulating RNA splicing, nuclear export, subcellular localization, translation, and nonsense-mediated RNA decay. Human genetic studies have associated individual EJC components with various developmental disorders. We showed previously that haploinsufficiency of Magoh causes microcephaly and disrupted neural stem cell differentiation in the mouse. However, it is unclear whether other EJC core components are also required for embryonic neurogenesis, and, more importantly, the molecular mechanism through which the EJC regulates embryonic neurogenesis remains largely unknown.

Here, we demonstrated with genetically modified mouse models that both Rbm8a and Eif4a3 are required for proper embryonic neurogenesis and the formation of a normal brain. Using transcriptomic and proteomic analyses, we showed that the EJC post-transcriptionally regulates genes involved in the p53 pathway, splicing and translation regulation, and ribosomal biogenesis. This is the first in vivo evidence suggesting that the etiology of EJC-associated neurodevelopmental diseases can be ribosomopathies. We also showed that, unlike depletion of other EJC core components, depletion of Casc3 led only to mild neurogenesis defects in the mouse model. However, our data suggested that Casc3 is required for embryo viability and developmental progression, and is potentially a regulator of cardiac development. Together, the data presented in this thesis suggest that the EJC is crucial for embryonic neurogenesis and that the EJC and its peripheral factors may regulate development in a tissue-specific manner.