8 results for Bayesian model selection

at Duke University


Relevance: 100.00%

Publisher:

Abstract:

An abstract of a thesis devoted to using helix-coil models to study unfolded states.

Research on polypeptide unfolded states has received far more attention in the last decade or so than in the past. Unfolded states are thought to be implicated in various misfolding diseases and likely play crucial roles in protein folding equilibria and folding rates. Structural characterization of unfolded states has proven much more difficult than the now well-established practice of determining the structures of folded proteins, largely because many core assumptions underlying folded-structure determination methods are invalid for unfolded states. This has led to a dearth of knowledge concerning the nature of unfolded state conformational distributions. While many aspects of unfolded state structure are not well known, there is a significant body of work, stretching back half a century, focused on the structural characterization of marginally stable polypeptide systems. This body of work represents an extensive collection of experimental data and biophysical models describing helix-coil equilibria in polypeptide systems. Much of the work on unfolded states in the last decade has not been devoted specifically to improving our understanding of helix-coil equilibria, which is arguably the best characterized of the various conformational equilibria that likely contribute to unfolded state conformational distributions. This thesis provides a deeper investigation of helix-coil equilibria using modern statistical data analysis and biophysical modeling techniques. The studies contained within seek to provide deeper insights and new perspectives on what we presumably know very well about protein unfolded states.

Chapter 1 gives an overview of recent and historical work on protein unfolded states. The study of helix-coil equilibria is placed in the context of the broader field of unfolded state research, and the basics of helix-coil models are introduced.

Chapter 2 introduces the newest incarnation of a sophisticated helix-coil model. State-of-the-art statistical techniques are employed to estimate the energies of the various physical interactions that influence helix-coil equilibria. A new Bayesian model selection approach is used to test many long-standing hypotheses concerning the physical nature of the helix-coil transition. Some assumptions made in previous models are shown to be invalid, and the new model exhibits greatly improved predictive performance relative to its predecessor.
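The abstract does not specify the model's mathematical form; as background, the classic Zimm-Bragg transfer-matrix formulation is the standard starting point for helix-coil partition functions. A minimal sketch (the function names and the finite-difference helicity estimate are illustrative, not the thesis code):

```python
import numpy as np

def zimm_bragg_partition(n, s, sigma):
    """Partition function of an n-residue chain under the classic
    Zimm-Bragg helix-coil model, via the standard 2x2 transfer matrix.
    s is the helix propagation weight, sigma the nucleation penalty."""
    # Row/column order: (coil, helix). Starting a helix from coil
    # costs sigma * s; extending an existing helix costs s.
    M = np.array([[1.0, sigma * s],
                  [1.0, s]])
    v0 = np.array([1.0, 0.0])  # chain starts in the coil state
    return v0 @ np.linalg.matrix_power(M, n) @ np.ones(2)

def mean_helicity(n, s, sigma, ds=1e-6):
    """Average helical fraction from (1/n) d(ln Z)/d(ln s), by finite difference."""
    z1 = zimm_bragg_partition(n, s * (1.0 + ds), sigma)
    z0 = zimm_bragg_partition(n, s, sigma)
    return (np.log(z1) - np.log(z0)) / np.log(1.0 + ds) / n
```

With sigma = 1 the residues decouple and the partition function reduces to (1 + s)^n, a useful sanity check; small sigma makes the transition cooperative.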

Chapter 3 introduces a new statistical model for interpreting amide exchange measurements. Because amide exchange can serve as a probe of residue-specific properties of helix-coil ensembles, the new model provides a robust method for using these measurements to characterize helix-coil ensembles experimentally and to test the position-specific predictions of helix-coil models. The statistical model is shown to perform substantially better than the most commonly used method for interpreting amide exchange data. Estimates obtained from amide exchange measurements on an example helical peptide also show remarkable consistency with the predictions of the helix-coil model.

Chapter 4 studies helix-coil ensembles through the enumeration of helix-coil configurations. Aside from providing new insights into helix-coil ensembles, this chapter introduces a new method by which helix-coil models can be extended to calculate new types of observables. Future work on this approach could allow helix-coil models to move into application domains that were previously inaccessible, reserved for the other types of unfolded state models introduced in Chapter 1.

Relevance: 100.00%

Publisher:

Abstract:

Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge lies in defining an algorithm with low communication, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm to address these issues. The algorithm applies feature selection in parallel to each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments showing excellent performance in feature selection, estimation, prediction, and computation time relative to standard competitors.
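The aggregation scheme described above can be sketched as follows. The function name and the OLS-with-threshold selector are illustrative stand-ins; the thesis uses regularized regression or Bayesian variable selection for the per-subset selection step.

```python
import numpy as np

def message_estimate(X, y, m, tau=0.5):
    """Sketch of median-selection subset aggregation: split samples into
    m subsets, select features on each, majority-vote via the median
    inclusion index, refit per subset, and average the estimates."""
    n, p = X.shape
    subsets = np.array_split(np.arange(n), m)
    # Step 1: feature selection in parallel on each subset
    # (stand-in selector: OLS with a hard threshold on |beta_j|).
    incl = np.zeros((m, p))
    for k, rows in enumerate(subsets):
        coef, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
        incl[k] = np.abs(coef) > tau
    # Step 2: 'median' feature inclusion index (a majority vote).
    selected = np.median(incl, axis=0) >= 0.5
    beta = np.zeros(p)
    if not selected.any():
        return beta
    # Step 3: re-estimate on each subset with selected features, then average.
    for rows in subsets:
        coef, *_ = np.linalg.lstsq(X[rows][:, selected], y[rows], rcond=None)
        beta[selected] += coef / m
    return beta
```

Only the p-length inclusion vectors and coefficient vectors cross machine boundaries, which is the source of the low communication cost.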

While sample space partitioning is useful for datasets with large sample sizes, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient at reducing the model dimension. In this thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments illustrate the performance of the new framework.
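The decorrelation idea can be sketched as follows; the ridge constant and the exact scaling used here are illustrative assumptions, not the published DECO specification.

```python
import numpy as np

def deco_decorrelate(X, y, r=0.1):
    """Sketch of a decorrelation step: premultiply X and y by
    (X X^T / p + r I)^{-1/2}, which makes the columns of the transformed
    design roughly orthogonal, so feature subsets can then be fitted
    independently by distributed workers."""
    n, p = X.shape
    G = X @ X.T / p + r * np.eye(n)
    w, V = np.linalg.eigh(G)       # G is symmetric positive definite
    F = (V * w ** -0.5) @ V.T      # G^{-1/2} via eigendecomposition
    return F @ X, F @ y

def allocate_features(p, m):
    """Partition feature indices across m workers."""
    return np.array_split(np.arange(p), m)
```

The eigendecomposition is of the n x n Gram matrix, so the cost of decorrelation scales with the sample size rather than the (high) feature dimension.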

For datasets with both large sample sizes and high dimensionality, I propose a new divide-and-conquer framework, DEME (DECO-message), that leverages both the DECO and message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each small enough to be stored and fitted on a single machine in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable.

Relevance: 90.00%

Publisher:

Abstract:

This work addresses advances in three related areas: state-space modeling, sequential Bayesian learning, and decision analysis, together with the statistical challenges of scalability and associated dynamic sparsity. The key theme tying the three areas together is Bayesian model emulation: solving challenging analytical and computational problems using creative model emulators. This idea drives theoretical and applied advances in non-linear, non-Gaussian state-space modeling, dynamic sparsity, decision analysis, and statistical computation, across the linked contexts of multivariate time series and dynamic network studies. Examples and applications in financial time series and portfolio analysis, macroeconomics, and internet studies from computational advertising demonstrate the utility of the core methodological innovations.

Chapter 1 summarizes the three areas and the key idea of emulation within them. Chapter 2 discusses the sequential analysis of latent threshold models using emulating models that allow analytical filtering to enhance the efficiency of posterior sampling. Chapter 3 examines the emulator model in decision analysis (the synthetic model), which is equivalent to the loss function in the original minimization problem, and shows its performance in the context of sequential portfolio optimization. Chapter 4 describes a method for modeling streaming count data observed on a large network that relies on emulating the whole, dependent network model by independent, conjugate sub-models customized to each set of flows. Chapter 5 reviews these advances and offers concluding remarks.
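As a concrete illustration of the analytical filtering that emulator models make available, here is a minimal forward filter for a local-level dynamic linear model. This is a generic textbook building block, not the latent threshold machinery of Chapter 2.

```python
import numpy as np

def kalman_filter(y, v, w, m0=0.0, c0=1e6):
    """Forward filter for a local-level dynamic linear model:
    y_t = theta_t + N(0, v);  theta_t = theta_{t-1} + N(0, w).
    Returns the filtered posterior means and variances of theta_t."""
    m, c = m0, c0
    means, variances = [], []
    for yt in y:
        r = c + w              # prior (predictive) variance at time t
        k = r / (r + v)        # Kalman gain
        m = m + k * (yt - m)   # posterior mean update
        c = (1.0 - k) * r      # posterior variance update
        means.append(m)
        variances.append(c)
    return np.array(means), np.array(variances)
```

Conjugate updates like these run in closed form at each time step, which is what makes filtering-based emulators cheap enough to embed inside posterior sampling loops.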

Relevance: 90.00%

Publisher:

Abstract:

Testing for differences within data sets is an important issue across many applications. Our work is primarily motivated by the analysis of microbiome composition, which has become increasingly relevant with the rise of DNA sequencing. We first review classical frequentist tests commonly used for such problems. We then propose a Bayesian Dirichlet-multinomial framework for modeling metagenomic data and testing for underlying differences between samples. The parametric Dirichlet-multinomial model uses an intuitive hierarchical structure that allows flexibility in characterizing both the within-group variation and the cross-group difference, and it provides very interpretable parameters. A computational method for evaluating the marginal likelihoods under the null and alternative hypotheses is also given. Through simulations, we show that our Bayesian model performs competitively against frequentist counterparts. We illustrate the method by analyzing metagenomic data from the Human Microbiome Project.
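The Dirichlet-multinomial marginal likelihood is available in closed form, which is what makes this kind of hypothesis comparison tractable. A minimal sketch (the shared concentration vector passed to both hypotheses is an illustrative choice, not the thesis's prior):

```python
from math import lgamma

def dm_log_marginal(counts, alpha):
    """Closed-form log marginal likelihood of a count vector under a
    Dirichlet-multinomial with concentration vector alpha. The
    multinomial coefficient is omitted: it cancels in the Bayes factor
    below, since both hypotheses observe the same count vectors."""
    n, a0 = sum(counts), sum(alpha)
    out = lgamma(a0) - lgamma(n + a0)
    for x, a in zip(counts, alpha):
        out += lgamma(x + a) - lgamma(a)
    return out

def log_bayes_factor(c1, c2, alpha):
    """log BF for H1 (each group has its own composition) against
    H0 (groups share one composition, so their counts pool)."""
    h0 = dm_log_marginal([a + b for a, b in zip(c1, c2)], alpha)
    h1 = dm_log_marginal(c1, alpha) + dm_log_marginal(c2, alpha)
    return h1 - h0
```

A positive value favors a real cross-group difference; a negative value favors a shared composition.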

Relevance: 90.00%

Publisher:

Abstract:

Mixtures of Zellner's g-priors have been studied extensively in linear models and have been shown to have numerous desirable properties for Bayesian variable selection and model averaging. Several extensions of g-priors to Generalized Linear Models (GLMs) have been proposed in the literature; however, the choice of the prior distribution on g and the resulting properties for inference have received considerably less attention. In this paper, we extend mixtures of g-priors to GLMs by assigning the truncated Compound Confluent Hypergeometric (tCCH) distribution to 1/(1+g) and illustrate how this prior distribution encompasses several special cases of mixtures of g-priors in the literature, such as the Hyper-g, truncated Gamma, Beta-prime, and Robust priors. Under an integrated Laplace approximation to the likelihood, the posterior distribution of 1/(1+g) is in turn a tCCH distribution, and approximate marginal likelihoods are thus available analytically. We discuss the local geometric properties of the g-prior in GLMs and show that specific choices of the hyper-parameters satisfy the various desiderata for model selection proposed by Bayarri et al., such as asymptotic model selection consistency, information consistency, intrinsic consistency, and measurement invariance. We also illustrate inference using these priors and contrast them with others in the literature via simulation and real examples.
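For orientation, the fixed-g Bayes factor in the Gaussian linear model has a well-known closed form, shown below; the paper's GLM treatment replaces this with an integrated Laplace approximation and a tCCH prior on 1/(1+g) rather than a fixed g.

```python
from math import log1p

def log_bf_g(r2, n, p, g):
    """Log Bayes factor of a Gaussian linear model with p predictors and
    coefficient of determination r2 against the intercept-only null,
    under Zellner's g-prior with fixed g. Mixtures of g-priors integrate
    this quantity over a prior on g instead of fixing it."""
    return (0.5 * (n - 1 - p) * log1p(g)
            - 0.5 * (n - 1) * log1p(g * (1.0 - r2)))
```

The dimension-penalty term (n - 1 - p) log(1 + g) shows why the choice of g, and hence its prior, governs model-selection behavior such as consistency as n grows.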

Relevance: 80.00%

Publisher:

Abstract:

At least since the seminal works of Jacob Mincer, labor economists have sought to understand how students make higher education investment decisions. Mincer’s original work seeks to understand how students decide how much education to accrue; subsequent work by various authors seeks to understand how students choose where to attend college, what field to major in, and whether to drop out of college.

Broadly speaking, this rich sub-field of literature contributes to society in two ways: First, it provides a better understanding of important social behaviors. Second, it helps policymakers anticipate the responses of students when evaluating various policy reforms.

While research on the higher education investment decisions of students has had an enormous impact on our understanding of society and has shaped countless education policies, students are only one interested party in the higher education landscape. In the jargon of economists, students represent only the 'demand side' of higher education: customers choosing from a set of available alternatives. Opposite students are instructors and administrators, who represent the 'supply side' of higher education: those who decide which options are available to students.

For similar reasons, it is also important to understand how individuals on the supply side of education make decisions: first, it provides a deeper understanding of the behavior of important social institutions; second, it helps policymakers anticipate the responses of instructors and administrators when evaluating various reforms. However, while there is a substantial literature on decisions made on the demand side of education, far less attention has been paid to decisions on the supply side.

This dissertation uses empirical evidence to better understand how instructors and administrators make decisions and the implications of these decisions for students.

In the first chapter, I use data from Duke University and a Bayesian model of correlated learning to measure the signal quality of grades across academic fields. The correlated feature of the model allows grades in one academic field to signal ability in all other fields, allowing me to measure both 'own category' signal quality and 'spillover' signal quality. Estimates reveal a clear division between information-rich Science, Engineering, and Economics grades and less informative Humanities and Social Science grades. In many specifications, information spillovers are so powerful that precise Science, Engineering, and Economics grades are more informative about Humanities and Social Science abilities than Humanities and Social Science grades are. This suggests that students who take engineering courses during their freshman year make more informed specialization decisions later in college.

In the second chapter, I use data from the University of Central Arkansas to understand how universities decide which courses to offer and how much to spend on instructors for those courses. Course offerings and instructor characteristics directly affect the courses students choose and the value they receive from these choices. This chapter recovers the university preferences over these student outcomes that best explain observed course offerings and instructors. This allows me to assess whether university incentives are aligned with students', to determine which alternative university choices students would prefer, and to illustrate how a revenue-neutral tax/subsidy policy can induce a university to make these student-best decisions.

In the third chapter, co-authored with Thomas Ahn, Peter Arcidiacono, and Amy Hopson, we use data from the University of Kentucky to understand how instructors choose grading policies. We estimate an equilibrium model in which instructors choose grading policies and students choose courses and study effort given those policies. In this model, instructors set both a grading intercept and a return on ability and effort, which builds a rich link between the grading policy decisions of instructors and the course choices of students. We use estimates of this model to infer which preference parameters best explain why instructors chose the estimated grading policies. To illustrate the importance of these supply side decisions, we show that changing grading policies can substantially reduce the gender gap in STEM enrollment.

Relevance: 80.00%

Publisher:

Abstract:

The work presented in this dissertation is focused on applying engineering methods to develop and explore probabilistic survival models for the prediction of decompression sickness in US Navy divers. Mathematical modeling, computational model development, and numerical optimization techniques were employed to formulate and evaluate the predictive quality of models fitted to empirical data. Chapters 1 and 2 present general background information relevant to the development of probabilistic models for predicting the incidence of decompression sickness. The remainder of the dissertation introduces techniques developed to improve the predictive quality of probabilistic decompression models and to reduce the difficulty of model parameter optimization.

The first project explored seventeen variations of the hazard function using a well-perfused parallel compartment model. Models were parametrically optimized using the maximum likelihood technique, and model performance was evaluated using both classical statistical methods and model selection techniques based on information theory. Optimized model parameters were overall similar to those of previously published models. Results favored a novel hazard function definition that included both ambient pressure scaling and individually fitted compartment exponent scaling terms.

We developed ten pharmacokinetic compartmental models that include explicit delay mechanics to determine whether predictive quality could be improved by including material transfer lags. A fitted discrete delay parameter augmented the inflow to the compartment systems from the environment. Based on the observation that, for many of our models, symptoms are often reported after risk accumulation begins, we hypothesized that including delays might improve the correlation between model predictions and observed data. Model selection techniques identified two models as having the best overall performance, but comparisons with the best-performing no-delay pharmacokinetic model indicated that the delay mechanism was not statistically justified and did not substantially improve model predictions.

Our final investigation explored parameter bounding techniques to identify parameter regions in which statistical model failure cannot occur. Statistical model failure occurs when a model predicts zero probability of a diver experiencing decompression sickness for an exposure that is known to produce symptoms. Using a metric related to the instantaneous risk, we identify regions where model failure will not occur and locate the boundaries of those regions using a root-bounding technique. Several models are used to demonstrate the techniques, which may be employed to reduce the difficulty of model optimization in future investigations.
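The likelihood structure underlying these survival models, and the zero-probability failure mode just described, can be sketched as follows (a generic illustration, not the dissertation's specific hazard functions):

```python
import numpy as np

def p_dcs(risk_integral):
    """Survival-model link common to probabilistic DCS models:
    P(DCS) = 1 - exp(-accumulated instantaneous risk over the exposure)."""
    return 1.0 - np.exp(-np.asarray(risk_integral, dtype=float))

def neg_log_likelihood(risk_integrals, outcomes):
    """Negative log likelihood over binary dive outcomes (1 = symptoms),
    the objective minimized during maximum likelihood optimization.
    A model that assigns zero accumulated risk to a symptomatic exposure
    drives log(p) toward -inf; the small guard term only masks this
    numerically, which is why bounding the parameter region matters."""
    p = p_dcs(risk_integrals)
    y = np.asarray(outcomes, dtype=float)
    eps = 1e-12
    return -np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
```
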

Relevance: 40.00%

Publisher:

Abstract:

The dynamics of a population undergoing selection is a central topic in evolutionary biology. This question is particularly intriguing when selective forces act in opposing directions at two population scales. For example, a fast-replicating virus strain outcompetes slower-replicating strains at the within-host scale. However, if the fast-replicating strain causes host morbidity and is less frequently transmitted, it can be outcompeted by slower-replicating strains at the between-host scale. Here we consider a stochastic ball-and-urn process that models this type of phenomenon. We prove the weak convergence of this process under two natural scalings. The first scaling leads to a deterministic nonlinear integro-partial differential equation on the interval $[0,1]$ depending on a single parameter, $\lambda$. We show that the fixed points of this differential equation are Beta distributions and that their stability depends on $\lambda$ and on the behavior of the initial data around $1$. The second scaling leads to a measure-valued Fleming-Viot process, an infinite-dimensional stochastic process frequently associated with population genetics.
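The opposing-scales phenomenon can be illustrated with a toy resampling simulation; the fitness form, parameters, and function name below are illustrative assumptions and not the ball-and-urn process analyzed in the paper.

```python
import numpy as np

def simulate(n_hosts=500, lam=2.0, gens=200, seed=0):
    """Toy two-scale selection model. Each host carries a strain
    replication rate x in (0, 1). Within-host growth favors large x
    (weight x); between-host transmission favors small x with strength
    lam (weight (1 - x)**lam). Hosts are resampled each generation with
    probability proportional to the combined weight."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.05, 0.95, n_hosts)
    for _ in range(gens):
        w = x * (1.0 - x) ** lam       # combined fitness at both scales
        x = rng.choice(x, size=n_hosts, replace=True, p=w / w.sum())
    return x
```

With these weights the population concentrates at intermediate replication rates: neither the fastest nor the slowest strains dominate, mirroring the tension described above.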