15 results for Statistical models

at Duke University


Relevance:

70.00%

Publisher:

Abstract:

This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies using high-throughput technologies that generate many, very large datasets and require increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in the ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms and software design can lead to vast speed-ups and, critically, enable statistical analyses that presently are not performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context. © 2010 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.
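The expensive step in Bayesian mixture fitting, evaluating every component density at every observation, is embarrassingly parallel, which is what makes it a natural GPU target. The sketch below is a hypothetical CPU-side illustration using vectorized NumPy in place of actual GPU kernels; the array layout mirrors how one (observation, component) pair would map to one GPU thread. It is not the authors' GPU implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """Posterior component probabilities for a Gaussian mixture.

    On a GPU, each (observation, component) log-density below would be
    computed by its own thread; here NumPy vectorization stands in for
    that data-parallel step (illustrative sketch only).
    """
    n, _ = X.shape
    k = len(weights)
    log_dens = np.empty((n, k))
    for j in range(k):                      # one kernel launch per component
        log_dens[:, j] = multivariate_normal.logpdf(X, means[j], covs[j])
    log_post = np.log(weights) + log_dens   # broadcast mixture weights
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilize
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Toy usage: 3-component mixture in 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
w = np.array([0.5, 0.3, 0.2])
mu = [np.zeros(2), np.ones(2), -np.ones(2)]
cov = [np.eye(2)] * 3
r = responsibilities(X, w, mu, cov)        # shape (1000, 3), rows sum to 1
```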

Relevance:

70.00%

Publisher:

Abstract:

X-ray crystallography is the predominant method for obtaining atomic-scale information about biological macromolecules. Despite the success of the technique, obtaining well-diffracting crystals still critically limits going from protein to structure. In practice, the crystallization process proceeds through knowledge-informed empiricism. Better physico-chemical understanding remains elusive because of the large number of variables involved; hence, little guidance is available to systematically identify solution conditions that promote crystallization. To help determine relationships between macromolecular properties and their crystallization propensity, we have trained statistical models on samples for 182 proteins supplied by the Northeast Structural Genomics consortium. Gaussian processes, which capture trends beyond the reach of linear statistical models, distinguish between two main physico-chemical mechanisms driving crystallization. One is characterized by low levels of side chain entropy and has been extensively reported in the literature. The other identifies specific electrostatic interactions not previously described in the crystallization context. Because evidence for two distinct mechanisms can be gleaned both from crystal contacts and from solution conditions leading to successful crystallization, the model offers future avenues for optimizing crystallization screens based on partial structural information. The availability of crystallization data coupled with structural outcomes analyzed through state-of-the-art statistical models may thus guide macromolecular crystallization toward a more rational basis.
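As a rough illustration of the kind of Gaussian-process model described, the sketch below fits a GP classifier to hypothetical per-protein features (a side-chain entropy proxy and an electrostatics score are invented placeholders, not the consortium's actual variables) and predicts crystallization propensity. It indicates the modeling pattern only, not the paper's covariates or kernel choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Hypothetical features: column 0 = side-chain entropy proxy,
# column 1 = surface electrostatics score (placeholders only).
rng = np.random.default_rng(1)
X = rng.normal(size=(182, 2))
y = (-0.8 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(scale=0.5, size=182) > 0).astype(int)

# An RBF kernel lets the GP capture nonlinear trends that a linear
# model would miss, which is the point made in the abstract.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessClassifier(kernel=kernel, random_state=0).fit(X, y)

new_protein = np.array([[-1.2, 0.4]])    # low entropy, favorable electrostatics
print(gp.predict_proba(new_protein))     # estimated crystallization propensity
```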

Relevance:

70.00%

Publisher:

Abstract:

Uncertainty quantification (UQ) is both an old and a new concept. The current novelty lies in the interactions and synthesis of mathematical models, computer experiments, statistics, field/real experiments, and probability theory, with a particular emphasis on large-scale simulations by computer models. The challenges come not only from the complexity of the scientific questions but also from the sheer size of the data. The focus of this thesis is to provide statistical models that scale to the massive data produced in computer experiments and real experiments through fast and robust statistical inference.

Chapter 2 provides a practical approach for simultaneously emulating/approximating a massive number of functions, with an application to hazard quantification for the Soufrière Hills volcano on the island of Montserrat. Chapter 3 discusses another problem with massive data, in which the number of observations of a function is large; an exact algorithm that is linear in time is developed for the problem of interpolating methylation levels. Chapters 4 and 5 both concern robust inference for these models. Chapter 4 proposes a new robustness criterion for parameter estimation, and several inference approaches are shown to satisfy it. Chapter 5 develops a new prior that satisfies additional criteria and is therefore proposed for use in practice.
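As a generic illustration of emulation, assuming nothing about the thesis's actual volcano or methylation code, the sketch below replaces an expensive simulator with a Gaussian-process surrogate fitted to a handful of runs; emulating a massive number of functions amounts to repeating, and sharing structure across, many such fits.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_simulator(x):
    """Stand-in for a costly computer model (hypothetical)."""
    return np.sin(3 * x) + 0.1 * x ** 2

# A few simulator runs at chosen design points ...
design = np.linspace(0, 3, 8).reshape(-1, 1)
runs = expensive_simulator(design).ravel()

# ... and a GP emulator that predicts the output anywhere else, with uncertainty.
emulator = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
emulator.fit(design, runs)
mean, sd = emulator.predict(np.array([[1.7]]), return_std=True)
```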

Relevance:

60.00%

Publisher:

Abstract:

In this paper, we propose generalized sampling approaches for measuring a multi-dimensional object using a compact compound-eye imaging system called thin observation module by bound optics (TOMBO). The paper presents the proposed system model, physical examples, and simulations that verify TOMBO imaging using generalized sampling. In the system, an object is modulated and multiplied by a weight distribution with physical coding, and the coded optical signal is integrated onto a detector array. A numerical estimation algorithm employing a sparsity constraint is used for object reconstruction.
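A minimal sketch of the reconstruction idea, not the TOMBO hardware model: the coded, integrated measurements are written as y = A x for a known sensing matrix A, and a sparsity-constrained estimate of x is recovered with iterative soft thresholding (ISTA). The matrix and signal here are synthetic placeholders.

```python
import numpy as np

def ista(A, y, lam=0.05, iters=500):
    """Iterative soft-thresholding for min ||y - Ax||^2/2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - y)
        x = x - step * grad
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)   # soft threshold
    return x

# Synthetic stand-in for the coded compound-eye measurements.
rng = np.random.default_rng(2)
n_pixels, n_measurements = 256, 96
A = rng.normal(size=(n_measurements, n_pixels)) / np.sqrt(n_measurements)
x_true = np.zeros(n_pixels)
x_true[rng.choice(n_pixels, 10, replace=False)] = 1.0
y = A @ x_true + 0.01 * rng.normal(size=n_measurements)
x_hat = ista(A, y)                             # sparse estimate of the object
```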

Relevance:

60.00%

Publisher:

Abstract:

Major depression in the elderly is associated with brain structural changes and vascular lesions. Changes in the subcortical regions of the limbic system have also been noted. Studies examining hippocampal volume differences in depression have shown variable results, possibly because any volume differences are secondary to local shape changes rather than differences in overall volume. Shape analysis offers the potential to detect such changes. The present study applied spherical harmonic (SPHARM) shape analysis to the left and right hippocampi of 61 elderly subjects with major depression and 43 non-depressed elderly subjects. Statistical models controlling for age, sex, and total cerebral volume showed a significant volume reduction in depressed subjects compared with controls in the left hippocampus (F(1,103) = 5.26; p = 0.0240) but not in the right hippocampus (F(1,103) = 0.41; p = 0.5213). Shape analysis showed significant differences in the mid-body of the left (but not the right) hippocampus between depressed subjects and controls. When the depressed group was dichotomized into those whose depression was remitted at the time of imaging and those who were unremitted, the shape comparison showed remitted subjects to be indistinguishable from controls (on both sides), while the unremitted subjects differed in the mid-body and the lateral side near the head. Hippocampal volume showed no difference between controls and remitted subjects, but unremitted subjects had significantly smaller left hippocampal volumes, with no significant group differences in the right hippocampus. These findings may provide support to other reports of neurogenic effects of antidepressants and their relation to successful treatment of depressive symptoms.
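The volume comparison amounts to a linear model of hippocampal volume on group with age, sex, and total cerebral volume as covariates, with the group effect tested by an F-test. A sketch with simulated data follows; the variable names and generated values are hypothetical, not the study's dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated stand-in for the 61 depressed + 43 control subjects.
rng = np.random.default_rng(3)
n = 104
df = pd.DataFrame({
    "group": np.repeat(["depressed", "control"], [61, 43]),
    "age": rng.normal(70, 6, n),
    "sex": rng.choice(["F", "M"], n),
    "tcv": rng.normal(1100, 90, n),          # total cerebral volume, cm^3
})
df["left_hipp_vol"] = (3.0 - 0.15 * (df.group == "depressed")
                       + 0.001 * df.tcv + rng.normal(0, 0.2, n))

# Volume ~ group, adjusting for age, sex, and total cerebral volume.
model = smf.ols("left_hipp_vol ~ group + age + sex + tcv", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))       # F-test for the group term
```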

Relevance:

60.00%

Publisher:

Abstract:

Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit the model using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arises in defining an algorithm with low communication, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm to address these issues. The algorithm applies feature selection in parallel to each subset using regularized regression or Bayesian variable selection methods, calculates the "median" feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments showing excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
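A simplified sketch of the message steps on simulated data; this version uses the lasso for the per-subset selection step and ordinary least squares for refitting, which is one of the choices the abstract allows, and all data and tuning values are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(4)
n, p, m = 6000, 50, 6                         # samples, features, subsets
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2, -1.5, 1, 0.8, -0.6]
y = X @ beta + rng.normal(size=n)

# 1) Partition the sample space and run variable selection on each subset.
idx_blocks = np.array_split(rng.permutation(n), m)
inclusion = np.zeros((m, p))
for b, idx in enumerate(idx_blocks):
    sel = LassoCV(cv=5).fit(X[idx], y[idx])
    inclusion[b] = np.abs(sel.coef_) > 1e-8

# 2) Median feature-inclusion index: keep features selected in at least half the subsets.
keep = np.where(np.median(inclusion, axis=0) >= 0.5)[0]

# 3) Refit on the selected features within each subset, then 4) average the estimates.
coefs = np.array([LinearRegression().fit(X[idx][:, keep], y[idx]).coef_
                  for idx in idx_blocks])
beta_hat = coefs.mean(axis=0)                 # aggregated coefficient estimate
```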

While sample space partitioning is useful in handling datasets with large sample sizes, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
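A heavily simplified sketch of the decorrelation idea behind DECO: premultiply X and y by an inverse square root of a ridge-regularized row Gram matrix before splitting the columns across workers. The precise scaling and ridge term follow the DECO paper; the version below, including the data, is only an illustration of the pattern.

```python
import numpy as np
from numpy.linalg import eigh
from sklearn.linear_model import LassoCV

def decorrelate(X, y, ridge=1.0):
    """Premultiply (X, y) by (X X^T / p + ridge*I)^(-1/2) -- simplified DECO step."""
    n, p = X.shape
    G = X @ X.T / p + ridge * np.eye(n)
    vals, vecs = eigh(G)
    F = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return F @ X, F @ y

rng = np.random.default_rng(5)
n, p, m = 300, 2000, 4                       # high-dimensional toy problem
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:4] = [3, -2, 1.5, 1]
y = X @ beta + rng.normal(size=n)

Xd, yd = decorrelate(X, y)
# Each worker fits its own column block of the decorrelated data independently.
selected = []
for block in np.array_split(np.arange(p), m):
    fit = LassoCV(cv=5).fit(Xd[:, block], yd)
    selected.extend(block[np.abs(fit.coef_) > 1e-8])
```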

For datasets with both large sample sizes and high dimensionality, I propose a new divide-and-conquer framework, DEME (DECO-message), that leverages both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each of a feasible size that can be stored and fitted on a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable.

Relevance:

40.00%

Publisher:

Abstract:

A framework for adaptive and non-adaptive statistical compressive sensing is developed, in which a statistical model replaces the standard sparsity model of classical compressive sensing. Within this framework we propose optimal task-specific sensing protocols designed jointly for classification and reconstruction. A two-step adaptive sensing paradigm is developed, where online sensing is applied to detect the signal class in the first step, followed by a reconstruction step adapted to the detected class and the observed samples. The approach is based on information theory, here tailored to Gaussian mixture models (GMMs), where an information-theoretic objective measuring the relationship between the sensed signals and a representation of the specific task of interest is maximized. Experimental results using synthetic signals, Landsat satellite attributes, and natural images of different sizes and with different noise levels show the improvements achieved using the proposed framework compared to more standard sensing protocols. The underlying formulation can be applied beyond GMMs, at the price of higher mathematical and computational complexity. © 1991-2012 IEEE.
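Under a GMM signal prior, reconstruction from linear measurements has a closed form: for y = Φx + n with Gaussian noise, each mixture component yields a Gaussian posterior for x, and the components are reweighted by how well they explain y. The sketch below shows that generic computation; it does not reproduce the paper's information-theoretic design of the sensing matrix, and all inputs are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_reconstruct(y, Phi, weights, means, covs, noise_var):
    """MMSE estimate of x from y = Phi @ x + noise, with x ~ Gaussian mixture."""
    post_w, cond_means = [], []
    for w, mu, S in zip(weights, means, covs):
        Sy = Phi @ S @ Phi.T + noise_var * np.eye(len(y))   # cov of y given component
        post_w.append(w * multivariate_normal.pdf(y, Phi @ mu, Sy))
        gain = S @ Phi.T @ np.linalg.inv(Sy)                # Wiener/Kalman-style gain
        cond_means.append(mu + gain @ (y - Phi @ mu))
    post_w = np.array(post_w)
    post_w /= post_w.sum()                                  # component posterior p(k | y)
    x_hat = sum(w_k * m_k for w_k, m_k in zip(post_w, cond_means))
    return x_hat, post_w

# The argmax of post_w also gives the detected signal class used in the
# two-step adaptive scheme (detect the class first, then reconstruct).
```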

Relevance:

30.00%

Publisher:

Abstract:

BACKGROUND: The rate of emergence of human pathogens is steadily increasing; most of these novel agents originate in wildlife. Bats, remarkably, are the natural reservoirs of many of the most pathogenic viruses in humans. There are two bat genome projects currently underway, a circumstance that promises to speed the discovery of host factors important in the coevolution of bats with their viruses. These genomes, however, are not yet assembled, and one of them will provide only low coverage, making the inference of most genes of immunological interest error-prone. Many more wildlife genome projects are underway and intend to provide only shallow coverage. RESULTS: We have developed a statistical method for the assembly of gene families from partial genomes. The method takes full advantage of the quality scores generated by base-calling software, incorporating them into a complete probabilistic error model, to overcome the limitations inherent in inferring gene family members from partial sequence information. We validated the method by inferring the human IFNA genes from the genome trace archives, and used it to infer 61 type-I interferon genes, and single type-II interferon genes, in the bats Pteropus vampyrus and Myotis lucifugus. We confirmed our inferences by direct cloning and sequencing of IFNA, IFNB, IFND, and IFNK in P. vampyrus, and by demonstrating transcription of some of the inferred genes by known interferon-inducing stimuli. CONCLUSION: The statistical trace assembler described here provides a reliable method for extracting information from the many available and forthcoming partial or shallow genome sequencing projects, thereby facilitating the study of a wider variety of organisms with ecological and biomedical significance to humans than would otherwise be possible.
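A toy illustration of the central ingredient, converting base-calling quality scores into a probabilistic error model and combining overlapping trace reads into a per-position posterior over bases. This is a simplification for intuition, with a uniform error model as an added assumption, not the authors' full assembler.

```python
import numpy as np

BASES = "ACGT"

def phred_to_error(q):
    """Phred quality Q -> probability that the called base is wrong."""
    return 10.0 ** (-q / 10.0)

def position_posterior(calls, quals):
    """Posterior over the true base at one position, given several trace reads.

    Each read contributes a likelihood: (1 - p_err) for its called base and
    p_err/3 for each other base (uniform error model, a simplifying assumption).
    """
    log_post = np.zeros(4)                       # uniform prior over A, C, G, T
    for base, q in zip(calls, quals):
        p_err = phred_to_error(q)
        lik = np.full(4, p_err / 3.0)
        lik[BASES.index(base)] = 1.0 - p_err
        log_post += np.log(lik)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Three overlapping traces call A, A, G with qualities 30, 12, 25 at one position.
print(dict(zip(BASES, position_posterior(["A", "A", "G"], [30, 12, 25]))))
```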

Relevance:

30.00%

Publisher:

Abstract:

BACKGROUND: Dropouts and missing data are nearly ubiquitous in obesity randomized controlled trials, threatening the validity and generalizability of conclusions. Herein, we meta-analytically evaluate the extent of missing data, the frequency with which various analytic methods are employed to accommodate dropouts, and the performance of multiple statistical methods. METHODOLOGY/PRINCIPAL FINDINGS: We searched PubMed and Cochrane databases (2000-2006) for articles published in English and manually searched bibliographic references. Articles on pharmaceutical randomized controlled trials with weight loss or weight gain prevention as major endpoints were included. Two authors independently reviewed each publication for inclusion; 121 articles met the inclusion criteria. Two authors independently extracted treatment, sample size, dropout rates, study duration, and the statistical method used to handle missing data from all articles and resolved disagreements by consensus. In the meta-analysis, dropout rates were substantial, with the survival (non-dropout) rates approximated by an exponential decay curve e^(-λt), where λ was estimated to be 0.0088 (95% bootstrap confidence interval: 0.0076 to 0.0100) and t represents time in weeks. The estimated dropout rate at 1 year was 37%. Most studies used last observation carried forward as the primary analytic method to handle missing data. We also obtained 12 raw obesity randomized controlled trial datasets for empirical analyses. Analyses of the raw randomized controlled trial data suggested that both mixed models and multiple imputation performed well, but that multiple imputation may be more robust when missing data are extensive. CONCLUSION/SIGNIFICANCE: Our analysis offers an equation for predicting dropout rates that is useful for future study planning. Our raw data analyses suggest that multiple imputation is better than other methods for handling missing data in obesity randomized controlled trials, followed closely by mixed models. We suggest these methods supplant last observation carried forward as the primary method of analysis.
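The reported one-year figure follows directly from the fitted decay curve: with survival S(t) = e^(-λt), λ = 0.0088 per week, and t = 52 weeks, S ≈ 0.63, i.e. roughly 37% dropout, matching the abstract. A quick check:

```python
import math

lam = 0.0088                        # fitted weekly dropout hazard
t = 52                              # weeks in one year
survival = math.exp(-lam * t)       # fraction still enrolled at one year
print(f"retained: {survival:.3f}, dropped out: {1 - survival:.3f}")
# retained: 0.633, dropped out: 0.367  (about 37%)
```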

Relevance:

30.00%

Publisher:

Abstract:

We discuss a general approach to dynamic sparsity modeling in multivariate time series analysis. Time-varying parameters are linked to latent processes that are thresholded to induce zero values adaptively, providing natural mechanisms for dynamic variable inclusion/selection. We discuss Bayesian model specification, analysis and prediction in dynamic regressions, time-varying vector autoregressions, and multivariate volatility models using latent thresholding. Application to a topical macroeconomic time series problem illustrates some of the benefits of the approach in terms of statistical and economic interpretations as well as improved predictions. Supplementary materials for this article are available online. © 2013 Copyright Taylor and Francis Group, LLC.
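The latent-threshold mechanism itself is simple: a time-varying coefficient is set to zero whenever an underlying latent process falls below a threshold, so variables drop in and out of the model over time. Below is a minimal simulation of that mechanism, using an AR(1) latent process and a fixed threshold as a generic illustration rather than the paper's full Bayesian analysis.

```python
import numpy as np

rng = np.random.default_rng(6)
T, phi, sigma, d = 200, 0.98, 0.1, 0.4   # length, AR(1) persistence, innovation sd, threshold

beta = np.zeros(T)                        # latent time-varying coefficient
for t in range(1, T):
    beta[t] = phi * beta[t - 1] + sigma * rng.normal()

# Latent thresholding: the effective coefficient is zero when |beta_t| < d,
# which switches the corresponding predictor in and out of the regression.
b_effective = np.where(np.abs(beta) >= d, beta, 0.0)
print("fraction of time the variable is included:", (b_effective != 0).mean())
```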

Relevance:

30.00%

Publisher:

Abstract:

The authors address the 4 main points in S. M. Monroe and S. Mineka's (2008) comment. First, the authors show that the Diagnostic and Statistical Manual of Mental Disorders (4th ed., text rev.; American Psychiatric Association, 2000) posttraumatic stress disorder (PTSD) diagnosis includes an etiology and that it is based on a theoretical model with a distinguished history in psychology and psychiatry. Two tenets of this theoretical model are that voluntary (strategic) recollections of the trauma are fragmented and incomplete while involuntary (spontaneous) recollections are vivid and persistent and yield privileged access to traumatic material. Second, the authors describe differences between their model and other cognitive models of PTSD. They argue that these other models share the same 2 tenets as the diagnosis and show that these 2 tenets are largely unsupported by empirical evidence. Third, the authors counter arguments about the strength of the evidence favoring the mnemonic model. Fourth, they show that concerns about the causal role of memory in PTSD are based on views of causality that are generally inappropriate for the explanation of PTSD in the social and biological sciences. © 2008 American Psychological Association.

Relevance:

30.00%

Publisher:

Abstract:

An abstract of a thesis devoted to using helix-coil models to study unfolded states.

Research on polypeptide unfolded states has received much more attention in the last decade or so than it has in the past. Unfolded states are thought to be implicated in various misfolding diseases and likely play crucial roles in protein folding equilibria and folding rates. Structural characterization of unfolded states has proven to be much more difficult than the now well-established practice of determining the structures of folded proteins. This is largely because many core assumptions underlying folded structure determination methods are invalid for unfolded states. This has led to a dearth of knowledge concerning the nature of unfolded state conformational distributions. While many aspects of unfolded state structure are not well known, there does exist a significant body of work stretching back half a century that has been focused on structural characterization of marginally stable polypeptide systems. This body of work represents an extensive collection of experimental data and biophysical models associated with describing helix-coil equilibria in polypeptide systems. Much of the work on unfolded states in the last decade has not been devoted specifically to the improvement of our understanding of helix-coil equilibria, which arguably is the most well characterized of the various conformational equilibria that likely contribute to unfolded state conformational distributions. This thesis seeks to provide a deeper investigation of helix-coil equilibria using modern statistical data analysis and biophysical modeling techniques. The studies contained within seek to provide deeper insights and new perspectives on what we presumably know very well about protein unfolded states.

Chapter 1 gives an overview of recent and historical work on studying protein unfolded states. The study of helix-coil equilibria is placed in the context of the general field of unfolded state research, and the basics of helix-coil models are introduced.

Chapter 2 introduces the newest incarnation of a sophisticated helix-coil model. State-of-the-art statistical techniques are employed to estimate the energies of the various physical interactions that influence helix-coil equilibria. A new Bayesian model selection approach is used to test many long-standing hypotheses concerning the physical nature of the helix-coil transition. Some assumptions made in previous models are shown to be invalid, and the new model exhibits greatly improved predictive performance relative to its predecessor.

Chapter 3 introduces a new statistical model that can be used to interpret amide exchange measurements. Because amide exchange can serve as a probe for residue-specific properties of helix-coil ensembles, the new model provides a novel and robust way to use these measurements to characterize helix-coil ensembles experimentally and to test the position-specific predictions of helix-coil models. The statistical model is shown to perform substantially better than the most commonly used method for interpreting amide exchange data. The estimates obtained from amide exchange measurements on an example helical peptide also show remarkable consistency with the predictions of the helix-coil model.

Chapter 4 studies helix-coil ensembles through the enumeration of helix-coil configurations. Aside from providing new insights into helix-coil ensembles, this chapter also introduces a new method by which helix-coil models can be extended to calculate new types of observables. Future work on this approach could allow helix-coil models to be applied in domains that were previously inaccessible and reserved for the other types of unfolded state models introduced in Chapter 1.
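For readers unfamiliar with helix-coil models, the classic Zimm-Bragg formulation conveys the basic machinery the thesis builds on: per-residue helix/coil states with nucleation (σ) and propagation (s) weights, a transfer-matrix partition function, and mean helicity obtained from its derivative with respect to ln s. The sketch below is that textbook model with arbitrary parameter values, not the thesis's more elaborate, Bayesian-fitted version.

```python
import numpy as np

def zimm_bragg_helicity(n_res, s, sigma):
    """Mean fractional helicity of a homopolypeptide under the Zimm-Bragg model."""
    def log_partition(s_val):
        # States (helix, coil): weight s for h after h, sigma*s for h after c, 1 for c.
        M = np.array([[s_val, 1.0],
                      [sigma * s_val, 1.0]])
        v = np.array([sigma * s_val, 1.0])     # first residue: nucleation if helical
        for _ in range(n_res - 1):
            v = v @ M
        return np.log(v.sum())

    # theta = (1/N) d ln Z / d ln s, evaluated by a central finite difference.
    h = 1e-6
    dlnZ = (log_partition(s * np.exp(h)) - log_partition(s * np.exp(-h))) / (2 * h)
    return dlnZ / n_res

print(zimm_bragg_helicity(n_res=20, s=1.3, sigma=0.003))   # short peptide, modest helicity
```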

Relevance:

30.00%

Publisher:

Abstract:

The problem of social diffusion has animated sociological thinking on topics ranging from the spread of an idea, an innovation, or a disease to the foundations of collective behavior and political polarization. While network diffusion has been a productive metaphor, the reality of diffusion processes is often muddier. Ideas and innovations diffuse differently from diseases, but, with a few exceptions, the diffusion of ideas and innovations has been modeled under the same assumptions as the diffusion of disease. In this dissertation, I develop two new diffusion models for "socially meaningful" contagions that address two of the most significant problems with current diffusion models: (1) that contagions can only spread along observed ties, and (2) that contagions do not change as they spread between people. I augment insights from these statistical and simulation models with an analysis of an empirical case of diffusion: the use of enterprise collaboration software in a large technology company. I focus the empirical study on when people abandon innovations, a crucial and understudied aspect of the diffusion of innovations. Using timestamped posts, I analyze in a high degree of detail when people abandon the software.

To address the first problem, I suggest a latent space diffusion model. Rather than treating ties as stable conduits for information, the latent space diffusion model treats ties as random draws from an underlying social space and simulates diffusion over the social space. To address the second problem, I suggest a diffusion model with schemas. Rather than treating information as though it spreads without changing, the schema diffusion model allows people to modify the information they receive to fit an underlying mental model of the information before they pass it to others. Theoretically, the social space model integrates actor ties and attributes simultaneously in a single social plane, while incorporating schemas into diffusion processes gives an explicit form to the reciprocal influences that cognition and social environment have on each other. Practically, the latent space diffusion model produces statistically consistent diffusion estimates where using the network alone does not, and the diffusion-with-schemas model shows that introducing some cognitive processing into diffusion changes the rate and ultimate distribution of the spreading information. Combining the latent space models with a schema notion for actors improves our models for social diffusion both theoretically and practically.
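As a generic illustration of diffusion over a latent social space, with invented positions and parameters rather than the dissertation's fitted model, the sketch below places actors in a two-dimensional latent space and lets adoption spread with probability decaying in latent distance rather than only along observed ties.

```python
import numpy as np

rng = np.random.default_rng(7)
n_actors, n_steps, rate = 200, 30, 1.5

pos = rng.normal(size=(n_actors, 2))          # latent social positions
adopted = np.zeros(n_actors, dtype=bool)
adopted[rng.choice(n_actors, 3, replace=False)] = True   # seed adopters

dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
p_transmit = np.exp(-rate * dist)             # closer in latent space -> more influence

for _ in range(n_steps):
    # Each actor is exposed to every current adopter independently.
    exposure = 1.0 - np.prod(1.0 - 0.05 * p_transmit[:, adopted], axis=1)
    adopted |= rng.random(n_actors) < exposure

print("final adoption share:", adopted.mean())
```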

The empirical case study focuses on how the changing value of an innovation, introduced by the innovation's network externalities, influences when people abandon it. In it, I find that people are least likely to abandon an innovation when other people in their neighborhood currently use the software as well. The effect is particularly pronounced for supervisors' current use and for the number of supervisory team members who currently use the software. This case study not only points to an important process in the diffusion of innovation but also suggests a new approach to collecting and analyzing data on organizational processes: computerized collaboration systems.

Relevance:

30.00%

Publisher:

Abstract:

The work presented in this dissertation is focused on applying engineering methods to develop and explore probabilistic survival models for the prediction of decompression sickness in US Navy divers. Mathematical modeling, computational model development, and numerical optimization techniques were employed to formulate and evaluate the predictive quality of models fitted to empirical data. In Chapters 1 and 2 we present general background information relevant to the development of probabilistic models applied to predicting the incidence of decompression sickness. The remainder of the dissertation introduces techniques developed in an effort to improve the predictive quality of probabilistic decompression models and to reduce the difficulty of model parameter optimization.

The first project explored seventeen variations of the hazard function using a well-perfused parallel compartment model. Models were parametrically optimized using the maximum likelihood technique. Model performance was evaluated using both classical statistical methods and model selection techniques based on information theory. Optimized model parameters were overall similar to previously published values. Results indicated a preference for a novel hazard function definition that included both ambient pressure scaling and individually fitted compartment exponent scaling terms.
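The common backbone of such models is a survival formulation in which the instantaneous hazard depends on modeled tissue supersaturation, and the probability of decompression sickness over a dive profile is one minus the exponential of the integrated hazard. The sketch below is a generic single-compartment version of that idea, with exponential gas kinetics, a simple linear supersaturation hazard, and made-up parameter values; it is not one of the dissertation's fitted hazard variants.

```python
import numpy as np

def p_dcs(ambient_pressure, dt, tau=20.0, gain=1e-4):
    """P(DCS) for a dive profile via a one-compartment hazard model (illustrative).

    ambient_pressure : ambient pressure over time (atm), sampled every dt minutes
    tau              : compartment gas-exchange time constant (min)
    gain             : hazard scale parameter (would be fitted by maximum likelihood)
    """
    p_tissue = ambient_pressure[0]
    integrated_hazard = 0.0
    for p_amb in ambient_pressure:
        # Exponential tissue gas kinetics toward ambient pressure.
        p_tissue += (p_amb - p_tissue) * dt / tau
        # Hazard accrues only while the tissue is supersaturated.
        hazard = gain * max(0.0, p_tissue - p_amb) / p_amb
        integrated_hazard += hazard * dt
    return 1.0 - np.exp(-integrated_hazard)

# Square dive profile: 60 min at ~2.8 atm (about 18 m of seawater), then surface.
profile = np.concatenate([np.full(60, 2.8), np.full(120, 1.0)])
print(p_dcs(profile, dt=1.0))
```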

We developed ten pharmacokinetic compartmental models that included explicit delay mechanics to determine whether predictive quality could be improved through the inclusion of material transfer lags. A fitted discrete delay parameter augmented the inflow to the compartment systems from the environment. Based on the observation that, for many of our models, symptoms are often reported after risk accumulation begins, we hypothesized that including delays might improve the correlation between model predictions and observed data. Model selection techniques identified two models as having the best overall performance, but comparison to the best-performing model without delay, as well as model selection that included our best no-delay pharmacokinetic model, indicated that the delay mechanism was not statistically justified and did not substantially improve model predictions.

Our final investigation explored parameter bounding techniques to identify parameter regions in which statistical model failure will not occur. Statistical model failure occurs when a model predicts zero probability of a diver experiencing decompression sickness for an exposure that is known to produce symptoms. Using a metric related to the instantaneous risk, we successfully identify regions where model failure will not occur and locate the boundaries of those regions using a root-bounding technique. Several models are used to demonstrate the techniques, which may be employed to reduce the difficulty of model optimization in future investigations.