9 results for Data modeling

at Duke University


Relevance: 40.00%

Abstract:

A class of multi-process models is developed for collections of time-indexed count data. Autocorrelation in counts is achieved with dynamic models for the natural parameter of the binomial distribution. In addition to modeling binomial time series, the framework includes dynamic models for multinomial and Poisson time series. Markov chain Monte Carlo (MCMC) and Pólya-Gamma data augmentation (Polson et al., 2013) are critical for fitting multi-process models of counts. To facilitate computation when the counts are high, a Gaussian approximation to the Pólya-Gamma random variable is developed.
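The abstract does not say how the Gaussian approximation is constructed. One plausible reading, offered here only as a sketch, is moment matching: for integer b, a PG(b, c) variable is a sum of b independent PG(1, c) variables, so for large counts a normal distribution with the same mean and variance is a natural stand-in. The function names below are hypothetical; the mean and variance formulas are the standard Pólya-Gamma moments.

```python
import math
import random


def pg_moments(b, c):
    """Mean and variance of a Polya-Gamma PG(b, c) random variable."""
    if abs(c) < 1e-4:
        # Limits as c -> 0; avoids the removable 0/0 singularity.
        return b / 4.0, b / 24.0
    mean = b / (2.0 * c) * math.tanh(c / 2.0)
    # Note: math.sinh overflows for |c| above roughly 710.
    var = b / (4.0 * c ** 3) * (math.sinh(c) - c) / math.cosh(c / 2.0) ** 2
    return mean, var


def pg_gaussian_draw(b, c, rng=random):
    """Approximate PG(b, c) draw via a moment-matched normal (assumed scheme)."""
    mean, var = pg_moments(b, c)
    return rng.gauss(mean, math.sqrt(var))
```

Because the normal approximation can produce negative values when b is small, a sketch like this would only be reasonable in the high-count regime the abstract describes.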

Three applied analyses are presented to explore the utility and versatility of the framework. The first analysis develops a model for complex dynamic behavior of themes in collections of text documents. Documents are modeled as a “bag of words”, and the multinomial distribution is used to characterize uncertainty in the vocabulary terms appearing in each document. State-space models for the natural parameters of the multinomial distribution induce autocorrelation in themes and their proportional representation in the corpus over time.

The second analysis develops a dynamic mixed membership model for Poisson counts. The model is applied to a collection of time series that record neuron-level firing patterns in rhesus monkeys. The monkey is exposed to two sounds simultaneously, and Gaussian processes are used to smoothly model the time-varying rate at which the neuron’s firing pattern fluctuates between features associated with each sound in isolation.

The third analysis presents a switching dynamic generalized linear model for the time-varying home run totals of professional baseball players. The model endows each player with an age-specific latent natural ability class and a performance-enhancing drug (PED) use indicator. As players age, they randomly transition through a sequence of ability classes in a manner consistent with traditional aging patterns. When a player's performance deviates significantly from the expected aging pattern, he is identified as a player whose performance is consistent with PED use.

All three models provide a mechanism for sharing information across related series locally in time. The models are fit with variations on the Pólya-Gamma Gibbs sampler, MCMC convergence diagnostics are developed, and reproducible inference is emphasized throughout the dissertation.

Relevance: 30.00%

Abstract:

In perifusion cell cultures, the culture medium flows continuously through a chamber containing immobilized cells and the effluent is collected at the end. In our main applications, gonadotropin releasing hormone (GnRH) or oxytocin is introduced into the chamber as the input. They stimulate the cells to secrete luteinizing hormone (LH), which is collected in the effluent. To relate the effluent LH concentration to the cellular processes producing it, we develop and analyze a mathematical model consisting of coupled partial differential equations describing the intracellular signaling and the movement of substances in the cell chamber. We analyze three different data sets and give cellular mechanisms that explain the data. Our model indicates that two negative feedback loops, one fast and one slow, are needed to explain the data and we give their biological bases. We demonstrate that different LH outcomes in oxytocin and GnRH stimulations might originate from different receptor dynamics. We analyze the model to understand the influence of parameters, like the rate of the medium flow or the fraction collection time, on the experimental outcomes. We investigate how the rate of binding and dissociation of the input hormone to and from its receptor influence its movement down the chamber. Finally, we formulate and analyze simpler models that allow us to predict the distortion of a square pulse due to hormone-receptor interactions and to estimate parameters using perifusion data. We show that in the limit of high binding and dissociation the square pulse moves as a diffusing Gaussian and in this limit the biological parameters can be estimated.

Relevance: 30.00%

Abstract:

Advances in three related areas (state-space modeling, sequential Bayesian learning, and decision analysis) are addressed, with attention to the statistical challenges of scalability and associated dynamic sparsity. The key theme that ties the three areas together is Bayesian model emulation: solving challenging analysis and computational problems using creative model emulators. This idea defines theoretical and applied advances in non-linear, non-Gaussian state-space modeling, dynamic sparsity, decision analysis and statistical computation, across linked contexts of multivariate time series and dynamic network studies. Examples and applications in financial time series and portfolio analysis, macroeconomics, and internet studies from computational advertising demonstrate the utility of the core methodological innovations.

Chapter 1 summarizes the three areas/problems and the key idea of emulation in those areas. Chapter 2 discusses the sequential analysis of latent threshold models using emulating models that allow analytical filtering to enhance the efficiency of posterior sampling. Chapter 3 examines the emulator model in decision analysis, or the synthetic model, that is equivalent to the loss function in the original minimization problem, and shows its performance in the context of sequential portfolio optimization. Chapter 4 describes a method for modeling streaming count data observed on a large network; it relies on emulating the whole, dependent network model with independent, conjugate sub-models customized to each set of flows. Chapter 5 reviews these advances and offers concluding remarks.

Relevance: 30.00%

Abstract:

The full-scale base-isolated structure studied in this dissertation is the only base-isolated building in the South Island of New Zealand. It sustained hundreds of earthquake ground motions from September 2010 well into 2012. Several large earthquake responses were recorded in December 2011 by NEES@UCLA and by a GeoNet recording station near Christchurch Women's Hospital. The primary focus of this dissertation is to advance the state of the art in methods for evaluating the performance of seismic-isolated structures and the effects of soil-structure interaction, by developing new data processing methodologies to overcome current limitations and by implementing advanced numerical modeling in OpenSees for direct analysis of soil-structure interaction.

This dissertation presents a novel method for recovering force-displacement relations within the isolators of building structures with unknown nonlinearities from sparse seismic-response measurements of floor accelerations. The method requires only direct matrix calculations (factorizations and multiplications); no iterative trial-and-error methods are required. The method requires a mass matrix, or at least an estimate of the floor masses. A stiffness matrix may be used, but is not necessary. Essentially, the method operates on a matrix of incomplete measurements of floor accelerations. In the special case of complete floor measurements of systems with linear dynamics, real modes, and equal floor masses, the principal components of this matrix are the modal responses. In the more general case of partial measurements and nonlinear dynamics, the method extracts a number of linearly-dependent components from Hankel matrices of measured horizontal response accelerations, assembles these components row-wise and extracts principal components from the singular value decomposition of this large matrix of linearly-dependent components. These principal components are then interpolated between floors in a way that minimizes the curvature energy of the interpolation. This interpolation step can make use of a reduced-order stiffness matrix, a backward difference matrix or a central difference matrix. The measured and interpolated floor acceleration components at all floors are then assembled and multiplied by a mass matrix. The recovered in-service force-displacement relations are then incorporated into the OpenSees soil structure interaction model.
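The final step described above, multiplying the assembled floor accelerations by a mass matrix, can be illustrated with a minimal sketch. It assumes a lumped (diagonal) mass matrix, absolute horizontal accelerations in a single direction for every floor above the isolation plane, and the sign convention that the isolator restoring force equals minus the total inertial force of the superstructure. The function name and data layout are hypothetical, and the Hankel-matrix and interpolation steps are not reproduced here.

```python
def isolator_force(floor_masses, floor_accels):
    """Restoring force in the isolation layer at each time step.

    floor_masses[i]    : lumped mass of floor i (assumed diagonal mass matrix)
    floor_accels[i][k] : absolute horizontal acceleration of floor i at step k
    """
    if len(floor_masses) != len(floor_accels):
        raise ValueError("one acceleration record is needed per floor mass")
    n_steps = len(floor_accels[0])
    forces = []
    for k in range(n_steps):
        # Newton's second law for the superstructure as a free body:
        # sum(m_i * a_i) is the net force the isolators apply to it.
        inertial = sum(m * a[k] for m, a in zip(floor_masses, floor_accels))
        forces.append(-inertial)
    return forces


# Two floors, two time steps (made-up numbers, consistent units assumed).
forces = isolator_force([2.0, 3.0], [[1.0, 0.0], [1.0, -2.0]])
```

Pairing each force value with the measured displacement across the isolators at the same time step would then trace out a force-displacement relation.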

Numerical simulations of soil-structure interaction involving non-uniform soil behavior are conducted following the development of the complete soil-structure interaction model of Christchurch Women's Hospital in OpenSees. In these 2D OpenSees models, the superstructure is modeled as two-dimensional frames in the short-span and long-span directions, respectively. The lead rubber bearings are modeled as elastomeric bearing (Bouc-Wen) elements. The soil underlying the concrete raft foundation is modeled with linear elastic plane strain quadrilateral elements. The non-uniformity of the soil profile is incorporated by extracting and interpolating the shear wave velocity profile from the Canterbury Geotechnical Database. The validity of the complete two-dimensional soil-structure interaction OpenSees model for the hospital is checked by comparing the peak floor responses and the force-displacement relations within the isolation system obtained from OpenSees simulations with the recorded measurements. General explanations and implications, supported by displacement drifts, floor acceleration and displacement responses, and force-displacement relations, are described to address the effects of soil-structure interaction.

Relevance: 30.00%

Abstract:

Purpose: To build a model that will predict the survival time for patients that were treated with stereotactic radiosurgery for brain metastases using support vector machine (SVM) regression.

Methods and Materials: This study utilized data from 481 patients, who were randomly and equally divided into training and validation datasets. The SVM model used a Gaussian RBF function, along with various parameters, such as the size of the epsilon-insensitive region and the cost parameter (C), that are used to control the amount of error tolerated by the model. The predictor variables for the SVM model consisted of the actual survival time of the patient, the number of brain metastases, the graded prognostic assessment (GPA) and Karnofsky Performance Scale (KPS) scores, prescription dose, and the largest planning target volume (PTV). The response of the model is the survival time of the patient. The resulting survival time predictions were analyzed against the actual survival times by single-parameter classification and two-parameter classification. The predicted mean survival times within each classification were compared with the actual values to obtain the confidence interval associated with the model’s predictions. In addition to visualizing the data on plots using the means and error bars, the correlation coefficients between the actual and predicted means of the survival times were calculated during each step of the classification.

Results: The number of metastases and KPS scores were consistently shown to be the strongest predictors in the single-parameter classification, and were subsequently used as first classifiers in the two-parameter classification. When the survival times were analyzed with the number of metastases as the first classifier, the best correlation was obtained for patients with 3 metastases, while patients with 4 or 5 metastases had significantly worse results. When the KPS score was used as the first classifier, patients with a KPS score of 60 and 90/100 had similarly strong correlation results. These mixed results are likely due to the limited data available for patients with more than 3 metastases or KPS scores of 60 or less.

Conclusions: The number of metastases and the KPS score were both shown to be strong predictors of patient survival time. The model was less accurate for patients with more metastases and certain KPS scores due to the lack of training data.
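The Methods paragraph above names three ingredients of SVM regression: a Gaussian RBF function, an epsilon-insensitive region, and a cost parameter C. The sketch below shows the textbook definitions of the first two and how C weights the total penalty. The function names and the `gamma` width parameter are illustrative assumptions; the abstract does not report the values or the software the study used.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian RBF similarity exp(-gamma * ||x - z||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_penalty(y_true, y_pred, epsilon, cost):
    """Data-fit part of the SVR objective: C times the summed tube violations."""
    return cost * sum(
        epsilon_insensitive_loss(t, p, epsilon) for t, p in zip(y_true, y_pred)
    )


# Made-up survival times in months: only the second prediction leaves the tube.
penalty = svr_penalty([10.0, 10.0], [10.5, 13.0], epsilon=1.0, cost=2.0)
```

A larger epsilon tolerates more prediction error without penalty, while a larger C penalizes each violation more heavily, which is the trade-off the Methods paragraph alludes to.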

Relevance: 30.00%

Abstract:

We propose a novel method to harmonize diffusion MRI data acquired from multiple sites and scanners, which is imperative for joint analysis of the data to significantly increase sample size and statistical power of neuroimaging studies. Our method incorporates the following main novelties: i) we take into account the scanner-dependent spatial variability of the diffusion signal in different parts of the brain; ii) our method is independent of compartmental modeling of diffusion (e.g., tensor, and intra/extra cellular compartments) and the acquired signal itself is corrected for scanner-related differences; and iii) inter-subject variability as measured by the coefficient of variation is maintained at each site. We represent the signal in a basis of spherical harmonics and compute several rotation-invariant spherical harmonic features to estimate a region- and tissue-specific linear mapping between the signal from different sites (and scanners). We validate our method on diffusion data acquired from seven different sites (including two GE, three Philips, and two Siemens scanners) on a group of age-matched healthy subjects. Since the extracted rotation-invariant spherical harmonic features depend on the accuracy of the brain parcellation provided by Freesurfer, we propose a feature-based refinement of the original parcellation such that it better characterizes the anatomy and provides robust linear mappings to harmonize the dMRI data. We demonstrate the efficacy of our method by statistically comparing diffusion measures such as fractional anisotropy, mean diffusivity and generalized fractional anisotropy across multiple sites before and after data harmonization. We also show results using tract-based spatial statistics before and after harmonization for independent validation of the proposed methodology. Our experimental results demonstrate that, for nearly identical acquisition protocols across sites, scanner-specific differences can be accurately removed using the proposed method.
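One common rotation-invariant spherical harmonic feature is the energy in each harmonic order, the sum of squared coefficients of that order, which is unchanged when the signal is rotated. The sketch below computes it under two assumptions that the abstract does not state: only even orders are used (usual for antipodally symmetric diffusion signals), and the real coefficients are stored order by order with 2l + 1 entries per order. Whether these are exactly the features the authors compute is not confirmed by the abstract.

```python
def sh_order_energies(coeffs, max_order):
    """Per-order energy of real SH coefficients for even orders 0, 2, ..., max_order.

    coeffs is assumed to be laid out order by order, 2*l + 1 values per order l.
    """
    expected = (max_order + 1) * (max_order + 2) // 2
    if max_order % 2 != 0 or len(coeffs) != expected:
        raise ValueError("coefficient count does not match an even max_order")
    energies = {}
    start = 0
    for order in range(0, max_order + 1, 2):
        width = 2 * order + 1
        block = coeffs[start:start + width]
        energies[order] = sum(c * c for c in block)
        start += width
    return energies


# Order 0 has one coefficient, order 2 has five (values are made up).
features = sh_order_energies([2.0, 1.0, 1.0, 1.0, 1.0, 1.0], max_order=2)
```

Features of this kind, averaged over a brain region at each site, are the sort of quantity from which a region-specific mapping between sites could be estimated.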

Relevance: 30.00%

Abstract:

Surveys can collect important data that inform policy decisions and drive social science research. Large government surveys collect information from the U.S. population on a wide range of topics, including demographics, education, employment, and lifestyle. Analysis of survey data presents unique challenges. In particular, one needs to account for missing data, for complex sampling designs, and for measurement error. Conceptually, a survey organization could spend lots of resources getting high-quality responses from a simple random sample, resulting in survey data that are easy to analyze. However, this scenario often is not realistic. To address these practical issues, survey organizations can leverage the information available from other sources of data. For example, in longitudinal studies that suffer from attrition, they can use the information from refreshment samples to correct for potential attrition bias. They can use information from known marginal distributions or survey design to improve inferences. They can use information from gold standard sources to correct for measurement error.

This thesis presents novel approaches to combining information from multiple sources that address the three problems described above.

The first method addresses nonignorable unit nonresponse and attrition in a panel survey with a refreshment sample. Panel surveys typically suffer from attrition, which can lead to biased inference when basing analysis only on cases that complete all waves of the panel. Unfortunately, the panel data alone cannot inform the extent of the bias due to attrition, so analysts must make strong and untestable assumptions about the missing data mechanism. Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during some later wave of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by nonignorable attrition while reducing reliance on strong assumptions about the attrition process. To date, these bias correction methods have not dealt with two key practical issues in panel studies: unit nonresponse in the initial wave of the panel and in the refreshment sample itself. As we illustrate, nonignorable unit nonresponse can significantly compromise the analyst's ability to use the refreshment samples for attrition bias correction. Thus, it is crucial for analysts to assess how sensitive their inferences, corrected for panel attrition, are to different assumptions about the nature of the unit nonresponse. We present an approach that facilitates such sensitivity analyses, both for suspected nonignorable unit nonresponse in the initial wave and in the refreshment sample. We illustrate the approach using simulation studies and an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study.

The second method incorporates informative prior beliefs about marginal probabilities into Bayesian latent class models for categorical data. The basic idea is to append synthetic observations to the original data such that (i) the empirical distributions of the desired margins match those of the prior beliefs, and (ii) the values of the remaining variables are left missing. The degree of prior uncertainty is controlled by the number of augmented records. Posterior inferences can be obtained via typical MCMC algorithms for latent class models, tailored to deal efficiently with the missing values in the concatenated data. We illustrate the approach using a variety of simulations based on data from the American Community Survey, including an example of how augmented records can be used to fit latent class models to data from stratified samples.
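The record-augmentation idea can be sketched for a single margin as follows. The function name, the dictionary-based record layout, and the use of `None` for missing values are assumptions made for illustration. The model-fitting step (MCMC for the latent class model) is not shown.

```python
def augment_with_prior_margin(records, variable, prior_probs, n_aug):
    """Append roughly n_aug synthetic records encoding a prior on one margin.

    Each synthetic record sets `variable` to a level drawn deterministically in
    proportion to prior_probs and leaves every other variable missing (None).
    """
    other_vars = [name for name in records[0] if name != variable]
    synthetic = []
    for level, prob in prior_probs.items():
        # Rounding means the total can differ slightly from n_aug.
        for _ in range(round(prob * n_aug)):
            row = {name: None for name in other_vars}
            row[variable] = level
            synthetic.append(row)
    return records + synthetic


# One observed record plus ten synthetic ones (variable names are made up).
observed = [{"sex": "F", "education": "BA"}]
augmented = augment_with_prior_margin(observed, "sex", {"F": 0.5, "M": 0.5}, 10)
```

A larger `n_aug` expresses a more confident prior, matching the statement above that the number of augmented records controls the degree of prior uncertainty.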

The third method leverages the information from a gold standard survey to model reporting error. Survey data are subject to reporting error when respondents misunderstand the question or accidentally select the wrong response. Sometimes survey respondents knowingly select the wrong response, for example, by reporting a higher level of education than they actually have attained. We present an approach that allows an analyst to model reporting error by incorporating information from a gold standard survey. The analyst can specify various reporting error models and assess how sensitive their conclusions are to different assumptions about the reporting error process. We illustrate the approach using simulations based on data from the 1993 National Survey of College Graduates. We use the method to impute error-corrected educational attainments in the 2010 American Community Survey using the 2010 National Survey of College Graduates as the gold standard survey.

Relevance: 30.00%

Abstract:

Bayesian nonparametric models, such as the Gaussian process and the Dirichlet process, have been extensively applied for target kinematics modeling in various applications including environmental monitoring, traffic planning, endangered species tracking, dynamic scene analysis, autonomous robot navigation, and human motion modeling. As shown by these successful applications, Bayesian nonparametric models are able to adjust their complexities adaptively from data as necessary, and are resistant to overfitting or underfitting. However, most existing works assume that the sensor measurements used to learn the Bayesian nonparametric target kinematics models are obtained a priori or that the target kinematics can be measured by the sensor at any given time throughout the task. Little work has been done on controlling a sensor with a bounded field of view to obtain measurements of mobile targets that are most informative for reducing the uncertainty of the Bayesian nonparametric models. To present a systematic sensor planning approach to learning Bayesian nonparametric models, the Gaussian process target kinematics model is introduced first; it is capable of describing time-invariant spatial phenomena, such as ocean currents, temperature distributions and wind velocity fields. The Dirichlet process-Gaussian process target kinematics model is subsequently discussed for modeling mixtures of mobile targets, such as pedestrian motion patterns.

Novel information theoretic functions are developed for these Bayesian nonparametric target kinematics models to represent the expected utility of measurements as a function of sensor control inputs and random environmental variables. A Gaussian process expected Kullback-Leibler (KL) divergence is developed as the expectation of the KL divergence between the current (prior) and posterior Gaussian process target kinematics models with respect to the future measurements. Then, this approach is extended to develop a new information value function that can be used to estimate target kinematics described by a Dirichlet process-Gaussian process mixture model. A theorem is proposed that shows the novel information theoretic functions are bounded. Based on this theorem, efficient estimators of the new information theoretic functions are designed, which are proved to be unbiased with the variance of the resultant approximation error decreasing linearly as the number of samples increases. The computational complexity of optimizing the novel information theoretic functions under sensor dynamics constraints is studied, and the problem is proved to be NP-hard. A cumulative lower bound is then proposed to reduce the computational complexity to polynomial time.
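The idea of an expected KL divergence estimated by sampling can be illustrated in the simplest possible setting: a scalar Gaussian prior with one noisy future measurement. This is a toy stand-in, not the Gaussian process construction of the dissertation; the function names are hypothetical, and the divergence is taken from posterior to prior, which is an assumption because the abstract does not fix the direction. In this toy case the expectation has the closed form 0.5 * log(1 + prior_var / noise_var), which makes the sample-mean estimator easy to check.

```python
import math
import random


def kl_gaussian(m1, v1, m0, v0):
    """KL( N(m1, v1) || N(m0, v0) ) for scalar Gaussians with variances v1, v0."""
    return 0.5 * (math.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)


def expected_kl_monte_carlo(prior_mean, prior_var, noise_var, n_samples, rng):
    """Sample-mean estimate of E_y[ KL(posterior || prior) ] for y = x + noise."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    predictive_sd = math.sqrt(prior_var + noise_var)
    total = 0.0
    for _ in range(n_samples):
        # Draw a hypothetical future measurement from the prior predictive.
        y = rng.gauss(prior_mean, predictive_sd)
        post_mean = post_var * (prior_mean / prior_var + y / noise_var)
        total += kl_gaussian(post_mean, post_var, prior_mean, prior_var)
    return total / n_samples


estimate = expected_kl_monte_carlo(0.0, 4.0, 1.0, 20000, random.Random(0))
exact = 0.5 * math.log(1.0 + 4.0 / 1.0)
```

A sensor planner would evaluate such a quantity for each candidate control input and prefer the one with the largest expected divergence, since a less noisy or better placed measurement changes the model more.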

Three sensor planning algorithms are developed according to the assumptions on the target kinematics and the sensor dynamics. For problems where the control space of the sensor is discrete, a greedy algorithm is proposed. The efficiency of the greedy algorithm is demonstrated by a numerical experiment with data of ocean currents obtained by moored buoys. A sweep line algorithm is developed for applications where the sensor control space is continuous and unconstrained. Synthetic simulations as well as physical experiments with ground robots and a surveillance camera are conducted to evaluate the performance of the sweep line algorithm. Moreover, a lexicographic algorithm is designed based on the cumulative lower bound of the novel information theoretic functions, for the scenario where the sensor dynamics are constrained. Numerical experiments with real data collected from indoor pedestrians by a commercial pan-tilt camera are performed to examine the lexicographic algorithm. Results from both the numerical simulations and the physical experiments show that the three sensor planning algorithms proposed in this dissertation based on the novel information theoretic functions are superior at learning the target kinematics with little or no prior knowledge.

Relevance: 30.00%

Abstract:

While molecular and cellular processes are often modeled as stochastic processes, such as Brownian motion, chemical reaction networks and gene regulatory networks, there have been few attempts to program a molecular-scale process to physically implement stochastic processes. DNA has been used as a substrate for programming molecular interactions, but its applications are restricted to deterministic functions, and unfavorable properties such as slow processing, thermal annealing, aqueous solvents and difficult readout limit them to proof-of-concept purposes. To date, whether there exists a molecular process that can be programmed to implement stochastic processes for practical applications remains unknown.

In this dissertation, a fully specified Resonance Energy Transfer (RET) network between chromophores is accurately fabricated via DNA self-assembly, and the exciton dynamics in the RET network physically implement a stochastic process, specifically a continuous-time Markov chain (CTMC), which has a direct mapping to the physical geometry of the chromophore network. Excited by a light source, a RET network generates random samples in the temporal domain in the form of fluorescence photons which can be detected by a photon detector. The intrinsic sampling distribution of a RET network is derived as a phase-type distribution configured by its CTMC model. The conclusion is that the exciton dynamics in a RET network implement a general and important class of stochastic processes that can be directly and accurately programmed and used for practical applications of photonics and optoelectronics. Different approaches to using RET networks exist with vast potential applications. As an entropy source that can directly generate samples from virtually arbitrary distributions, RET networks can benefit applications that rely on generating random samples such as 1) fluorescent taggants and 2) stochastic computing.
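A phase-type distribution is the distribution of the time a continuous-time Markov chain takes to reach an absorbing state, so the sampling behavior described above can be mimicked in software by simulating the chain. The sketch below does this for a generic CTMC. The state names and transition rates are made up for illustration and are not taken from the dissertation.

```python
import random


def sample_absorption_time(rates, start, absorbing, rng):
    """Simulate a CTMC from `start` until it hits a state in `absorbing`.

    rates[state] maps each successor state to its transition rate.
    Returns the elapsed time, which is one draw from the phase-type distribution.
    """
    state, elapsed = start, 0.0
    while state not in absorbing:
        outgoing = rates[state]
        total_rate = sum(outgoing.values())
        # Holding time in the current state is exponential with the total rate.
        elapsed += rng.expovariate(total_rate)
        # Choose the next state with probability proportional to its rate.
        threshold = rng.random() * total_rate
        cumulative = 0.0
        for next_state, rate in outgoing.items():
            cumulative += rate
            if threshold < cumulative:
                break
        state = next_state
    return elapsed


# Hypothetical two-chromophore chain: the exciton hops, then a photon is emitted.
rates = {"donor": {"acceptor": 2.0}, "acceptor": {"emitted": 1.0}}
rng = random.Random(1)
times = [sample_absorption_time(rates, "donor", {"emitted"}, rng) for _ in range(20000)]
mean_time = sum(times) / len(times)  # theoretical mean is 1/2.0 + 1/1.0 = 1.5
```

In the physical system the corresponding draw would be the arrival time of a detected fluorescence photon, with the network geometry playing the role of the `rates` table.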

By using RET networks between chromophores to implement fluorescent taggants with temporally coded signatures, the taggant design is not constrained by resolvable dyes and has a significantly larger coding capacity than spectrally or lifetime coded fluorescent taggants. Meanwhile, the taggant detection process becomes highly efficient, and the Maximum Likelihood Estimation (MLE) based taggant identification guarantees high accuracy even with only a few hundred detected photons.

Meanwhile, RET-based sampling units (RSU) can be constructed to accelerate probabilistic algorithms for wide applications in machine learning and data analytics. Because probabilistic algorithms often rely on iteratively sampling from parameterized distributions, they can be inefficient in practice on the deterministic hardware traditional computers use, especially for high-dimensional and complex problems. As an efficient universal sampling unit, the proposed RSU can be integrated into a processor / GPU as specialized functional units or organized as a discrete accelerator to bring substantial speedups and power savings.