11 results for Data-Driven Behavior Modeling
at Duke University
Abstract:
An enterprise information system (EIS) is an integrated data-applications platform characterized by diverse, heterogeneous, and distributed data sources. For many enterprises, a number of business processes still depend heavily on static rule-based methods and extensive human expertise. Enterprises are faced with the need for optimizing operation scheduling, improving resource utilization, discovering useful knowledge, and making data-driven decisions.
This thesis research is focused on real-time optimization and knowledge discovery that addresses workflow optimization, resource allocation, as well as data-driven predictions of process-execution times, order fulfillment, and enterprise service-level performance. In contrast to prior work on data analytics techniques for enterprise performance optimization, the emphasis here is on realizing scalable and real-time enterprise intelligence based on a combination of heterogeneous system simulation, combinatorial optimization, machine-learning algorithms, and statistical methods.
On-demand digital-print service is a representative enterprise requiring a powerful EIS. We use real-life data from Reischling Press, Inc. (RPI), a digital-print-service provider (PSP), to evaluate our optimization algorithms.
In order to handle the increase in volume and diversity of demands, we first present a high-performance, scalable, and real-time production scheduling algorithm for production automation based on an incremental genetic algorithm (IGA). The objective of this algorithm is to optimize the order dispatching sequence and balance resource utilization. Compared to prior work, this solution is scalable for a high volume of orders and it provides fast scheduling solutions for orders that require complex fulfillment procedures. Experimental results highlight its potential benefit in reducing production inefficiencies and enhancing the productivity of an enterprise.
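The idea of searching over order dispatching sequences can be illustrated with a small sketch. This is not the incremental genetic algorithm (IGA) of the thesis: it is a plain, mutation-only evolutionary search with elitism, and the function names, the greedy machine assignment, and the makespan cost are all assumptions made for illustration.

```python
import random


def makespan(sequence, durations, n_machines):
    """Dispatch orders in the given sequence, each to the least-loaded machine.

    Returns the largest machine load, used here as a stand-in cost that
    rewards balanced resource utilization (an assumed objective).
    """
    loads = [0.0] * n_machines
    for order in sequence:
        i = loads.index(min(loads))
        loads[i] += durations[order]
    return max(loads)


def evolve_sequence(durations, n_machines, pop_size=20, generations=50, seed=0):
    """Mutation-only evolutionary search over dispatch orders (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(durations)

    def cost(seq):
        return makespan(seq, durations, n_machines)

    # Start from the arrival order plus random permutations.
    population = [list(range(n))]
    while len(population) < pop_size:
        population.append(rng.sample(range(n), n))

    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop_size // 2]  # elitism: keep the best half
        children = []
        while len(survivors) + len(children) < pop_size:
            child = list(rng.choice(survivors))
            i, j = rng.randrange(n), rng.randrange(n)
            child[i], child[j] = child[j], child[i]  # swap keeps a valid permutation
            children.append(child)
        population = survivors + children

    return min(population, key=cost)
```

Because the arrival order is kept in the initial population and the best half always survives, the returned sequence is never worse than dispatching orders as they arrive.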
We next discuss analysis and prediction of different attributes involved in hierarchical components of an enterprise. We start from a study of the fundamental processes related to real-time prediction. Our process-execution time and process status prediction models integrate statistical methods with machine-learning algorithms. In addition to improving prediction accuracy over stand-alone machine-learning algorithms, these models also provide a probabilistic estimate of the predicted status. An order generally consists of multiple series and parallel processes. We next introduce an order-fulfillment prediction model that combines advantages of multiple classification models by incorporating flexible decision-integration mechanisms. Experimental results show that adopting due dates recommended by the model can significantly reduce the enterprise's late-delivery ratio. Finally, we investigate service-level attributes that reflect the overall performance of an enterprise. We analyze and decompose time-series data into different components according to their hierarchical periodic nature, perform correlation analysis, and develop univariate prediction models for each component as well as multivariate models for correlated components. Predictions for the original time series are aggregated from the predictions of its components. In addition to a significant increase in mid-term prediction accuracy, this distributed modeling strategy also improves short-term time-series prediction accuracy.
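The decompose, predict, and aggregate strategy can be sketched in a few lines. The version below is a deliberately simplified assumption: it uses a single additive periodic component and a constant-level model for the remainder, whereas the thesis describes several hierarchical periodic components and multivariate models. The function names are hypothetical.

```python
def decompose(series, period):
    """Split a series into a per-phase periodic profile and a remainder (additive)."""
    n = len(series)
    mean = sum(series) / n
    profile = []
    for phase in range(period):
        # Average deviation from the overall mean at this phase of the period.
        deviations = [series[i] - mean for i in range(phase, n, period)]
        profile.append(sum(deviations) / len(deviations))
    remainder = [series[i] - profile[i % period] for i in range(n)]
    return profile, remainder


def forecast(series, period, horizon):
    """Predict each component separately, then sum the component predictions."""
    profile, remainder = decompose(series, period)
    n = len(series)
    level = sum(remainder) / n  # naive univariate model for the remainder
    return [level + profile[(n + h) % period] for h in range(horizon)]


history = [10, 20, 30] * 4
prediction = forecast(history, period=3, horizon=3)
```

For this perfectly periodic toy series the aggregated forecast reproduces the next cycle exactly; on real data each component would get its own fitted model.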
In summary, this thesis research has led to a set of characterization, optimization, and prediction tools for an EIS to derive insightful knowledge from data and use it as guidance for production management. It is expected to provide solutions for enterprises to increase reconfigurability, accomplish more automated procedures, and obtain data-driven recommendations or effective decisions.
Abstract:
Head motion during a Positron Emission Tomography (PET) brain scan can considerably degrade image quality. External motion-tracking devices have proven successful in minimizing this effect, but the associated time, maintenance, and workflow changes inhibit their widespread clinical use. List-mode PET acquisition allows for the retroactive analysis of coincidence events on any time scale throughout a scan, and therefore potentially offers a data-driven motion detection and characterization technique. An algorithm was developed to parse list-mode data, divide the full acquisition into short scan intervals, and calculate the line-of-response (LOR) midpoint average for each interval. These LOR midpoint averages, known as “radioactivity centroids,” were presumed to represent the center of the radioactivity distribution in the scanner, and it was thought that changes in this metric over time would correspond to intra-scan motion.
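The interval-averaging step described above can be sketched as follows. This is a minimal illustration under stated assumptions: each coincidence event is taken to be already decoded into a timestamp and the two LOR endpoints in scanner coordinates (real list-mode formats are vendor-specific and are not parsed here), and the function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def radioactivity_centroids(events, interval_s=1.0):
    """Average the LOR midpoints within each scan interval.

    events: iterable of (t, p1, p2), with t in seconds and p1, p2 the
            (x, y, z) endpoints of the line of response in cm (assumed layout).
    Returns a dict mapping interval index to the (x, y, z) centroid.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for t, p1, p2 in events:
        acc = sums[int(t // interval_s)]
        for axis in range(3):
            acc[axis] += 0.5 * (p1[axis] + p2[axis])  # LOR midpoint
        acc[3] += 1  # event count for this interval
    return {
        k: tuple(s / acc[3] for s in acc[:3])
        for k, acc in sorted(sums.items())
    }


events = [
    (0.1, (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    (0.5, (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)),
    (1.2, (0.0, 0.0, 1.0), (0.0, 0.0, 3.0)),
]
centroids = radioactivity_centroids(events)
```

A shift in the centroid between consecutive intervals would then be read as a candidate motion event.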
Several scans were taken of the 3D Hoffman brain phantom on a GE Discovery IQ PET/CT scanner to test the ability of the radioactivity centroid to indicate intra-scan motion. Each scan incrementally surveyed motion in a different degree of freedom (2 translational and 2 rotational). The radioactivity centroids calculated from these scans correlated linearly to phantom positions/orientations. Centroid measurements over 1-second intervals performed on scans with ~1 mCi of activity in the center of the field of view had standard deviations of 0.026 cm in the x- and y-dimensions and 0.020 cm in the z-dimension, which demonstrates high precision and repeatability in this metric. Radioactivity centroids are thus shown to successfully represent discrete motions on the submillimeter scale. It is also shown that while the radioactivity centroid can precisely indicate the amount of motion during an acquisition, it fails to distinguish what type of motion occurred.
Abstract:
Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.
We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.
This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of protein-DNA interaction landscape through data integration.
Abstract:
While genome-wide gene expression data are generated at an increasing rate, the repertoire of approaches for pattern discovery in these data is still limited. Identifying subtle patterns of interest in large amounts of data (tens of thousands of profiles) associated with a certain level of noise remains a challenge. A microarray time series was recently generated to study the transcriptional program of the mouse segmentation clock, a biological oscillator associated with the periodic formation of the segments of the body axis. A method related to Fourier analysis, the Lomb-Scargle periodogram, was used to detect periodic profiles in the dataset, leading to the identification of a novel set of cyclic genes associated with the segmentation clock. Here, we applied to the same microarray time series dataset four distinct mathematical methods to identify significant patterns in gene expression profiles. These methods are called: Phase consistency, Address reduction, Cyclohedron test and Stable persistence, and are based on different conceptual frameworks that are either hypothesis- or data-driven. Some of the methods, unlike Fourier transforms, are not dependent on the assumption of periodicity of the pattern of interest. Remarkably, these methods identified blindly the expression profiles of known cyclic genes as the most significant patterns in the dataset. Many candidate genes predicted by more than one approach appeared to be true positive cyclic genes and will be of particular interest for future research. In addition, these methods predicted novel candidate cyclic genes that were consistent with previous biological knowledge and experimental validation in mouse embryos. 
Our results demonstrate the utility of these novel pattern detection strategies, notably for detection of periodic profiles, and suggest that combining several distinct mathematical approaches to analyze microarray datasets is a valuable strategy for identifying genes that exhibit novel, interesting transcriptional patterns.
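For readers unfamiliar with the Lomb-Scargle periodogram mentioned above, the following is a minimal pure-Python sketch of the classical (unnormalized) form for unevenly sampled data. It is an illustration only, not the analysis pipeline used in the study; the sampling times, test signal, and candidate periods are made up.

```python
import math


def lomb_scargle(times, values, angular_freqs):
    """Classical Lomb-Scargle power at each angular frequency (rad per time unit)."""
    mean = sum(values) / len(values)
    y = [v - mean for v in values]
    power = []
    for w in angular_freqs:
        # Time offset tau makes the sine and cosine terms orthogonal.
        s2 = sum(math.sin(2 * w * t) for t in times)
        c2 = sum(math.cos(2 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2 * w)
        yc = sum(yi * math.cos(w * (t - tau)) for yi, t in zip(y, times))
        ys = sum(yi * math.sin(w * (t - tau)) for yi, t in zip(y, times))
        cc = sum(math.cos(w * (t - tau)) ** 2 for t in times)
        ss = sum(math.sin(w * (t - tau)) ** 2 for t in times)
        power.append(0.5 * (yc * yc / cc + ys * ys / ss))
    return power


# Unevenly spaced sampling times and a synthetic profile with period 2.
times = [0.5 * i + 0.05 * ((7 * i) % 5) for i in range(24)]
profile = [math.sin(2 * math.pi * t / 2.0) for t in times]
periods = [1.3, 2.0, 3.1]
power = lomb_scargle(times, profile, [2 * math.pi / p for p in periods])
best_period = periods[power.index(max(power))]
```

Ranking expression profiles by their peak power at a period of interest is one simple way to flag candidate cyclic genes.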
Abstract:
To investigate the neural systems that contribute to the formation of complex, self-relevant emotional memories, dedicated fans of rival college basketball teams watched a competitive game while undergoing functional magnetic resonance imaging (fMRI). During a subsequent recognition memory task, participants were shown video clips depicting plays of the game, stemming either from previously-viewed game segments (targets) or from non-viewed portions of the same game (foils). After an old-new judgment, participants provided emotional valence and intensity ratings of the clips. A data-driven approach was first used to decompose the fMRI signal acquired during free viewing of the game into spatially independent components. Correlations were then calculated between the identified components and post-scanning emotion ratings for successfully encoded targets. Two components were correlated with intensity ratings, including temporal lobe regions implicated in memory and emotional functions, such as the hippocampus and amygdala, as well as a midline fronto-cingulo-parietal network implicated in social cognition and self-relevant processing. These data were supported by a general linear model analysis, which revealed additional valence effects in fronto-striatal-insular regions when plays were divided into positive and negative events according to the fan's perspective. Overall, these findings contribute to our understanding of how emotional factors impact distributed neural systems to successfully encode dynamic, personally-relevant event sequences.
Abstract:
Antillean manatees (Trichechus manatus manatus) were heavily hunted in the past throughout the Wider Caribbean Region (WCR), and are currently listed as endangered on the IUCN Red List of Threatened Species. In most WCR countries, including Haiti and the Dominican Republic, remaining manatee populations are believed to be small and declining, but current information is needed on their status, distribution, and local threats to the species.
To assess the past and current distribution and conservation status of the Antillean manatee in Hispaniola, I conducted a systematic review of documentary archives dating from the pre-Columbian era to 2013. I then surveyed more than 670 artisanal fishers from Haiti and the Dominican Republic in 2013-2014 using a standardized questionnaire. Finally, to identify important areas for manatees in the Dominican Republic, I developed a country-wide ensemble model of manatee distribution, and compared modeled hotspots with those identified by fishers.
Manatees were historically abundant in Hispaniola, but were hunted for their meat and became relatively rare by the end of the 19th century. The use of manatee body parts diversified with time to include their oil, skin, and bones. Traditional uses for folk medicine and handcrafts persist today in coastal communities in the Dominican Republic. Most threats to Antillean manatees in Hispaniola are anthropogenic in nature, and most mortality is caused by fisheries. I estimated a minimum island-wide annual mortality of approximately 20 animals. To understand the impact of this level of mortality, and to provide a baseline for measuring the success of future conservation actions, the Dominican Republic and Haiti should work together to obtain a reliable estimate of the current population size of manatees in Hispaniola.
In Haiti, the survey of fishers showed a wider distribution range of the species than suggested by the documentary archive review: fishers reported recent manatee sightings in seven of nine coastal departments, and three manatee hotspot areas were identified in the north, central, and south coasts. Thus, the contracted manatee distribution range suggested by the documentary archive review likely reflects a lack of research in Haiti. Both the review and the interviews agreed that manatees no longer occupy freshwater habitats in the country. In general, more dedicated manatee studies are needed in Haiti, employing aerial, land, or boat surveys.
In the Dominican Republic, the documentary archive review and the survey of fishers showed that manatees still occur throughout the country, and occasionally occupy freshwater habitats. Monte Cristi province in the north coast, and Barahona province in the south coast, were identified as focal areas. Sighting reports of manatees decreased from Monte Cristi eastwards to the adjacent province in the Dominican Republic, and westwards into Haiti. Along the north coast of Haiti, the number of manatee sighting and capture reports decreased with increasing distance to Monte Cristi province. There was good agreement among the modeled manatee hotspots, hotspots identified by fishers, and hotspots identified during previous dedicated manatee studies. The concordance of these results suggests that the distribution and patterns of habitat use of manatees in the Dominican Republic have not changed dramatically in over 30 years, and that the remaining manatees exhibit some degree of site fidelity. The ensemble modeling approach used in the present study produced accurate and detailed maps of manatee distribution with minimum data requirements. This modeling strategy is replicable and readily transferable to other countries in the Caribbean or elsewhere with limited data on a species of interest.
The intrinsic value of manatees was stronger for artisanal fishers in the Dominican Republic than in Haiti, and most Dominican fishers showed a positive attitude towards manatee conservation. The Dominican Republic is an upper middle income country with a high Human Development Index. It possesses a legal framework that specifically protects manatees, and has a greater number of marine protected areas, more dedicated manatee studies, and more manatee education and awareness campaigns than Haiti. The constant presence of manatees in specific coastal segments of the Dominican Republic, the perceived decline in the number of manatee captures, and a more conservation-minded public, offer hope for manatee conservation, as non-consumptive uses of manatees become more popular. I recommend a series of conservation actions in the Dominican Republic, including: reducing risks to manatees from harmful fishing gear and watercraft at confirmed manatee hotspots; providing economic alternatives for displaced fishers, and developing responsible ecotourism ventures for manatee watching; improving law enforcement to reduce fisheries-related manatee deaths, stop the illegal trade in manatee body parts, and better protect manatee habitat; and continuing education and awareness campaigns for coastal communities near manatee hotspots.
In contrast, most fishers in Haiti continue to value manatees as a source of food and income, and showed a generally negative attitude towards manatee conservation. Haiti is a low income country with a low Human Development Index. Only a single dedicated manatee study has been conducted in Haiti, and manatees are not officially protected. Positive initiatives for manatees in Haiti include: protected areas declared in 2013 and 2014 that enclose two of the manatee hotspots identified in the present study; and local organizations that are currently working on coastal and marine environmental issues, including research and education on marine mammals. Future conservation efforts for manatees in Haiti should focus on addressing poverty and providing viable economic alternatives for coastal communities. I recommend a community partnership approach for manatee conservation, paired with education and awareness campaigns to inform coastal communities about the conservation situation of manatees in Haiti, and to help change their perceived value. Haiti should also provide legal protection for manatees and their habitat.
Abstract:
Based on thermodynamic principles, we derive expressions quantifying the non-harmonic vibrational behavior of materials, which are rigorous yet easily evaluated from experimentally available data for the thermal expansion coefficient and the phonon density of states. These experimentally derived quantities are valuable to benchmark first-principles theoretical predictions of harmonic and non-harmonic thermal behaviors using perturbation theory, ab initio molecular-dynamics, or Monte-Carlo simulations. We illustrate this analysis by computing the harmonic, dilational, and anharmonic contributions to the entropy, internal energy, and free energy of elemental aluminum and the ordered compound FeSi over a wide range of temperature. Results agree well with previous data in the literature and provide an efficient approach to estimate anharmonic effects in materials.
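As orientation for the three contributions named above, the relations below show one common textbook way to split the vibrational entropy. They are stated here as an assumption and are not necessarily the expressions derived in the paper. Here g(ε) is the phonon density of states normalized to unity, n is the Planck occupation factor, B is the bulk modulus, v the volume per atom, and α the linear thermal expansion coefficient.

```latex
% Harmonic vibrational entropy per atom from the phonon DOS
S_{\mathrm{h}}(T) = 3 k_B \int_0^{\infty} g(\varepsilon)\,
  \bigl[(n+1)\ln(n+1) - n\ln n\bigr]\, d\varepsilon,
\qquad
n(\varepsilon, T) = \frac{1}{e^{\varepsilon / k_B T} - 1}

% Dilational contribution from thermal expansion,
% using C_p - C_V = 9 B v \alpha^2 T
S_{\mathrm{d}}(T) = \int_0^{T} 9\, B\, v\, \alpha^2 \, dT'

% Remainder attributed to anharmonicity
S_{\mathrm{anh}}(T) = S_{\mathrm{total}}(T) - S_{\mathrm{h}}(T) - S_{\mathrm{d}}(T)
```

The first two terms need only a measured phonon density of states and thermal expansion data, which is why a decomposition of this kind is convenient for benchmarking simulations.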
Abstract:
A class of multi-process models is developed for collections of time indexed count data. Autocorrelation in counts is achieved with dynamic models for the natural parameter of the binomial distribution. In addition to modeling binomial time series, the framework includes dynamic models for multinomial and Poisson time series. Markov chain Monte Carlo (MCMC) and Pólya-Gamma data augmentation (Polson et al., 2013) are critical for fitting multi-process models of counts. To facilitate computation when the counts are high, a Gaussian approximation to the Pólya-Gamma random variable is developed.
Three applied analyses are presented to explore the utility and versatility of the framework. The first analysis develops a model for complex dynamic behavior of themes in collections of text documents. Documents are modeled as a “bag of words”, and the multinomial distribution is used to characterize uncertainty in the vocabulary terms appearing in each document. State-space models for the natural parameters of the multinomial distribution induce autocorrelation in themes and their proportional representation in the corpus over time.
The second analysis develops a dynamic mixed membership model for Poisson counts. The model is applied to a collection of time series which record neuron-level firing patterns in rhesus monkeys. The monkey is exposed to two sounds simultaneously, and Gaussian processes are used to smoothly model the time-varying rate at which the neuron’s firing pattern fluctuates between features associated with each sound in isolation.
The third analysis presents a switching dynamic generalized linear model for the time-varying home run totals of professional baseball players. The model endows each player with an age specific latent natural ability class and a performance enhancing drug (PED) use indicator. As players age, they randomly transition through a sequence of ability classes in a manner consistent with traditional aging patterns. When the performance of the player significantly deviates from the expected aging pattern, he is identified as a player whose performance is consistent with PED use.
All three models provide a mechanism for sharing information across related series locally in time. The models are fit with variations on the Pólya-Gamma Gibbs sampler, MCMC convergence diagnostics are developed, and reproducible inference is emphasized throughout the dissertation.
Abstract:
BACKGROUND: Sharing of epidemiological and clinical data sets among researchers is poor at best, to the detriment of science and the community at large. The purpose of this paper is therefore to (1) describe a novel Web application designed to share information on study data sets focusing on epidemiological clinical research in a collaborative environment and (2) create a policy model placing this collaborative environment into the current scientific social context. METHODOLOGY: The Database of Databases application was developed based on feedback from epidemiologists and clinical researchers requiring a Web-based platform that would allow for sharing of information about epidemiological and clinical study data sets in a collaborative environment. This platform should ensure that researchers can modify the information. Model-based predictions of the number of publications and funding resulting from combinations of different policy implementation strategies (for metadata and data sharing) were generated using System Dynamics modeling. PRINCIPAL FINDINGS: The application allows researchers to easily upload information about clinical study data sets, which is searchable and modifiable by other users in a wiki environment. All modifications are filtered by the database principal investigator in order to maintain quality control. The application has been extensively tested and currently contains 130 clinical study data sets from the United States, Australia, China and Singapore. Model results indicated that any policy implementation would be better than the current strategy, that metadata sharing is better than data-sharing, and that combined policies achieve the best results in terms of publications.
CONCLUSIONS: Based on our empirical observations and resulting model, the social network environment surrounding the application can help epidemiologists and clinical researchers contribute and search for metadata in a collaborative environment, thus potentially facilitating collaboration efforts among research communities distributed around the globe.
Abstract:
The full-scale base-isolated structure studied in this dissertation is the only base-isolated building in the South Island of New Zealand. It sustained hundreds of earthquake ground motions from September 2010 and well into 2012. Several large earthquake responses were recorded in December 2011 by NEES@UCLA and by a GeoNet recording station near Christchurch Women's Hospital. The primary focus of this dissertation is to advance the state of the art of the methods to evaluate performance of seismic-isolated structures and the effects of soil-structure interaction by developing new data processing methodologies to overcome current limitations and by implementing advanced numerical modeling in OpenSees for direct analysis of soil-structure interaction.
This dissertation presents a novel method for recovering force-displacement relations within the isolators of building structures with unknown nonlinearities from sparse seismic-response measurements of floor accelerations. The method requires only direct matrix calculations (factorizations and multiplications); no iterative trial-and-error methods are required. The method requires a mass matrix, or at least an estimate of the floor masses. A stiffness matrix may be used, but is not necessary. Essentially, the method operates on a matrix of incomplete measurements of floor accelerations. In the special case of complete floor measurements of systems with linear dynamics, real modes, and equal floor masses, the principal components of this matrix are the modal responses. In the more general case of partial measurements and nonlinear dynamics, the method extracts a number of linearly-dependent components from Hankel matrices of measured horizontal response accelerations, assembles these components row-wise and extracts principal components from the singular value decomposition of this large matrix of linearly-dependent components. These principal components are then interpolated between floors in a way that minimizes the curvature energy of the interpolation. This interpolation step can make use of a reduced-order stiffness matrix, a backward difference matrix or a central difference matrix. The measured and interpolated floor acceleration components at all floors are then assembled and multiplied by a mass matrix. The recovered in-service force-displacement relations are then incorporated into the OpenSees soil structure interaction model.
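The final step, multiplying floor accelerations by a mass matrix, can be illustrated with a small sketch. It is not the dissertation's method, which also involves Hankel matrices, a singular value decomposition, and curvature-minimizing interpolation. The sketch assumes a lumped (diagonal) mass matrix, absolute horizontal accelerations already available at every floor above the isolation plane, and horizontal equilibrium of the superstructure, so that the isolator shear force balances the summed floor inertia forces. The function name is hypothetical.

```python
def isolator_force(masses, accelerations):
    """Estimate isolator shear force history from floor accelerations.

    masses: floor masses above the isolation plane, in kg (assumed lumped).
    accelerations: list of time samples; each sample lists the absolute
                   horizontal acceleration of every floor, in m/s^2.
    Returns the isolator shear force at each time sample, in N.
    """
    forces = []
    for sample in accelerations:
        if len(sample) != len(masses):
            raise ValueError("one acceleration per floor mass is required")
        # Newton's second law on the superstructure: F_iso = -sum(m_i * a_i)
        forces.append(-sum(m * a for m, a in zip(masses, sample)))
    return forces


forces = isolator_force([2.0, 1.0], [[1.0, 2.0], [-0.5, 0.0]])
```

Plotting such a force history against the measured isolator displacement gives a force-displacement loop of the kind the abstract refers to.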
Numerical simulations of soil-structure interaction involving non-uniform soil behavior are conducted following the development of the complete soil-structure interaction model of Christchurch Women's Hospital in OpenSees. In these 2D OpenSees models, the superstructure is modeled as two-dimensional frames in short span and long span, respectively. The lead rubber bearings are modeled as elastomeric bearing (Bouc-Wen) elements. The soil underlying the concrete raft foundation is modeled with linear elastic plane strain quadrilateral elements. The non-uniformity of the soil profile is incorporated by extraction and interpolation of the shear wave velocity profile from the Canterbury Geotechnical Database. The validity of the complete two-dimensional soil-structure interaction OpenSees model for the hospital is checked by comparing the peak floor responses and force-displacement relations within the isolation system obtained from OpenSees simulations to the recorded measurements. General explanations and implications, supported by displacement drifts, floor acceleration and displacement responses, and force-displacement relations, are described to address the effects of soil-structure interaction.