10 results for data integration
at Duke University
Abstract:
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
Abstract:
Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.
We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
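As a rough illustration of Bayes-factor-based scoring, the sketch below compares a candidate nucleosome window under two fully specified Poisson models: one with a position-dependent cut-rate profile and one with a flat background rate. For two fully specified hypotheses the Bayes factor reduces to a likelihood ratio, which is what the code computes. The Poisson assumption, the toy rate profile, and the function names are illustrative assumptions, not the dissertation's actual model.

```python
import math

def poisson_logpmf(k, lam):
    """Log-probability of observing k cuts under a Poisson rate lam (> 0)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_bayes_factor(counts, nuc_rates, bg_rate):
    """Log Bayes factor of a nucleosome model versus a flat background.

    counts    : observed cut counts at each position in the window
    nuc_rates : hypothetical expected cut rate at each position if a
                nucleosome occupies the window
    bg_rate   : hypothetical expected cut rate under the background model
    Summing over the window scores all positions jointly.
    """
    return sum(
        poisson_logpmf(k, lam) - poisson_logpmf(k, bg_rate)
        for k, lam in zip(counts, nuc_rates)
    )

# Toy profile: cutting is suppressed in the protected centre of the window.
profile = [4.0, 1.0, 1.0, 1.0, 4.0]
background = sum(profile) / len(profile)
protected = log_bayes_factor([4, 1, 1, 1, 4], profile, background)
flat = log_bayes_factor([2, 2, 2, 2, 2], profile, background)
```

A positive score favours the nucleosome model and a negative score favours the background; in this toy case the protected-looking window scores above zero and the flat window below.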
Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.
This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of protein-DNA interaction landscape through data integration.
Abstract:
BACKGROUND: A hierarchical taxonomy of organisms is a prerequisite for semantic integration of biodiversity data. Ideally, there would be a single, expansive, authoritative taxonomy that includes extinct and extant taxa, information on synonyms and common names, and monophyletic supraspecific taxa that reflect our current understanding of phylogenetic relationships. DESCRIPTION: As a step towards development of such a resource, and to enable large-scale integration of phenotypic data across vertebrates, we created the Vertebrate Taxonomy Ontology (VTO), a semantically defined taxonomic resource derived from the integration of existing taxonomic compilations, and freely distributed under a Creative Commons Zero (CC0) public domain waiver. The VTO includes both extant and extinct vertebrates and currently contains 106,947 taxonomic terms, 22 taxonomic ranks, 104,736 synonyms, and 162,400 cross-references to other taxonomic resources. Key challenges in constructing the VTO included (1) extracting and merging names, synonyms, and identifiers from heterogeneous sources; (2) structuring hierarchies of terms based on evolutionary relationships and the principle of monophyly; and (3) automating this process as much as possible to accommodate updates in source taxonomies. CONCLUSIONS: The VTO is the primary source of taxonomic information used by the Phenoscape Knowledgebase (http://phenoscape.org/), which integrates genetic and evolutionary phenotype data across both model and non-model vertebrates. The VTO is useful for inferring phenotypic changes on the vertebrate tree of life, which enables queries for candidate genes for various episodes in vertebrate evolution.
Abstract:
Hydrologic research is a very demanding application of fiber-optic distributed temperature sensing (DTS) in terms of precision, accuracy and calibration. The physics behind the most frequently used DTS instruments are considered as they apply to four calibration methods for single-ended DTS installations. The new methods presented are more accurate than the instrument-calibrated data, achieving accuracies on the order of tenths of a degree root mean square error (RMSE) and mean bias. Effects of localized non-uniformities that violate the assumptions of single-ended calibration data are explored and quantified. Experimental design considerations such as selection of integration times or selection of the length of the reference sections are discussed, and the impacts of these considerations on calibrated temperatures are explored in two case studies.
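The following is a minimal sketch of how a single-ended calibration could be solved from three reference sections. It assumes the commonly cited relation T(z) = gamma / (ln(P_S / P_aS) + C - dalpha * z) between temperature in kelvin and the Stokes/anti-Stokes power ratio. The abstract does not state this equation, so the model form, the function names (`solve3`, `calibrate`, `temperature`), and the synthetic values are assumptions for illustration, not the authors' procedure.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a x = b by Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(a)
    if abs(d) < 1e-12:
        raise ValueError("reference sections do not determine the parameters")
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]  # replace one column with the right-hand side
        solution.append(det(m) / d)
    return solution

def calibrate(refs):
    """Estimate (gamma, C, dalpha) from three reference points.

    refs: three (z_metres, T_kelvin, ln_ratio) tuples, where ln_ratio is
    ln(P_S / P_aS). Rearranging the assumed model gives an equation that is
    linear in the unknowns: gamma - C*T + dalpha*T*z = T*ln_ratio.
    """
    a = [[1.0, -T, T * z] for z, T, _ in refs]
    b = [T * ln_ratio for _, T, ln_ratio in refs]
    gamma, C, dalpha = solve3(a, b)
    return gamma, C, dalpha

def temperature(z, ln_ratio, gamma, C, dalpha):
    """Calibrated temperature in kelvin at distance z along the fiber."""
    return gamma / (ln_ratio + C - dalpha * z)
```

Under this assumed model the three points must span at least two distinct temperatures and be separated along the fiber; otherwise the system is singular and `calibrate` raises an error.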
Abstract:
An enterprise information system (EIS) is an integrated data-applications platform characterized by diverse, heterogeneous, and distributed data sources. For many enterprises, a number of business processes still depend heavily on static rule-based methods and extensive human expertise. Enterprises are faced with the need for optimizing operation scheduling, improving resource utilization, discovering useful knowledge, and making data-driven decisions.
This thesis research is focused on real-time optimization and knowledge discovery that addresses workflow optimization, resource allocation, as well as data-driven predictions of process-execution times, order fulfillment, and enterprise service-level performance. In contrast to prior work on data analytics techniques for enterprise performance optimization, the emphasis here is on realizing scalable and real-time enterprise intelligence based on a combination of heterogeneous system simulation, combinatorial optimization, machine-learning algorithms, and statistical methods.
On-demand digital-print service is a representative enterprise requiring a powerful EIS. We use real-life data from Reischling Press, Inc. (RPI), a digital-print-service provider (PSP), to evaluate our optimization algorithms.
In order to handle the increase in volume and diversity of demands, we first present a high-performance, scalable, and real-time production scheduling algorithm for production automation based on an incremental genetic algorithm (IGA). The objective of this algorithm is to optimize the order dispatching sequence and balance resource utilization. Compared to prior work, this solution is scalable for a high volume of orders and it provides fast scheduling solutions for orders that require complex fulfillment procedures. Experimental results highlight its potential benefit in reducing production inefficiencies and enhancing the productivity of an enterprise.
We next discuss analysis and prediction of different attributes involved in hierarchical components of an enterprise. We start from a study of the fundamental processes related to real-time prediction. Our process-execution time and process status prediction models integrate statistical methods with machine-learning algorithms. In addition to improving prediction accuracy compared to stand-alone machine-learning algorithms, they also provide a probabilistic estimation of the predicted status. An order generally consists of multiple series and parallel processes. We next introduce an order-fulfillment prediction model that combines advantages of multiple classification models by incorporating flexible decision-integration mechanisms. Experimental results show that adopting due dates recommended by the model can significantly reduce the enterprise late-delivery ratio. Finally, we investigate service-level attributes that reflect the overall performance of an enterprise. We analyze and decompose time-series data into different components according to their hierarchical periodic nature, perform correlation analysis, and develop univariate prediction models for each component as well as multivariate models for correlated components. Predictions for the original time series are aggregated from the predictions of its components. In addition to a significant increase in mid-term prediction accuracy, this distributed modeling strategy also improves short-term time-series prediction accuracy.
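The decompose, predict, and aggregate idea can be sketched as follows. This is a deliberately simple stand-in: it uses a single fixed period, a per-position mean as the periodic component, and a trailing mean as the model for what remains. The thesis's actual decomposition and prediction models are not given in the abstract, so the function names and the choice of component models here are assumptions.

```python
from statistics import mean

def decompose(series, period):
    """Split a series into a periodic component and a residual component."""
    overall = mean(series)
    # Periodic component: average deviation at each position in the cycle.
    seasonal = [mean(series[i::period]) - overall for i in range(period)]
    # Residual component: what is left once the periodic part is removed.
    residual = [x - seasonal[i % period] for i, x in enumerate(series)]
    return seasonal, residual

def forecast(series, period, horizon, window=None):
    """Predict each component separately, then aggregate the predictions."""
    seasonal, residual = decompose(series, period)
    window = window or period
    level = mean(residual[-window:])  # naive model for the residual part
    n = len(series)
    # The periodic component is predicted by continuing its cycle.
    return [level + seasonal[(n + h) % period] for h in range(horizon)]
```

For example, `forecast(daily_orders, period=7, horizon=14)` would continue a weekly pattern for two weeks, where `daily_orders` is a hypothetical list of daily totals.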
In summary, this thesis research has led to a set of characterization, optimization, and prediction tools for an EIS to derive insightful knowledge from data and use them as guidance for production management. It is expected to provide solutions for enterprises to increase reconfigurability, accomplish more automated procedures, and obtain data-driven recommendations or effective decisions.
Abstract:
BACKGROUND: The wealth of phenotypic descriptions documented in the published articles, monographs, and dissertations of phylogenetic systematics is traditionally reported in a free-text format, and it is therefore largely inaccessible for linkage to biological databases for genetics, development, and phenotypes, and difficult to manage for large-scale integrative work. The Phenoscape project aims to represent these complex and detailed descriptions with rich and formal semantics that are amenable to computation and integration with phenotype data from other fields of biology. This entails reconceptualizing the traditional free-text characters into the computable Entity-Quality (EQ) formalism using ontologies. METHODOLOGY/PRINCIPAL FINDINGS: We used ontologies and the EQ formalism to curate a collection of 47 phylogenetic studies on ostariophysan fishes (including catfishes, characins, minnows, knifefishes) and their relatives with the goal of integrating these complex phenotype descriptions with information from an existing model organism database (zebrafish, http://zfin.org). We developed a curation workflow for the collection of character, taxonomic and specimen data from these publications. A total of 4,617 phenotypic characters (10,512 states) for 3,449 taxa, primarily species, were curated into EQ formalism (for a total of 12,861 EQ statements) using anatomical and taxonomic terms from teleost-specific ontologies (Teleost Anatomy Ontology and Teleost Taxonomy Ontology) in combination with terms from a quality ontology (Phenotype and Trait Ontology). Standards and guidelines for consistently and accurately representing phenotypes were developed in response to the challenges that were evident from two annotation experiments and from feedback from curators. 
CONCLUSIONS/SIGNIFICANCE: The challenges we encountered and many of the curation standards and methods for improving consistency that we developed are generally applicable to any effort to represent phenotypes using ontologies. This is because an ontological representation of the detailed variations in phenotype, whether between mutant or wildtype, among individual humans, or across the diversity of species, requires a process by which a precise combination of terms from domain ontologies are selected and organized according to logical relations. The efficiencies that we have developed in this process will be useful for any attempt to annotate complex phenotypic descriptions using ontologies. We also discuss some ramifications of EQ representation for the domain of systematics.
Abstract:
Building on the planning efforts of the RCN4GSC project, a workshop was convened in San Diego to bring together experts from genomics and metagenomics, biodiversity, ecology, and bioinformatics with the charge to identify potential for positive interactions and progress, especially building on successes at establishing data standards by the GSC and by the biodiversity and ecological communities. Until recently, the contribution of microbial life to the biomass and biodiversity of the biosphere was largely overlooked (because it was resistant to systematic study). Now, emerging genomic and metagenomic tools are making investigation possible. Initial research findings suggest that major advances are in the offing. Although different research communities share some overlapping concepts and traditions, they differ significantly in sampling approaches, vocabularies and workflows. Likewise, their definitions of 'fitness for use' for data differ significantly, as this concept stems from the specific research questions of most importance in the different fields. Nevertheless, there is little doubt that there is much to be gained from greater coordination and integration. As a first step toward interoperability of the information systems used by the different communities, participants agreed to conduct a case study on two of the leading data standards from the two formerly disparate fields: (a) GSC's standard checklists for genomics and metagenomics and (b) TDWG's Darwin Core standard, used primarily in taxonomy and systematic biology.
Abstract:
In chimpanzees, most females disperse from the community in which they were born to reproduce in a new community, thereby eliminating the risk of inbreeding with close kin. However, across sites, some females breed in their natal community, raising questions about the flexibility of dispersal, the costs and benefits of different strategies and the mitigation of costs associated with dispersal and integration. In this dissertation I address these questions by combining long-term behavioral data and recent field observations on maturing and young adult females in Gombe National Park with an experimental manipulation of relationship formation in captive apes in the Congo.
To assess the risk of inbreeding for females who do and do not disperse, 129 chimpanzees were genotyped and relatedness between each dyad was calculated. Natal females were more closely related to adult community males than were immigrant females. By examining the parentage of 58 surviving offspring, I found that natal females were not more related to the sires of their offspring than were immigrant females, despite three instances of close inbreeding. The sires of all offspring were less related to the mothers than non-sires regardless of the mother’s residence status. These results suggest that chimpanzees are capable of detecting relatedness and that, even when remaining natal, females can largely avoid, though not eliminate, inbreeding.
Next, I examined whether dispersal was associated with energetic, social, physiological and/or reproductive costs by comparing immigrant (n=10) and natal (n=9) females of similar age using 2358 hours of observational data. Natal and immigrant females did not differ in any energetic metric. Immigrant females received aggression from resident females more frequently than natal females. Immigrants spent less time in social grooming and more time self-grooming than natal females. Immigrant females primarily associated with resident males, had more social partners and lacked close social allies. There was no difference in levels of fecal glucocorticoid metabolites in immigrant and natal females. Immigrant females gave birth 2.5 years later than natal females, though the survival of their first offspring did not differ. These results indicate that immigrant females in Gombe National Park do not face energetic deficits upon transfer, but they do enter a hostile social environment and have a delayed first birth.
Next, I examined whether chimpanzees use condition- and phenotype-dependent cues in making dispersal decisions. I examined the effect of social and environmental conditions present at the time females of known age matured (n=25) on the females’ dispersal decisions. Females were more likely to disperse if they had more male maternal relatives and thus, a high risk of inbreeding. Females with a high ranking mother and multiple maternal female kin tended to disperse less frequently, suggesting that a strong female kin network provides benefits to the maturing daughter. Females were also somewhat less likely to disperse when fewer unrelated males were present in the group. Habitat quality and intrasexual competition did not affect dispersal decisions. Using a larger sample of 62 females observed as adults in Gombe, I also detected an effect of phenotypic differences in personality on the female’s dispersal decisions; extraverted, agreeable and open females were less likely to disperse.
Natural observations show that apes use grooming and play as social currency, but no experimental manipulations have been carried out to measure the effects of these behaviors on relationship formation, an essential component of integration. Thirty chimpanzees and 25 bonobos were given a choice between an unfamiliar human who had recently groomed or played with them and one who had not. Both species showed a preference for the human that had interacted with them, though the effect was driven by males. These results support the idea that grooming and play act as social currency in great apes that can rapidly shape social relationships between unfamiliar individuals. Further investigation is needed to elucidate the use of social currency in female apes.
I conclude that dispersal in female chimpanzees is flexible and the balance of costs and benefits varies for each individual. Females likely take into account social cues present at maturity and their own phenotype in choosing a settlement path and are especially sensitive to the presence of maternal male kin. The primary cost associated with philopatry is inbreeding risk and the primary cost associated with dispersal is delay in the age at first birth, presumably resulting from intense social competition. Finally, apes may strategically make use of affiliative behavior in pursuing particular relationships, something that should be useful in the integration process.
Abstract:
What role do state party organizations play in twenty-first century American politics? What is the nature of the relationship between the state and national party organizations in contemporary elections? These questions frame the three studies presented in this dissertation. More specifically, I examine the organizational development of the state party organizations and the strategic interactions and connections between the state and national party organizations in contemporary elections.
In the first empirical chapter, I argue that the Internet Age represents a significant transitional period for state party organizations. Using data collected from surveys of state party leaders, this chapter reevaluates and updates existing theories of party organizational strength and demonstrates the importance of new indicators of party technological capacity to our understanding of party organizational development in the early twenty-first century. In the second chapter, I ask whether the national parties utilize different strategies in deciding how to allocate resources to state parties through fund transfers and through the 50-state-strategy party-building programs that both the Democratic and Republican National Committees advertised during the 2010 elections. Analyzing data collected from my 2011 state party survey and party-fund-transfer data collected from the Federal Election Commission, I find that the national parties considered a combination of state and national electoral concerns in directing assistance to the state parties through their 50-state strategies, as opposed to the strict battleground-state strategy that explains party fund transfers. In my last chapter, I examine the relationships between platforms issued by Democratic and Republican state and national parties and the strategic considerations that explain why state platforms vary in their degree of similarity to the national platform. I analyze an extensive platform dataset, using cluster analysis and document similarity measures to compare platform content across the 1952 to 2014 period. The analysis shows that, as a group, Democratic and Republican state platforms exhibit greater intra-party homogeneity and inter-party heterogeneity starting in the early 1990s, and state-national platform similarity is higher in states that are key players in presidential elections, among other factors. 
Together, these three studies demonstrate the significance of the state party organizations and the state-national party partnership in contemporary politics.
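One common document similarity measure is the cosine similarity between term-count vectors, sketched below for comparing a state platform with a national platform. The abstract does not say which measures or preprocessing steps the dissertation uses, so the tokenization, the function names, and the toy texts are assumptions.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    a, b = term_counts(doc_a), term_counts(doc_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as sharing nothing
    return dot / (norm_a * norm_b)

# Hypothetical platform snippets, for illustration only.
national = "lower taxes and more jobs"
state = "more jobs and better schools"
score = cosine_similarity(national, state)
```

Scores near 1 indicate platforms with nearly the same vocabulary in the same proportions, and scores near 0 indicate little shared vocabulary. Real platform texts would usually need stop-word removal or term weighting before the scores become informative.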
Abstract:
Background
Postpartum hemorrhage is the most significant contributor to maternal mortality globally, claiming 140,000 lives annually. Postpartum hemorrhage is a leading cause of maternal death in South Africa, with the literature indicating that 80 percent of the postpartum hemorrhage deaths in South Africa are avoidable. Ghana, as of 2010, witnesses 2700 maternal deaths annually, primarily because of poor quality of care in health facilities and services being difficult to access. As per WHO recommendations, uterotonics are integral to treating postpartum hemorrhage as soon as it is diagnosed. In case of persistent bleeding or limited availability of uterotonics, the uterine balloon tamponade (UBT) can be used as a second line of defense. If both these measures are unable to counter the bleeding, providers must perform surgical interventions. Literature on the UBT, as one tool in the protocol to address postpartum hemorrhage, has shown it to have success rates ranging from 60 to 100 percent. Despite the potential to lower the number of postpartum hemorrhage deaths in South Africa and Ghana, the UBT has not been incorporated widely in South Africa and Ghana. The aim of this study is to describe the barriers involved with integrating the UBT into South Africa and Ghana’s health systems to address postpartum hemorrhage.
Methods
The study took place in multiple sites in South Africa (Cape Town, Johannesburg, Durban and Mpumalanga) and in Accra, Ghana. South Africa and Ghana were selected because postpartum hemorrhage contributes greatly to their maternal mortality numbers and there is potential in both countries to lower those rates through greater use of the UBT. A total of 25 participants were recruited through purposive sampling, snowball sampling and participant referrals, and included various categories of stakeholders integral to the integration process of a medical device. Individual in-depth interviews were used for data collection, with interview questions being tailored to each stakeholder category. The focus of the interviews was on the protocol used to counter postpartum hemorrhage, the frequency with which the UBT is used as part of the protocol, and the process of integrating it into South Africa's and Ghana's health systems. The data collected were coded using NVivo and analyzed using content analysis.
Results
The barriers to integration of the uterine balloon tamponade to address postpartum hemorrhage in South Africa and Ghana were evident on the political, economic and health delivery levels. The results indicated that the barriers to integration in South Africa included the low recognition of postpartum hemorrhage as a problem, the lack of clarity surrounding the role of the Medicines Control Council as a regulatory body for medical devices, and low awareness of the UBT as an intervention to control postpartum hemorrhage. The barriers in Ghana were the cash constraints experienced by the Ghana Health Services to fund medical devices, a heavy reliance on donors for funding, and the lack of consistent knowledge on processes involving clinical trials for new medical devices in Ghana.
Conclusion
Existing literature on methods to counter postpartum hemorrhage to reduce maternal mortality has focused on and emphasized the efficacy of the UBT. Despite overwhelming evidence supporting the use of the UBT, many health systems across the world, particularly low-income countries, do not have access to the device owing to numerous barriers in integrating the device into obstetric care. This study illustrates the need to focus on incorporating the UBT into health systems for greater availability to health workers and its use as standard of care. Ultimately, this study can be used as a stepping-stone for more research on this subject, providing evidence to influence policymakers to integrate the UBT into their protocols for postpartum hemorrhage response.