8 results for Data replication processes
at Duke University
Abstract:
BACKGROUND: Biological processes occur on a vast range of time scales, and many of them occur concurrently. As a result, system-wide measurements of gene expression have the potential to capture many of these processes simultaneously. The challenge, however, is to separate these processes and time scales in the data. In many cases the number of processes and their time scales are unknown. This issue is particularly relevant to developmental biologists, who are interested in processes such as growth, segmentation and differentiation, which can all take place simultaneously, but on different time scales. RESULTS: We introduce a flexible and statistically rigorous method for detecting different time scales in time-series gene expression data, by identifying expression patterns that are temporally shifted between replicate datasets. We apply our approach to a Saccharomyces cerevisiae cell-cycle dataset and an Arabidopsis thaliana root developmental dataset. In both datasets our method successfully detects processes operating on several different time scales. Furthermore, we show that many of these time scales can be associated with particular biological functions. CONCLUSIONS: The spatiotemporal modules identified by our method suggest the presence of multiple biological processes, acting at distinct time scales in both the Arabidopsis root and yeast. Using similar large-scale expression datasets, the identification of biological processes acting at multiple time scales in many organisms is now possible.
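As a rough illustration of what "temporally shifted between replicate datasets" can mean, the sketch below estimates the lag between two replicate expression profiles by maximizing their Pearson correlation over candidate shifts. This is a simple lagged-correlation stand-in, not the statistical method described in the abstract; the function names `pearson` and `best_lag` and the synthetic sine-wave profiles are assumptions made purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(rep_a, rep_b, max_lag):
    """Return the lag (in time points) at which rep_b best matches rep_a.

    A positive lag means the pattern appears later in rep_b than in rep_a.
    Both replicates are assumed to have equal length and even sampling.
    """
    n = len(rep_a)
    best, best_r = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            # Pair rep_a[t] with rep_b[t + lag].
            x, y = rep_a[: n - lag], rep_b[lag:]
        else:
            # Negative lag: rep_b leads, so drop the start of rep_a instead.
            x, y = rep_a[-lag:], rep_b[: n + lag]
        if len(x) < 2:
            continue  # not enough overlap to compute a correlation
        r = pearson(x, y)
        if r > best_r:
            best, best_r = lag, r
    return best


# Hypothetical profiles: replicate B lags replicate A by 3 time points.
times = range(40)
replicate_a = [math.sin(2 * math.pi * t / 10) for t in times]
replicate_b = [math.sin(2 * math.pi * (t - 3) / 10) for t in times]
shift = best_lag(replicate_a, replicate_b, max_lag=4)
```

In this toy setting `shift` recovers the 3-point delay built into the synthetic data. Real expression data would additionally need noise handling and a significance test, which this sketch omits.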
Abstract:
BACKGROUND: Blochmannia are obligately intracellular bacterial mutualists of ants of the tribe Camponotini. Blochmannia perform key nutritional functions for the host, including synthesis of several essential amino acids. We used Illumina technology to sequence the genome of Blochmannia associated with Camponotus vafer. RESULTS: Although Blochmannia vafer retains many nutritional functions, it is missing glutamine synthetase (glnA), a component of the nitrogen recycling pathway encoded by the previously sequenced B. floridanus and B. pennsylvanicus. With the exception of Ureaplasma, B. vafer is the only sequenced bacterium to date that encodes urease but lacks the ability to assimilate ammonia into glutamine or glutamate. Loss of glnA occurred in a deletion hotspot near the putative replication origin. Overall, compared to the likely gene set of their common ancestor, 31 genes are missing or eroded in B. vafer, compared to 28 in B. floridanus and four in B. pennsylvanicus. Three genes (queA, visC and yggS) show convergent loss or erosion, suggesting relaxed selection for their functions. Eight B. vafer genes contain frameshifts in homopolymeric tracts that may be corrected by transcriptional slippage. Two of these encode DNA replication proteins: dnaX, which we infer is also frameshifted in B. floridanus, and dnaG. CONCLUSIONS: Comparing the B. vafer genome with B. pennsylvanicus and B. floridanus refines the core genes shared within the mutualist group, thereby clarifying functions required across ant host species. This third genome also allows us to track gene loss and erosion in a phylogenetic context to more fully understand processes of genome reduction.
Abstract:
BACKGROUND: Historically, only partial assessments of data quality have been performed in clinical trials, for which the most common method of measuring database error rates has been to compare the case report form (CRF) to database entries and count discrepancies. Importantly, errors arising from medical record abstraction and transcription are rarely evaluated as part of such quality assessments. Electronic Data Capture (EDC) technology has had a further impact, as paper CRFs typically leveraged for quality measurement are not used in EDC processes. METHODS AND PRINCIPAL FINDINGS: The National Institute on Drug Abuse Treatment Clinical Trials Network has developed, implemented, and evaluated methodology for holistically assessing data quality on EDC trials. We characterize the average source-to-database error rate (14.3 errors per 10,000 fields) for the first year of use of the new evaluation method. This error rate was significantly lower than the average of published error rates for source-to-database audits, and was similar to CRF-to-database error rates reported in the published literature. We attribute this largely to an absence of medical record abstraction on the trials we examined, and to an outpatient setting characterized by less acute patient conditions. CONCLUSIONS: Historically, medical record abstraction is the most significant source of error by an order of magnitude, and should be measured and managed during the course of clinical trials. Source-to-database error rates are highly dependent on the amount of structured data collection in the clinical setting and on the complexity of the medical record, dependencies that should be considered when developing data quality benchmarks.
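The error-rate unit used above (errors per 10,000 fields) is a simple normalization of discrepancy counts by the number of fields inspected. A minimal sketch follows; the function name and the audit counts are hypothetical and were chosen only so that the arithmetic reproduces the reported 14.3 figure. They are not the study's actual counts.

```python
def errors_per_10k_fields(error_count, fields_inspected):
    """Normalize a count of discrepancies to errors per 10,000 fields inspected."""
    if fields_inspected <= 0:
        raise ValueError("fields_inspected must be positive")
    return error_count * 10_000 / fields_inspected


# Hypothetical audit: 143 discrepancies found in 100,000 inspected fields.
rate = errors_per_10k_fields(error_count=143, fields_inspected=100_000)
print(f"{rate:.1f} errors per 10,000 fields")  # prints "14.3 errors per 10,000 fields"
```

Normalizing by fields inspected is what makes rates comparable across audits of different sizes.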
Abstract:
BACKGROUND: The inherent complexity of statistical methods and clinical phenomena compels researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them has complete knowledge of their counterparts' fields. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Although communication has a central role in interdisciplinary collaboration and miscommunication can have a negative impact on research processes, to the best of our knowledge no study has yet explored how data analysis specialists and clinical researchers communicate over time. METHODS/PRINCIPAL FINDINGS: We conducted qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of "what if" situations that helped clarify how the method or information from the other field would behave if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal. CONCLUSION/SIGNIFICANCE: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Abstract:
Adrenergic receptors are prototypic models for the study of the relations between structure and function of G protein-coupled receptors. Each receptor is encoded by a distinct gene. These receptors are integral membrane proteins with several striking structural features. They consist of a single subunit containing seven stretches of 20-28 hydrophobic amino acids that represent potential membrane-spanning alpha-helixes. Many of these receptors share considerable amino acid sequence homology, particularly in the transmembrane domains. All of these macromolecules share other similarities that include one or more potential sites of extracellular N-linked glycosylation near the amino terminus and several potential sites of regulatory phosphorylation that are located intracellularly. By using a variety of techniques, it has been demonstrated that various regions of the receptor molecules are critical for different receptor functions. The seven transmembrane regions of the receptors appear to form a ligand-binding pocket. Cysteine residues in the extracellular domains may stabilize the ligand-binding pocket by participating in disulfide bonds. The cytoplasmic domains contain regions capable of interacting with G proteins and various kinases and are therefore important in such processes as signal transduction, receptor-G protein coupling, receptor sequestration, and down-regulation. Finally, regions of these macromolecules may undergo posttranslational modifications important in the regulation of receptor function. Our understanding of these complex relations is constantly evolving and much work remains to be done. Greater understanding of the basic mechanisms involved in G protein-coupled, receptor-mediated signal transduction may provide leads into the nature of certain pathophysiological states.
Abstract:
An enterprise information system (EIS) is an integrated data-applications platform characterized by diverse, heterogeneous, and distributed data sources. For many enterprises, a number of business processes still depend heavily on static rule-based methods and extensive human expertise. Enterprises are faced with the need for optimizing operation scheduling, improving resource utilization, discovering useful knowledge, and making data-driven decisions.
This thesis research is focused on real-time optimization and knowledge discovery that addresses workflow optimization, resource allocation, as well as data-driven predictions of process-execution times, order fulfillment, and enterprise service-level performance. In contrast to prior work on data analytics techniques for enterprise performance optimization, the emphasis here is on realizing scalable and real-time enterprise intelligence based on a combination of heterogeneous system simulation, combinatorial optimization, machine-learning algorithms, and statistical methods.
On-demand digital-print service is a representative enterprise requiring a powerful EIS. We use real-life data from Reischling Press, Inc. (RPI), a digital-print-service provider (PSP), to evaluate our optimization algorithms.
In order to handle the increase in volume and diversity of demands, we first present a high-performance, scalable, and real-time production scheduling algorithm for production automation based on an incremental genetic algorithm (IGA). The objective of this algorithm is to optimize the order dispatching sequence and balance resource utilization. Compared to prior work, this solution is scalable for a high volume of orders and it provides fast scheduling solutions for orders that require complex fulfillment procedures. Experimental results highlight its potential benefit in reducing production inefficiencies and enhancing the productivity of an enterprise.
We next discuss analysis and prediction of different attributes involved in hierarchical components of an enterprise. We start from a study of the fundamental processes related to real-time prediction. Our process-execution time and process status prediction models integrate statistical methods with machine-learning algorithms. In addition to improving prediction accuracy over stand-alone machine-learning algorithms, they also provide a probabilistic estimate of the predicted status. An order generally consists of multiple series and parallel processes. We next introduce an order-fulfillment prediction model that combines advantages of multiple classification models by incorporating flexible decision-integration mechanisms. Experimental results show that adopting due dates recommended by the model can significantly reduce the enterprise late-delivery ratio. Finally, we investigate service-level attributes that reflect the overall performance of an enterprise. We analyze and decompose time-series data into different components according to their hierarchical periodic nature, perform correlation analysis, and develop univariate prediction models for each component as well as multivariate models for correlated components. Predictions for the original time series are aggregated from the predictions of its components. In addition to a significant increase in mid-term prediction accuracy, this distributed modeling strategy also improves short-term time-series prediction accuracy.
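The decompose-predict-aggregate idea described above can be sketched in a few lines. The example below is a deliberately simplified stand-in, not the thesis's models: it splits a series into a constant level, a single periodic component, and a residual, forecasts each component naively, and sums the component forecasts. The function names, the single fixed `period`, and the sample data are all assumptions for illustration.

```python
def decompose(series, period):
    """Split a series into a constant level, a periodic component, and residuals.

    Assumes len(series) >= period so every phase has at least one observation.
    """
    n = len(series)
    level = sum(series) / n
    detrended = [value - level for value in series]
    # Periodic component: average deviation from the level at each phase of the cycle.
    seasonal = []
    for phase in range(period):
        phase_values = detrended[phase::period]
        seasonal.append(sum(phase_values) / len(phase_values))
    residual = [detrended[i] - seasonal[i % period] for i in range(n)]
    return level, seasonal, residual


def forecast(series, period, horizon):
    """Forecast each component separately, then aggregate by summing."""
    level, seasonal, residual = decompose(series, period)
    n = len(series)
    residual_forecast = sum(residual) / n  # naive: residual predicted by its mean
    return [
        level + seasonal[(n + step) % period] + residual_forecast
        for step in range(horizon)
    ]


# Hypothetical service-level series with a cycle of length 3.
history = [10, 12, 14, 10, 12, 14]
next_values = forecast(history, period=3, horizon=3)
```

For this perfectly periodic toy series the aggregated forecast simply continues the cycle. With several nested periods, the same pattern would be applied per component, with a richer model substituted for each naive component forecast.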
In summary, this thesis research has led to a set of characterization, optimization, and prediction tools that allow an EIS to derive insightful knowledge from data and use it as guidance for production management. It is expected to provide solutions for enterprises to increase reconfigurability, accomplish more automated procedures, and obtain data-driven recommendations or effective decisions.
Abstract:
Cells have evolved oscillators with different frequencies to coordinate periodic processes. Here we studied the interaction of two oscillators, the cell division cycle (CDC) and the yeast metabolic cycle (YMC), in budding yeast. Previous work suggested that the CDC and YMC interact to separate high oxygen consumption (HOC) from DNA replication to prevent genetic damage. To test this hypothesis, we grew diverse strains in chemostat and measured DNA replication and oxygen consumption with high temporal resolution at different growth rates. Our data showed that HOC is not strictly separated from DNA replication; rather, cell cycle Start is coupled with the initiation of HOC and catabolism of storage carbohydrates. The logic of this YMC-CDC coupling may be to ensure that DNA replication and cell division occur only when sufficient cellular energy reserves have accumulated. Our results also uncovered a quantitative relationship between CDC period and YMC period across different strains. More generally, our approach shows how studies in genetically diverse strains efficiently identify robust phenotypes and steer the experimentalist away from strain-specific idiosyncrasies.
Abstract:
© 2016 Burnetti et al. Cells have evolved oscillators with different frequencies to coordinate periodic processes. Here we studied the interaction of two oscillators, the cell division cycle (CDC) and the yeast metabolic cycle (YMC), in budding yeast. Previous work suggested that the CDC and YMC interact to separate high oxygen consumption (HOC) from DNA replication to prevent genetic damage. To test this hypothesis, we grew diverse strains in chemostat and measured DNA replication and oxygen consumption with high temporal resolution at different growth rates. Our data showed that HOC is not strictly separated from DNA replication; rather, cell cycle Start is coupled with the initiation of HOC and catabolism of storage carbohydrates. The logic of this YMC-CDC coupling may be to ensure that DNA replication and cell division occur only when sufficient cellular energy reserves have accumulated. Our results also uncovered a quantitative relationship between CDC period and YMC period across different strains. More generally, our approach shows how studies in genetically diverse strains efficiently identify robust phenotypes and steer the experimentalist away from strain-specific idiosyncrasies.