17 results for Data exchange formats
at Université de Lausanne, Switzerland
Abstract:
The HUPO Proteomics Standards Initiative has developed several standardized data formats to facilitate data sharing in mass spectrometry (MS)-based proteomics. These allow researchers to report their complete results in a unified way. However, at present, there is no format to describe the final qualitative and quantitative results for proteomics and metabolomics experiments in a simple tabular format. Many downstream analysis use cases are only concerned with the final results of an experiment and require an easily accessible format, compatible with tools such as Microsoft Excel or R. We developed the mzTab file format for MS-based proteomics and metabolomics results to meet this need. mzTab is intended as a lightweight supplement to the existing standard XML-based file formats (mzML, mzIdentML, mzQuantML), providing a comprehensive summary, similar in concept to the supplemental material of a scientific publication. mzTab files can contain protein, peptide, and small molecule identifications together with experimental metadata and basic quantitative information. The format is not intended to store the complete experimental evidence but provides mechanisms to report results at different levels of detail. These range from a simple summary of the final results to a representation of the results including the experimental design. This format is ideally suited to make MS-based proteomics and metabolomics results available to a wider biological community outside the field of MS. Several software tools for proteomics and metabolomics have already adopted the format as an output format. The comprehensive mzTab specification document and extensive additional documentation can be found online.
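mzTab is a line-oriented, tab-separated format, so it can be inspected with very little code. The Python sketch below groups the lines of an mzTab file by their section prefix (MTD for metadata, PRH/PRT, PEH/PEP and SMH/SML for the protein, peptide and small-molecule tables, COM for comments, as defined in the mzTab specification); the file name and the usage shown in the comments are illustrative assumptions, not part of the published documentation.

```python
import csv
from collections import defaultdict

def read_mztab(path):
    """Group the tab-separated lines of an mzTab file by section prefix."""
    metadata = {}                                   # MTD key -> value
    tables = defaultdict(lambda: {"header": None, "rows": []})
    header_to_row = {"PRH": "PRT", "PEH": "PEP", "SMH": "SML"}

    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if not fields or fields[0] == "COM":
                continue                            # skip blank and comment lines
            prefix = fields[0]
            if prefix == "MTD":
                metadata[fields[1]] = fields[2] if len(fields) > 2 else ""
            elif prefix in header_to_row:           # table header line
                tables[header_to_row[prefix]]["header"] = fields[1:]
            elif prefix in ("PRT", "PEP", "SML"):   # table data line
                tables[prefix]["rows"].append(fields[1:])
    return metadata, tables

# Hypothetical usage:
# meta, tables = read_mztab("results.mzTab")
# print(meta.get("mzTab-version"), len(tables["PRT"]["rows"]), "protein rows")
```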
Abstract:
BACKGROUND: Molecular interaction information is a key resource in modern biomedical research. Publicly available data have previously been provided in a broad array of diverse formats, making access to them very difficult. The publication and wide implementation of the Human Proteome Organisation Proteomics Standards Initiative Molecular Interactions (HUPO PSI-MI) format in 2004 was a major step towards the establishment of a single, unified format by which molecular interactions should be presented, but focused purely on protein-protein interactions. RESULTS: The HUPO-PSI has further developed the PSI-MI XML schema to enable the description of interactions between a wider range of molecular types, for example nucleic acids, chemical entities, and molecular complexes. Extensive details about each supported molecular interaction can now be captured, including the biological role of each molecule within that interaction, detailed description of interacting domains, and the kinetic parameters of the interaction. The format is supported by data management and analysis tools and has been adopted by major interaction data providers. Additionally, a simpler, tab-delimited format, MITAB2.5, has been developed for the benefit of users who require only minimal information in an easy-to-access configuration. CONCLUSION: The PSI-MI XML2.5 and MITAB2.5 formats have been jointly developed by interaction data producers and providers from both the academic and commercial sectors, and are already widely implemented and well supported by an active development community. PSI-MI XML2.5 enables the description of highly detailed molecular interaction data and facilitates data exchange between databases and users without loss of information. MITAB2.5 is a simpler format appropriate for fast Perl parsing or loading into Microsoft Excel.
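Because MITAB2.5 is a flat, tab-delimited format with 15 fixed columns, it can be consumed without any XML tooling. The Python sketch below reads such a file into dictionaries; the shortened column names are paraphrases of the columns defined by the MITAB2.5 specification, and the file name in the usage comment is an illustrative assumption.

```python
import csv

# The 15 tab-separated MITAB2.5 columns, in order (names shortened here).
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication", "taxid_a", "taxid_b",
    "interaction_type", "source_db", "interaction_id", "confidence",
]

def read_mitab25(path):
    """Yield one dict per binary interaction in a MITAB2.5 file."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue                  # skip blank lines and header/comment lines
            yield dict(zip(MITAB25_COLUMNS, row[:15]))

# Hypothetical usage:
# for interaction in read_mitab25("interactions.mitab25.txt"):
#     print(interaction["id_a"], interaction["id_b"], interaction["detection_method"])
```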
Abstract:
OBJECTIVE: Quality assurance (QA) in clinical trials is essential to ensure treatment is safely and effectively delivered. As QA requirements have increased in complexity in parallel with evolution of radiation therapy (RT) delivery, a need to facilitate digital data exchange emerged. Our objective is to present the platform developed for the integration and standardization of QART activities across all EORTC trials involving RT. METHODS: The following essential requirements were identified: secure and easy access without on-site software installation; integration within the existing EORTC clinical remote data capture system; and the ability to both customize the platform to specific studies and adapt to future needs. After retrospective testing within several clinical trials, the platform was introduced in phases to participating sites and QART study reviewers. RESULTS: The resulting QA platform, integrating RT analysis software installed at EORTC Headquarters, permits timely, secure, and fully digital central DICOM-RT based data review. Participating sites submit data through a standard secure upload webpage. Supplemental information is submitted in parallel through web-based forms. An internal quality check by the QART office verifies data consistency, formatting, and anonymization. QART reviewers have remote access through a terminal server. Reviewers evaluate submissions for protocol compliance through an online evaluation matrix. Comments are collected by the coordinating centre and institutions are informed of the results. CONCLUSIONS: This web-based central review platform facilitates rapid, extensive, and prospective QART review. This reduces the risk that trial outcomes are compromised through inadequate radiotherapy and facilitates correlation of results with clinical outcomes.
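The internal quality check described above verifies, among other things, that submitted DICOM-RT data are properly anonymized. The sketch below shows what a minimal anonymization spot check could look like using the third-party pydicom library; the tag list and file name are assumptions for illustration, and this is not the EORTC platform's actual implementation.

```python
import pydicom  # third-party: pip install pydicom

# Tags a basic anonymization spot check might expect to be empty or removed.
IDENTIFYING_TAGS = ["PatientName", "PatientID", "PatientBirthDate", "OtherPatientIDs"]

def check_anonymization(path):
    """Return the identifying tags still populated in one DICOM(-RT) file."""
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    return [tag for tag in IDENTIFYING_TAGS if str(ds.get(tag, "") or "").strip()]

# Hypothetical usage:
# leftovers = check_anonymization("RTPLAN_submission.dcm")
# print("anonymization issues:", leftovers or "none")
```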
Abstract:
A population register is an inventory of residents within a country, with their characteristics (date of birth, sex, marital status, etc.) and other socio-economic data, such as occupation or education. However, data on population are also stored in numerous other public registers such as tax, land, building and housing, military, foreigners, vehicles, etc. Altogether they contain vast amounts of personal and sensitive information. Access to public information is granted by law in many countries, but this transparency is generally subject to tensions with data protection laws. This paper proposes a framework to analyze data access (or protection) requirements, as well as a model of metadata for data exchange.
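As a purely illustrative sketch of what per-attribute exchange metadata of the kind proposed in the abstract might record, and not the model actually defined in the paper, the following Python data class is one possible shape; every field name here is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ExchangeMetadata:
    """Illustrative metadata describing one attribute exchanged between registers."""
    attribute: str                       # e.g. "date_of_birth"
    source_register: str                 # register holding the authoritative value
    legal_basis: str                     # statute or regulation authorizing the exchange
    sensitivity: str                     # e.g. "public", "personal", "sensitive"
    allowed_recipients: list[str] = field(default_factory=list)
    retention_days: int = 0              # 0 = keep only for the duration of the exchange

# Hypothetical record:
dob = ExchangeMetadata("date_of_birth", "population register",
                       "register harmonisation act (illustrative)",
                       "personal", ["statistics office"], 365)
```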
Abstract:
In June 2006, the Swiss Parliament made two important decisions with regard to public registers' governance and individuals' identification. It adopted a new law on the harmonisation of population registers in order to simplify statistical data collection and data exchange from around 4'000 decentralized registers, and it also approved the introduction of a Unique Person Identifier (UPI). The law is rather vague about the implementation of this harmonisation, and even though many projects are currently being undertaken in this domain, most of them are quite technical. We believe there is a need for analysis tools and therefore we propose a conceptual framework based on three pillars (Privacy, Identity and Governance) to analyse the requirements in terms of data management for population registers.
Abstract:
Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, their format is often the first obstacle. The lack of standardized ways of exploring different data layouts requires an effort each time to solve the problem from scratch. The possibility of accessing data in a rich, uniform manner, e.g. using Structured Query Language (SQL), would offer expressiveness and user-friendliness. Comma-separated values (CSV) files are one of the most common data storage formats. Despite the format's simplicity, handling it becomes non-trivial as file size grows. Importing CSVs into existing databases is time-consuming and troublesome, or even impossible if their horizontal dimension reaches thousands of columns. Most databases are optimized for handling a large number of rows rather than columns, so performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It is characterized by: a "no copy" approach - data stay mostly in the CSV files; "zero configuration" - no need to specify a database schema; written in C++, with boost [1], SQLite [2] and Qt [3], it does not require installation and has a very small size; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files ensure efficient plan execution; effortless support for millions of columns; due to per-value typing, using mixed text/number data is easy; a very simple network protocol provides an efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware along with educational videos on its website [4]. It does not need any prerequisites to run, as all of the libraries are included in the distribution package. I test it against existing database solutions using a battery of benchmarks and discuss the results.
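Two of the features listed above, "zero configuration" (the schema is inferred from the file itself) and dynamic creation of indices for queried columns, can be illustrated with a short Python/SQLite sketch. Unlike the C++ system described in the abstract, this sketch copies the data into SQLite rather than leaving it in the CSV file, and the file, table and column names are illustrative assumptions.

```python
import csv
import sqlite3

def load_csv(path, con, table="data"):
    """Infer the table schema from the CSV header and load the rows ("zero configuration")."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        columns = next(reader)
        col_list = ", ".join(f'"{c}"' for c in columns)
        con.execute(f'CREATE TABLE "{table}" ({col_list})')
        placeholders = ", ".join("?" for _ in columns)
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)

def ensure_index(con, table, column):
    """Create an index the first time a query filters on a given column."""
    con.execute(f'CREATE INDEX IF NOT EXISTS "idx_{column}" ON "{table}" ("{column}")')

# Hypothetical usage:
# con = sqlite3.connect(":memory:")
# load_csv("measurements.csv", con)
# ensure_index(con, "data", "sample_id")
# rows = con.execute('SELECT * FROM data WHERE "sample_id" = ?', ("S1",)).fetchall()
```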
Abstract:
Following protection measures implemented since the 1970s, large carnivores are currently increasing in number and returning to areas from which they were absent for decades or even centuries. Monitoring programmes for these species rely extensively on non-invasive sampling and genotyping. However, attempts to connect results of such studies at larger spatial or temporal scales often suffer from the incompatibility of genetic markers implemented by researchers in different laboratories. This is particularly critical for long-distance dispersers, revealing the need for harmonized monitoring schemes that would enable the understanding of gene flow and dispersal dynamics. Based on a review of genetic studies on grey wolves Canis lupus from Europe, we provide an overview of the genetic markers currently in use, and identify opportunities and hurdles for studies based on continent-scale datasets. Our results highlight an urgent need for harmonization of methods to enable transnational research based on data that have already been collected, and to allow these data to be linked to material collected in the future. We suggest timely standardization of newly developed genotyping approaches, and propose that action is directed towards the establishment of shared single nucleotide polymorphism panels, next-generation sequencing of microsatellites, a common reference sample collection and an online database for data exchange. Enhanced cooperation among genetic researchers dealing with large carnivores in consortia would facilitate streamlining of methods, their faster and wider adoption, and production of results at the large spatial scales that ultimately matter for the conservation of these charismatic species.
Abstract:
The International Molecular Exchange (IMEx) consortium is an international collaboration between major public interaction data providers to share literature-curation efforts and make a nonredundant set of protein interactions available in a single search interface on a common website (http://www.imexconsortium.org/). Common curation rules have been developed, and a central registry is used to manage the selection of articles to enter into the dataset. We discuss the advantages of such a service to the user, our quality-control measures and our data-distribution practices.
Abstract:
Indirect calorimetry based on respiratory exchange measurement has been successfully used since the beginning of the century to obtain an estimate of heat production (energy expenditure) in human subjects and animals. The errors inherent in this classical technique can stem from various sources: 1) the model of calculation and its assumptions, 2) the calorimetric factors used, 3) technical factors and 4) human factors. The physiological and biochemical factors influencing the interpretation of calorimetric data include a change in the size of the bicarbonate and urea pools and the accumulation or loss (via breath, urine or sweat) of intermediary metabolites (gluconeogenesis, ketogenesis). More recently, respiratory gas exchange data have been used to estimate substrate utilization rates in various physiological and metabolic situations (fasting, post-prandial state, etc.). It should be recalled that indirect calorimetry provides an index of overall substrate disappearance rates, which is incorrectly assumed to be equivalent to substrate "oxidation" rates. Unfortunately, there is no adequate gold standard to validate whole-body substrate "oxidation" rates, in contrast to the "validation" of heat production by indirect calorimetry through the use of direct calorimetry under strict thermal equilibrium conditions. Tracer techniques using stable (or radioactive) isotopes represent an independent way of assessing substrate utilization rates. When carbohydrate metabolism is measured with both techniques, indirect calorimetry generally provides glucose "oxidation" rates consistent with isotopic tracers, but only when certain metabolic processes (such as gluconeogenesis and lipogenesis) are minimal and/or when the respiratory quotients are not at the extremes of the physiological range. However, it is believed that the tracer techniques underestimate true glucose "oxidation" rates owing to the failure to account for glycogenolysis in the tissues storing glucose, since this escapes the systemic circulation. A major advantage of isotopic techniques is that they are able to estimate (given certain assumptions) various metabolic processes (such as gluconeogenesis) in a noninvasive way. Furthermore, when a fourth substrate (such as ethanol) is administered in addition to the three macronutrients, isotopic quantification of substrate "oxidation" allows one to eliminate the inherent assumptions made by indirect calorimetry. In conclusion, isotopic tracer techniques and indirect calorimetry should be considered complementary techniques, in particular since the tracer techniques require the measurement of carbon dioxide production obtained by indirect calorimetry. However, it should be kept in mind that the assessment of substrate oxidation by indirect calorimetry may involve large errors, particularly over short periods of time. By indirect calorimetry, energy expenditure (heat production) is calculated with substantially less error than substrate oxidation rates.
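The "model of calculation" and "calorimetric factors" mentioned above can be made concrete with one commonly cited set of equations: the Weir equation for energy expenditure and Frayn's (1983) equations for net carbohydrate and fat oxidation. The sketch below uses these published factors as an illustration; they are not necessarily the ones used by the authors, and the example inputs are invented.

```python
def indirect_calorimetry(vo2_l_min, vco2_l_min, urinary_n_g_min=0.0):
    """Estimate energy expenditure and net substrate disappearance rates.

    Inputs: oxygen consumption and CO2 production in L/min, urinary nitrogen in g/min.
    Uses the Weir equation (energy expenditure) and Frayn's equations (net
    carbohydrate and fat oxidation); other factor sets give slightly different results.
    """
    ee_kcal_min = 3.941 * vo2_l_min + 1.106 * vco2_l_min - 2.17 * urinary_n_g_min
    cho_g_min = 4.55 * vco2_l_min - 3.21 * vo2_l_min - 2.87 * urinary_n_g_min
    fat_g_min = 1.67 * (vo2_l_min - vco2_l_min) - 1.92 * urinary_n_g_min
    protein_g_min = 6.25 * urinary_n_g_min
    return {"EE_kcal_min": ee_kcal_min, "CHO_g_min": cho_g_min,
            "fat_g_min": fat_g_min, "protein_g_min": protein_g_min}

# Invented example: VO2 = 0.25 L/min, VCO2 = 0.20 L/min, urinary N = 0.008 g/min
# print(indirect_calorimetry(0.25, 0.20, 0.008))
```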
Abstract:
We demonstrate that the step of DNA strand exchange during the RecA-mediated recombination reaction can occur equally efficiently in the presence or absence of ATP hydrolysis. The polarity of strand exchange is the same when the non-hydrolyzable ATP analog adenosine-5'-O-(3-thiotriphosphate) is used instead of ATP. We show that the ATP dependence of the recombination reaction is limited to the post-exchange stages of the reaction. The low DNA affinity state of RecA protomers, induced after ATP hydrolysis, is necessary for the dissociation of RecA-DNA complexes at the end of the reaction. This dissociation of RecA from DNA is necessary for the release of recombinant DNA molecules from the complexes formed with RecA and for the recycling of RecA protomers for another round of the recombination reaction.
Abstract:
Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application. Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third 'Quest for Orthologs' meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes. The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking. AVAILABILITY AND IMPLEMENTATION: All such materials are available at http://questfororthologs.org. CONTACT: erik.sonnhammer@scilifelab.se or c.dessimoz@ucl.ac.uk.
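The remark that most orthology inference algorithms scale at least quadratically with the number of proteomes follows directly from the all-vs-all comparison step: the number of proteome pairs grows as n(n-1)/2. The toy calculation below is only an illustration of that scaling, not any provider's actual pipeline.

```python
def all_vs_all_pairs(n_proteomes: int) -> int:
    """Number of proteome pairs an all-vs-all orthology comparison must consider."""
    return n_proteomes * (n_proteomes - 1) // 2

# Doubling the number of complete proteomes roughly quadruples the work:
# all_vs_all_pairs(1000)  ->   499_500 pairs
# all_vs_all_pairs(2000)  -> 1_999_000 pairs
```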
Abstract:
The role of ATP hydrolysis during the RecA-mediated recombination reaction is addressed in this paper. Recent studies indicated that the RecA-promoted DNA strand exchange between completely homologous double- and single-stranded DNA can be very efficient in the absence of ATP hydrolysis. In this work we demonstrate that the energy derived from the ATP hydrolysis is strictly needed to drive the DNA strand exchange through the regions where the interacting DNA molecules are not in a homologous register. Therefore, in addition to the role of the ATP hydrolysis in promoting the dissociation of RecA from the products of the recombination reaction, as described earlier, ATP hydrolysis also plays a crucial role in the actual process of strand exchange, provided that the lack of homologous register obstructs the process of branch migration.
Abstract:
This thesis proposes a set of adaptive broadcast solutions and an adaptive data replication solution to support the deployment of P2P applications. P2P applications are an emerging type of distributed application that runs on top of P2P networks. Typical P2P applications are video streaming, file sharing, etc. While interesting because they are fully distributed, P2P applications suffer from several deployment problems due to the nature of the environment in which they run. Indeed, defining an application on top of a P2P network often means defining an application where peers contribute resources in exchange for their ability to use the P2P application. For example, in a P2P file-sharing application, while the user is downloading some file, the P2P application is in parallel serving that file to other users. Such peers could have limited hardware resources, e.g., CPU, bandwidth and memory, or the end user could decide a priori to limit the resources dedicated to the P2P application. In addition, a P2P network is typically immersed in an unreliable environment, where communication links and processes are subject to message losses and crashes, respectively. To support P2P applications, this thesis proposes a set of services that address some underlying constraints related to the nature of P2P networks. The proposed services include a set of adaptive broadcast solutions and an adaptive data replication solution that can be used as the basis of several P2P applications. Our data replication solution makes it possible to increase availability and to reduce the communication overhead. The broadcast solutions aim at providing a communication substrate encapsulating one of the key communication paradigms used by P2P applications: broadcast. Our broadcast solutions typically aim at offering reliability and scalability to some upper layer, be it an end-to-end P2P application or another system-level layer, such as a data replication layer. Our contributions are organized in a protocol stack made of three layers. In each layer, we propose a set of adaptive protocols that address specific constraints imposed by the environment. Each protocol is evaluated through a set of simulations. The adaptiveness of our solutions relies on the fact that they take into account the constraints of the underlying system in a proactive manner. To model these constraints, we define an environment approximation algorithm allowing us to obtain an approximated view of the system or part of it. This approximated view includes the topology and the reliability of the components, expressed in probabilistic terms. To adapt to the underlying system constraints, the proposed broadcast solutions route messages through tree overlays chosen to maximize broadcast reliability. Here, broadcast reliability is expressed as a function of the reliability of the selected paths and of the use of available resources. These resources are modeled in terms of quotas of messages reflecting the receiving and sending capacities at each node. To allow deployment in a large-scale system, we take into account the available memory at processes by limiting the view they have to maintain about the system. Using this partial view, we propose three scalable broadcast algorithms, which are based on a propagation overlay that tends towards the global tree overlay and adapts to some constraints of the underlying system.
At a higher level, this thesis also proposes a data replication solution that is adaptive both in terms of replica placement and in terms of request routing. At the routing level, this solution takes the unreliability of the environment into account in order to maximize reliable delivery of requests. At the replica placement level, the dynamically changing origin and frequency of read/write requests are analyzed in order to define a set of replicas that minimizes the communication cost.
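The broadcast layer described above routes messages over overlays chosen to maximize reliability, expressed as a function of the reliability of the selected paths. One standard way to compute most-reliable paths is a Dijkstra-style search on link weights of -log(reliability), sketched below; this illustrates the underlying idea only, not the adaptive tree-overlay protocols of the thesis, and the example overlay is invented.

```python
import heapq
import math

def most_reliable_paths(links, source):
    """Single-source most-reliable path reliabilities over a P2P overlay.

    `links` maps a node to (neighbour, link_reliability) pairs with reliability
    in (0, 1]. Maximizing the product of link reliabilities equals minimizing
    the sum of -log(reliability), hence the Dijkstra-style search.
    """
    cost = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > cost.get(node, math.inf):
            continue                                  # stale queue entry
        for neighbour, reliability in links.get(node, []):
            nd = d - math.log(reliability)
            if nd < cost.get(neighbour, math.inf):
                cost[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return {node: math.exp(-d) for node, d in cost.items()}

# Invented example overlay:
# links = {"a": [("b", 0.9), ("c", 0.6)], "b": [("c", 0.8)], "c": []}
# most_reliable_paths(links, "a")   # {'a': 1.0, 'b': 0.9, 'c': 0.72}
```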
Abstract:
Aims: A rapid and simple HPLC-MS method was developed for the simultaneous determination of antidementia drugs, including donepezil, galantamine, rivastigmine and its major metabolite NAP 226-90, and memantine, for Therapeutic Drug Monitoring (TDM). In the elderly population treated with antidementia drugs, the presence of several comorbidities, drug interactions resulting from polypharmacy, and variations in drug metabolism and elimination are possible factors leading to the observed high interindividual variability in plasma levels. Although evidence for the benefit of TDM for antidementia drugs still remains to be demonstrated, an individually adapted dosage through TDM might contribute to minimizing the risk of adverse reactions and to increasing the probability of an efficient therapeutic response. Methods: A solid-phase extraction procedure with a mixed-mode cation exchange sorbent was used to isolate the drugs from 0.5 mL of plasma. The compounds were analyzed on a reverse-phase column with a gradient elution consisting of an ammonium acetate buffer at pH 9.3 and acetonitrile, and detected by mass spectrometry in the single ion monitoring mode. Isotope-labeled internal standards were used for quantification where possible. The validated method was used to measure the plasma levels of antidementia drugs in 300 patients treated with these drugs. Results: The method was validated according to international standards of validation, including the assessment of trueness (-8 to 11%), imprecision (repeatability: 1-5%, intermediate imprecision: 2-9%), selectivity and matrix effects variability (less than 6%). Furthermore, short- and long-term stability of the analytes in plasma was ascertained. The method proved to be robust in the calibrated ranges of 1-300 ng/mL for rivastigmine and memantine and 2-300 ng/mL for donepezil, galantamine and NAP 226-90. We recently published a full description of the method (1). We found a high interindividual variability in the plasma levels of these drugs in a study population of 300 patients. The plasma level measurements, with some preliminary clinical and pharmacogenetic results, will be presented. Conclusion: A simple LC-MS method was developed for plasma level determination of antidementia drugs, which was successfully used in a clinical study with 300 patients.
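The validation figures quoted above (trueness, repeatability, intermediate imprecision) are conventionally derived from replicate quality-control measurements. The short sketch below computes relative bias and coefficient of variation for one QC level; it illustrates those definitions only, not the authors' validation procedure, and the replicate values are invented.

```python
import statistics

def trueness_and_imprecision(measured, nominal):
    """Relative bias (trueness, %) and coefficient of variation (imprecision, %)."""
    mean = statistics.mean(measured)
    bias_pct = 100.0 * (mean - nominal) / nominal
    cv_pct = 100.0 * statistics.stdev(measured) / mean
    return bias_pct, cv_pct

# Invented replicates for a 50 ng/mL QC sample:
# print(trueness_and_imprecision([47.9, 52.1, 49.5, 51.0, 48.8], 50.0))
```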
Abstract:
The emergence of powerful new technologies, the existence of large quantities of data, and increasing demands for the extraction of added value from these technologies and data have created a number of significant challenges for those charged with both corporate and information technology management. The possibilities are great, the expectations high, and the risks significant. Organisations seeking to employ cloud technologies and exploit the value of the data to which they have access, be this in the form of "Big Data" available from different external sources or data held within the organisation, in structured or unstructured formats, need to understand the risks involved in such activities. Data owners have responsibilities towards the subjects of the data and must also, frequently, demonstrate that they are in compliance with current standards, laws and regulations. This thesis sets out to explore the nature of the technologies that organisations might utilise, identify the most pertinent constraints and risks, and propose a framework for the management of data from discovery to external hosting that will allow the most significant risks to be managed through the definition, implementation, and performance of appropriate internal control activities.