8 results for Information Retrieval, Document Databases, Digital Libraries
at Université de Lausanne, Switzerland
Abstract:
BACKGROUND: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. RESULTS: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large-scale proteomics experiments. CONCLUSIONS: The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.
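To make the pattern-matching idea concrete, here is a minimal Python sketch of how a sentence might be scanned for a modification type and site, and how precision and recall could be computed against a hand-curated set. The trigger words, the site pattern and the function names `extract_ptms` and `precision_recall` are illustrative assumptions; the abstract does not describe the system's actual rules.

```python
import re

# Hypothetical trigger patterns for three PTM types (assumed for illustration;
# the rules used by the real system are not given in the abstract).
PTM_TRIGGERS = {
    "phosphorylation": r"phosphorylat\w*",
    "acetylation": r"acetylat\w*",
    "methylation": r"methylat\w*",
}

# A modification site written as a three-letter residue code plus a position,
# e.g. "Ser-15", "Lys 9" or "Tyr204".
SITE = re.compile(r"\b(Ser|Thr|Tyr|Lys|Arg)[- ]?(\d+)\b")


def extract_ptms(sentence):
    """Return (ptm_type, residue, position) tuples found in one sentence."""
    hits = []
    for ptm_type, pattern in PTM_TRIGGERS.items():
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            for residue, position in SITE.findall(sentence):
                hits.append((ptm_type, residue, int(position)))
    return hits


def precision_recall(predicted, gold):
    """Compute precision and recall of predicted items against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(extract_ptms("p53 is phosphorylated at Ser-15 by ATM."))
    print(extract_ptms("Histone H3 is acetylated on Lys 9 and Lys 14."))
```

A rule set this small would misfire on many real sentences (for example, it treats "deacetylated" as an acetylation trigger), which is one reason precision and recall are reported per modification type.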
Abstract:
Textual autocorrelation is a broad and pervasive concept, referring to the similarity between nearby textual units: lexical repetitions along consecutive sentences, semantic association between neighbouring lexemes, persistence of discourse types (narrative, descriptive, dialogal...) and so on. Textual autocorrelation can also be negative, as illustrated by alternating phonological or morpho-syntactic categories, or the succession of word lengths. This contribution proposes a general Markov formalism for textual navigation, inspired by spatial statistics. The formalism can express well-known constructs in textual data analysis, such as term-document matrices, navigation through references and hyperlinks, (web) information retrieval, and in particular textual autocorrelation, as measured by Moran's I relative to the exchange matrix associated with neighbourhoods of various possible types. Four case studies (word length alternation, lexical repulsion, part-of-speech autocorrelation, and semantic autocorrelation) illustrate the theory. In particular, one observes a short-range repulsion between nouns together with a short-range attraction between verbs, at both the lexical and semantic levels.
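A small numerical sketch may help make the autocorrelation index concrete. It assumes one common way of writing Moran's I with an exchange matrix: I = (D - D_loc) / D, where D is the overall inertia (half the sum of f_i f_j d_ij), D_loc is the local inertia (half the sum of e_ij d_ij), e_ij is a symmetric exchange matrix summing to 1, f_i are its row sums, and d_ij is the squared difference between the values at positions i and j. The abstract gives no formula, so this definition, the lag-based neighbourhood and the function name `morans_i` are assumptions for illustration.

```python
def morans_i(values, max_lag=1):
    """Moran's I of a numeric sequence for a neighbourhood of +/- max_lag positions."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two positions")

    # Symmetric neighbourhood: positions i and j exchange if 0 < |i - j| <= max_lag.
    weights = [
        [1.0 if 0 < abs(i - j) <= max_lag else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)

    # Exchange matrix e (entries sum to 1) and its margins f (position weights).
    e = [[w / total for w in row] for row in weights]
    f = [sum(row) for row in e]

    def d(i, j):
        # Squared difference between the values at two positions.
        return (values[i] - values[j]) ** 2

    inertia = 0.5 * sum(f[i] * f[j] * d(i, j) for i in range(n) for j in range(n))
    local = 0.5 * sum(e[i][j] * d(i, j) for i in range(n) for j in range(n))
    if inertia == 0:
        raise ValueError("all values are identical; Moran's I is undefined")
    return (inertia - local) / inertia


if __name__ == "__main__":
    # Strictly alternating word lengths: strongly negative autocorrelation.
    print(morans_i([1, 5, 1, 5, 1, 5]))
    # Short words followed by long words: positive autocorrelation.
    print(morans_i([1, 1, 1, 5, 5, 5]))
```

Under these assumptions the alternating sequence yields a value of about -1 and the blocked sequence about 0.6, which matches the intuition that alternation is negative autocorrelation and persistence is positive autocorrelation.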
Abstract:
Background: The variety of DNA microarray formats and datasets presently available offers an unprecedented opportunity to perform insightful comparisons of heterogeneous data. Cross-species studies, in particular, have the power of identifying conserved, functionally important molecular processes. Validation of discoveries can now often be performed in readily available public data, which frequently requires cross-platform studies. Cross-platform and cross-species analyses require matching probes on different microarray formats. This can be achieved using the information in microarray annotations and additional molecular biology databases, such as orthology databases. Although annotations and other biological information are stored using modern database models (e.g. relational), they are very often distributed and shared as tables in text files, i.e. flat file databases. This common flat database format thus provides a simple and robust solution to flexibly integrate various sources of information and a basis for the combined analysis of heterogeneous gene expression profiles.
Results: We provide annotationTools, a Bioconductor-compliant R package to annotate microarray experiments and integrate heterogeneous gene expression profiles using annotation and other molecular biology information available as flat file databases. First, annotationTools contains a specialized set of functions for mining this widely used database format in a systematic manner. It thus offers a straightforward solution for annotating microarray experiments. Second, building on these basic functions and relying on the combination of information from several databases, it provides tools to easily perform cross-species analyses of gene expression data. Here, we present two example applications of annotationTools that are of direct relevance for the analysis of heterogeneous gene expression profiles, namely a cross-platform mapping of probes and a cross-species mapping of orthologous probes using different orthology databases. We also show how to perform an explorative comparison of disease-related transcriptional changes in human patients and in a genetic mouse model.
Conclusion: The R package annotationTools provides a simple solution to handle microarray annotation and orthology tables, as well as other flat molecular biology databases. Thereby, it allows easy integration and analysis of heterogeneous microarray experiments across different technological platforms or species.
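The cross-species mapping described above amounts to chaining lookups through flat tables: probe to gene, gene to orthologous gene, orthologous gene to probe. The following Python sketch illustrates that idea only. It is not the annotationTools API (which is an R package), and the column names, identifiers and helper functions `load_lookup` and `chain` are hypothetical.

```python
import csv
import io


def load_lookup(text, key, value):
    """Read a tab-delimited table and map each key to a list of values."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if row[key] and row[value]:
            lookup.setdefault(row[key], []).append(row[value])
    return lookup


def chain(keys, *lookups):
    """Follow each key through successive lookups; return the final identifiers."""
    result = {}
    for start in keys:
        current = [start]
        for lookup in lookups:
            following = []
            for item in current:
                for target in lookup.get(item, []):
                    if target not in following:
                        following.append(target)
            current = following
        result[start] = current
    return result


# Toy flat-file tables with made-up identifiers (hypothetical, for illustration).
HUMAN_ANNOTATION = "probe_id\tgene_id\nhs_probe_1\tH1\nhs_probe_2\tH2\n"
ORTHOLOGY = "human_gene\tmouse_gene\nH1\tM1\n"
MOUSE_ANNOTATION = "probe_id\tgene_id\nmm_probe_7\tM1\nmm_probe_8\tM1\n"

probe_to_gene = load_lookup(HUMAN_ANNOTATION, "probe_id", "gene_id")
human_to_mouse = load_lookup(ORTHOLOGY, "human_gene", "mouse_gene")
gene_to_probe = load_lookup(MOUSE_ANNOTATION, "gene_id", "probe_id")

mapping = chain(["hs_probe_1", "hs_probe_2"], probe_to_gene, human_to_mouse, gene_to_probe)

if __name__ == "__main__":
    print(mapping)
```

Keeping every lookup as a one-to-many map means that probes with no ortholog simply end up with an empty list, and genes measured by several probes keep all of them.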
Abstract:
The manipulation of DNA is routine practice in botanical research and has made a huge impact on plant breeding, biotechnology and biodiversity evaluation. DNA is easy to extract from most plant tissues and can be stored for long periods in DNA banks. Curation methods are well developed for other botanical resources such as herbaria, seed banks and botanic gardens, but procedures for the establishment and maintenance of DNA banks have not been well documented. This paper reviews the curation of DNA banks for the characterisation and utilisation of biodiversity and provides guidelines for DNA bank management. It surveys existing DNA banks and outlines their operation. It includes a review of plant DNA collection, preservation, isolation, storage, database management and exchange procedures. We stress that DNA banks require full integration with existing collections such as botanic gardens, herbaria and seed banks, and information retrieval systems that link such facilities, bioinformatic resources and other DNA banks. They also require efficient and well-regulated sample exchange procedures. Only with appropriate curation will maximum utilisation of DNA collections be achieved.
Abstract:
In this paper we propose a novel unsupervised approach to learning domain-specific ontologies from large open-domain text collections. The method is based on the joint exploitation of Semantic Domains and Super Sense Tagging for Information Retrieval tasks. Our approach is able to retrieve domain-specific terms and concepts while associating them with a set of high-level ontological types, named supersenses, providing flat ontologies characterized by very high accuracy and pertinence to the domain.
Abstract:
Disasters are often perceived as fast and random events. Although the triggers may be sudden, disasters result from an accumulation of the consequences of inappropriate actions and decisions, and from global change. To modify this perception of risk, advocacy tools are needed. Quantitative methods have been developed to identify the distribution and the underlying factors of risk.
Disaster risk results from the intersection of hazards, exposure and vulnerability. The frequency and intensity of hazards can be influenced by climate change or by the decline of ecosystems. Population growth increases exposure, while changes in the level of development affect vulnerability. Given that each of its components may change, risk is dynamic and should be reviewed periodically by governments, insurance companies or development agencies. At the global level, these analyses are often performed using databases of reported losses. Our results show that these are likely to be biased, in particular by improvements in access to information. International loss databases are not exhaustive and give no information on exposure, intensity or vulnerability. A new approach, independent of reported losses, is necessary.
The research presented here was commissioned by the United Nations and by agencies working in development and the environment (UNDP, UNISDR, GTZ, UNEP and IUCN). These organizations needed a quantitative assessment of the underlying factors of risk, to raise awareness amongst policymakers and to prioritize disaster risk reduction projects.
The method is based on geographic information systems, remote sensing, databases and statistical analysis. It required a large amount of data (1.7 Tb of data on both the physical environment and socio-economic parameters) and several thousand hours of processing. A comprehensive risk model was developed to reveal the distribution of hazards, exposure and risk, and to identify underlying risk factors. This was done for several hazards (e.g. floods, tropical cyclones, earthquakes and landslides). Two different multiple risk indexes were generated to compare countries. The results include an evaluation of the role of hazard intensity, exposure, poverty and governance in the pattern and trends of risk. It appears that the vulnerability factors change depending on the type of hazard and that, unlike exposure, their weight decreases as the intensity increases.
Locally, the method was tested to highlight the influence of climate change and ecosystem decline on the hazard. In northern Pakistan, deforestation increases susceptibility to landslides. Research in Peru (based on satellite imagery and ground data collection) revealed a rapid glacier retreat and provided an assessment of the remaining ice volume as well as scenarios of its possible evolution.
These results were presented to different audiences, including 160 governments. The results and data generated are made available online through an open source SDI (http://preview.grid.unep.ch). The method is flexible and easily transferable to different scales and issues, with good prospects for adaptation to other research areas.
The characterization of risk at the global level and the identification of the role of ecosystems in disaster risk are developing rapidly. This research revealed many challenges; some were resolved, while others remain limitations. However, it is clear that the level of development and, moreover, unsustainable development configure a large part of disaster risk, and that the dynamics of risk are primarily governed by global change.
Abstract:
BACKGROUND: DNA sequence integrity, mRNA concentrations and protein-DNA interactions have been subject to genome-wide analyses based on microarrays with ever-increasing efficiency and reliability over the past fifteen years. Very recently, however, novel technologies for Ultra High-Throughput DNA Sequencing (UHTS) have been harnessed to study these phenomena with unprecedented precision. As a consequence, the extensive bioinformatics environment available for array data management, analysis, interpretation and publication must be extended to include these novel sequencing data types. DESCRIPTION: MIMAS was originally conceived as a simple, convenient and local Microarray Information Management and Annotation System focused on GeneChips for expression profiling studies. MIMAS 3.0 enables users to manage data from high-density oligonucleotide SNP Chips, expression arrays (both 3'UTR and tiling) and promoter arrays, BeadArrays as well as UHTS data using MIAME-compliant standardized vocabulary. Importantly, researchers can export data in MAGE-TAB format and upload them to the EBI's ArrayExpress certified data repository using a one-step procedure. CONCLUSION: We have vastly extended the capability of the system such that it processes the data output of six types of GeneChips (Affymetrix), two different BeadArrays for mRNA and miRNA (Illumina) and the Genome Analyzer (a popular Ultra-High Throughput DNA Sequencer, Illumina), without compromising on its flexibility and user-friendliness. MIMAS, appropriately renamed Multiomics Information Management and Annotation System, is currently used by scientists working in approximately 50 academic laboratories and genomics platforms in Switzerland and France. MIMAS 3.0 is freely available via http://multiomics.sourceforge.net/.