711 results for Annotation informatisée
Abstract:
Various types of physical mapping data were assembled by developing a set of computer programs (Integrated Mapping Package) to derive a detailed, annotated map of a 4-Mb region of human chromosome 13 that includes the BRCA2 locus. The final assembly consists of a yeast artificial chromosome (YAC) contig with 42 members spanning the 13q12-13 region and aligned contigs of 399 cosmids established by cross-hybridization between the cosmids, which were selected from a chromosome 13-specific cosmid library using inter-Alu PCR probes from the YACs. The end sequences of 60 cosmids spaced nearly evenly across the map were used to generate sequence-tagged sites (STSs), which were mapped to the YACs by PCR. A contig framework was generated by STS content mapping, and the map was assembled on this scaffold. Additional annotation was provided by 72 expressed sequences and 10 genetic markers that were positioned on the map by hybridization to cosmids.
Abstract:
The field of natural language processing (NLP) has seen a dramatic shift in both research direction and methodology in the past several years. In the past, most work in computational linguistics tended to focus on purely symbolic methods. Recently, more and more work is shifting toward hybrid methods that combine new empirical corpus-based methods, including the use of probabilistic and information-theoretic techniques, with traditional symbolic methods. This work is made possible by the recent availability of linguistic databases that add rich linguistic annotation to corpora of natural language text. Already, these methods have led to a dramatic improvement in the performance of a variety of NLP systems with similar improvement likely in the coming years. This paper focuses on these trends, surveying in particular three areas of recent progress: part-of-speech tagging, stochastic parsing, and lexical semantics.
Abstract:
Predicting the biological function of deoxyribonucleic acid (DNA) sequences is one of the greatest challenges facing bioinformatics. This task is called functional annotation, and it is a complex, laborious, and time-consuming process. Given its impact on future research and annotations, annotation must be as reliable and accurate as possible. Ideally, sequences should be studied and annotated manually by an expert, thereby guaranteeing accurate, high-quality results. However, manual annotation is feasible only for small data sets or reference genomes. With the arrival of new sequencing technologies, the volume of data has grown significantly, making the need for automatic implementations of the process even more critical. Automatic annotation, for its part, can handle large amounts of data and produce a consistent analysis. Another advantage of this approach is its speed and low cost relative to manual annotation. However, its results are less accurate than manual ones and, in general, must be reviewed (curated) by an expert. Although collaborative community annotation processes can be used to reduce this bottleneck, efforts along these lines have so far not had the expected success. Moreover, the annotation problem, like many others in the bioinformatics domain, involves heterogeneous, distributed, and constantly evolving information. One possible approach to overcoming these problems is to shift the focus of the process from individual experts to their community, and to design tools so that they facilitate the management of knowledge and resources. This work adopts this line and proposes MASSA (Multi-Agent System to Support functional Annotation), a Multi-Agent System (MAS) architecture to support functional annotation...
Abstract:
Contributions on research into Mediterranean coastal tourist destinations, presented at the workshop on the exchange and transfer of results held in May 2010, with the participation of the Grupo de Investigación sobre Sostenibilidad y Territorio (GIST) of the Universidad de les Illes Balears, the Grupo de Investigación en Análisis Territorial y Estudios Turísticos (GRATET) of the Universidad Rovira i Virgili, and the Grupo de Investigación en Planificación y Gestión Sostenible del Turismo of the Universidad de Alicante. During the workshop, theoretical, methodological, and applied approaches to the territorial implantation of tourism on the Spanish Mediterranean coast were debated and advanced.
Abstract:
The importance of new textual genres such as blogs and forum entries is growing in parallel with the evolution of the Social Web. This paper presents two corpora of blog posts, in English and in Spanish, annotated according to the EmotiBlog annotation scheme. Furthermore, we created 20 factual and opinionated questions for each language, together with the Gold Standard for their answers in the corpus. The purpose of our work is to study the challenges involved in a mixed fact and opinion question answering setting by comparing the performance of two Question Answering (QA) systems in that setting. The first one is open domain, while the second one is opinion-oriented. We evaluate the two systems separately in both languages and propose possible solutions to improve QA systems that have to process mixed questions.
Abstract:
This paper presents a system for recognizing temporal expressions in Spanish and resolving their temporal reference. Identification and recognition of temporal expressions rely on a temporal expression grammar, and resolution relies on an inference engine that holds the information needed to perform the date operations on the recognized expressions. To support further processing, the output is given as XML tags that add standard information about the resolution obtained. Different kinds of annotation of temporal expressions are explained in other articles [WILSON2001][KATZ2001]. In the evaluation of our proposal we have obtained successful results.
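The recognize-then-resolve pipeline described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' grammar or inference engine: the handful of Spanish patterns, the `TIMEX` tag name, the `VAL` attribute, and the `annotate` function are all hypothetical choices made for illustration, and resolution is reduced to adding a day offset to a reference date.

```python
import re
from datetime import date, timedelta

# Hypothetical mini-grammar: a few Spanish deictic expressions only.
# Longer alternatives come first so "pasado mañana" wins over "mañana".
PATTERN = re.compile(
    r"\b(?:hace (?P<ago>\d+) días|dentro de (?P<ahead>\d+) días|"
    r"anteayer|ayer|hoy|pasado mañana|mañana)\b",
    re.IGNORECASE,
)

# Day offsets for the fixed expressions (assumed, illustrative).
FIXED_OFFSETS = {
    "anteayer": -2,
    "ayer": -1,
    "hoy": 0,
    "mañana": 1,
    "pasado mañana": 2,
}


def annotate(text: str, reference: date) -> str:
    """Wrap each recognized expression in an XML tag carrying the resolved date."""

    def resolve(match: re.Match) -> str:
        if match.group("ago") is not None:
            offset = -int(match.group("ago"))
        elif match.group("ahead") is not None:
            offset = int(match.group("ahead"))
        else:
            offset = FIXED_OFFSETS[match.group(0).lower()]
        resolved = reference + timedelta(days=offset)
        # TIMEX / VAL are placeholder names, not a scheme taken from the paper.
        return f'<TIMEX VAL="{resolved.isoformat()}">{match.group(0)}</TIMEX>'

    return PATTERN.sub(resolve, text)


if __name__ == "__main__":
    print(annotate("Llegó ayer y se va pasado mañana.", date(2001, 5, 10)))
```

The reference date plays the role of the document date; a real system would also need rules for absolute dates, weekdays, and expressions anchored to previously mentioned dates.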
Abstract:
This paper presents an empirical study of the anaphoric accessibility space in Spanish dialogues. According to this study, antecedents of pronominal and adjectival anaphors can almost always (95.9%) be found in the set of noun phrases taken from spaces defined using a structure based on adjacency pairs. Furthermore, a reliable annotation scheme for Spanish dialogues is proposed in order to define this anaphoric accessibility space. Using this annotation scheme, anaphora resolution algorithms can locate the appropriate set of candidate antecedents for each anaphor.
Abstract:
Preliminary research demonstrated the relevance of the EmotiBlog annotated corpus as a Machine Learning resource for detecting subjective data. In this paper we compare EmotiBlog with the JRC Quotes corpus in order to check the robustness of its annotation. We concentrate on its coarse-grained labels and carry out extensive Machine Learning experiments, including some that incorporate lexical resources. The results are similar to those obtained with the JRC Quotes corpus, demonstrating the validity of EmotiBlog as a resource for the SA task.
Abstract:
The development of the Web 2.0 led to the birth of new textual genres such as blogs, reviews or forum entries. The increasing number of such texts and the highly diverse topics they discuss make blogs a rich source for analysis. This paper presents a comparative study on open domain and opinion QA systems. A collection of opinion and mixed fact-opinion questions in English is defined and two Question Answering systems are employed to retrieve the answers to these queries. The first one is generic, while the second is specific for emotions. We comparatively evaluate and analyze the systems’ results, concluding that opinion Question Answering requires the use of specific resources and methods.
Abstract:
The exponential growth of subjective information in the framework of the Web 2.0 has led to the need to create Natural Language Processing tools able to analyse and process such data for multiple practical applications. They require training on specifically annotated corpora, whose level of detail must be fine enough to capture the phenomena involved. This paper presents EmotiBlog, a fine-grained annotation scheme for subjectivity. We show how it is built and demonstrate the benefits it brings to the systems that use it for training, through the experiments we carried out on opinion mining and emotion detection. We employ corpora of different textual genres: a set of annotated reported speech extracted from news articles, the set of news titles annotated with polarity and emotion from SemEval 2007 (Task 14), and ISEAR, a corpus of real-life self-expressed emotion. We also show how the model built from the EmotiBlog annotations can be enhanced with external resources. The results demonstrate that EmotiBlog, through its structure and annotation paradigm, offers high-quality training data for systems dealing with both opinion mining and emotion detection.
Abstract:
This paper presents the first version of EmotiBlog, an annotation scheme for emotions in non-traditional textual genres such as blogs or forums. We collected a corpus composed of blog posts in three languages (English, Spanish, and Italian) about three topics of interest. Subsequently, we annotated our collection, measured inter-annotator agreement, and carried out a ten-fold cross-validation evaluation, obtaining promising results. The main aim of this research is to provide a finer-grained annotation scheme and annotated data, which are essential for evaluations focused on checking the quality of the created resources.
Abstract:
This paper presents a preliminary study in which Machine Learning experiments were applied to Opinion Mining in blogs. We created and annotated a blog corpus in Spanish using EmotiBlog. We evaluated the utility of the labelled features, first by carrying out experiments with combinations of them and second by using feature selection techniques. We also deal with several problems, such as the noisy character of the input texts, the small size of the training set, the granularity of the annotation scheme, and the language under study, Spanish, which has fewer resources than English. We obtained promising results considering that it is a preliminary study.
Abstract:
EmotiBlog is a corpus labelled with the annotation scheme of the same name, designed for detecting subjectivity in the new textual genres. Preliminary research demonstrated its relevance as a Machine Learning resource for detecting opinionated data. In this paper we compare EmotiBlog with the JRC corpus in order to check the robustness of the EmotiBlog annotation. For this research we concentrate on its coarse-grained labels. We carry out extensive ML experiments, including some that incorporate lexical resources. The results are similar to those obtained with the JRC corpus, demonstrating the validity of EmotiBlog as a resource for the SA task.
Abstract:
This paper presents a multilingual method for event ordering based on temporal expression resolution. The method has been implemented in the TERSEO system, which consists of three main units: temporal expression recognition, resolution of the coreference introduced by these expressions, and event ordering. By means of this system, chronological information related to events can be extracted from document databases. This information is automatically added to the document database so that question answering systems can use it for questions involving temporality. The system has been evaluated, obtaining 91% precision and 71% recall. For this, a blind evaluation process was developed to guarantee a reliable annotation process, whose reliability was measured with the kappa factor.
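As an illustration of how annotation reliability can be measured with a kappa statistic, the following Python sketch computes Cohen's kappa for two annotators. This is an assumption: the abstract does not say which kappa variant was used or how many annotators took part, and the function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy data (not from the paper): 1 = temporal expression, 0 = not.
    annotator_1 = [1, 1, 0, 0]
    annotator_2 = [1, 0, 0, 0]
    print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for chance, so two annotators who agree on three of four items in this toy case score well below 0.75.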
Abstract:
The McCabe-Thiele method is a classical approximate graphical method for the conceptual design of binary distillation columns. It is still widely used, mainly for teaching purposes, though it is also valuable for quick preliminary calculations. Nevertheless, no complete description of the method has been found, and situations such as different thermal feed conditions, multiple feeds, the possibility of extracting by-products, or the addition or removal of heat are not always considered. In the present work we provide a systematic analysis of such situations by developing the generalized equations for: a) the operating lines (OL) of each sector, and b) the changeover line that provides the connection between two consecutive trays of the corresponding sectors separated by a lateral stream of feed, product, or heat removal or addition.
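For orientation, the simplest textbook case (a single feed, constant molar overflow) can be sketched in Python. This is not the generalized formulation the abstract refers to: it only encodes the classical rectifying operating line, y = R/(R+1)·x + x_D/(R+1), and the feed (q) line, y = q/(q-1)·x - z_F/(q-1), and finds where they cross. The function and parameter names are hypothetical.

```python
def rectifying_line(reflux_ratio, x_distillate):
    """Slope and intercept of the classical rectifying operating line."""
    slope = reflux_ratio / (reflux_ratio + 1.0)
    intercept = x_distillate / (reflux_ratio + 1.0)
    return slope, intercept


def feed_line_intersection(reflux_ratio, x_distillate, q, z_feed):
    """Point (x, y) where the rectifying line meets the feed (q) line.

    q is the liquid fraction of the feed: q = 1 for saturated liquid
    (vertical q-line), q = 0 for saturated vapour (horizontal q-line).
    """
    slope, intercept = rectifying_line(reflux_ratio, x_distillate)
    if q == 1.0:
        # Saturated liquid feed: the q-line is the vertical line x = z_feed.
        x = z_feed
    else:
        q_slope = q / (q - 1.0)
        q_intercept = -z_feed / (q - 1.0)
        if q_slope == slope:
            raise ValueError("operating line and q-line are parallel")
        x = (q_intercept - intercept) / (slope - q_slope)
    return x, slope * x + intercept


if __name__ == "__main__":
    # Illustrative values only: R = 2, x_D = 0.9, half-vaporized feed at z_F = 0.5.
    print(feed_line_intersection(2.0, 0.9, 0.5, 0.5))
```

The stripping operating line passes through this intersection point and the bottoms composition on the diagonal, which is why the crossing is usually the first thing computed when drawing the diagram.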