34 resultados para Annotation de génomes
em Universidad Politécnica de Madrid
Resumo:
OntoTag - A Linguistic and Ontological Annotation Model Suitable for the Semantic Web
1. INTRODUCTION. LINGUISTIC TOOLS AND ANNOTATIONS: THEIR LIGHTS AND SHADOWS
Computational Linguistics is already a consolidated research area. It builds upon the results of other two major ones, namely Linguistics and Computer Science and Engineering, and it aims at developing computational models of human language (or natural language, as it is termed in this area). Possibly, its most well-known applications are the different tools developed so far for processing human language, such as machine translation systems and speech recognizers or dictation programs.
These tools for processing human language are commonly referred to as linguistic tools. Apart from the examples mentioned above, there are also other types of linguistic tools that perhaps are not so well-known, but on which most of the other applications of Computational Linguistics are built. These other types of linguistic tools comprise POS taggers, natural language parsers and semantic taggers, amongst others. All of them can be termed linguistic annotation tools.
Linguistic annotation tools are important assets. In fact, POS and semantic taggers (and, to a lesser extent, also natural language parsers) have become critical resources for the computer applications that process natural language. Hence, any computer application that has to analyse a text automatically and ‘intelligently’ will include at least a module for POS tagging. The more an application needs to ‘understand’ the meaning of the text it processes, the more linguistic tools and/or modules it will incorporate and integrate.
However, linguistic annotation tools have still some limitations, which can be summarised as follows:
1. Normally, they perform annotations only at a certain linguistic level (that is, Morphology, Syntax, Semantics, etc.).
2. They usually introduce a certain rate of errors and ambiguities when tagging. This error rate ranges from 10 percent up to 50 percent of the units annotated for unrestricted, general texts.
3. Their annotations are most frequently formulated in terms of an annotation schema designed and implemented ad hoc.
A priori, it seems that the interoperation and the integration of several linguistic tools into an appropriate software architecture could most likely solve the limitations stated in (1). Besides, integrating several linguistic annotation tools and making them interoperate could also minimise the limitation stated in (2). Nevertheless, in the latter case, all these tools should produce annotations for a common level, which would have to be combined in order to correct their corresponding errors and inaccuracies. Yet, the limitation stated in (3) prevents both types of integration and interoperation from being easily achieved.
In addition, most high-level annotation tools rely on other lower-level annotation tools and their outputs to generate their own ones. For example, sense-tagging tools (operating at the semantic level) often use POS taggers (operating at a lower level, i.e., the morphosyntactic) to identify the grammatical category of the word or lexical unit they are annotating. Accordingly, if a faulty or inaccurate low-level annotation tool is to be used by other higher-level one in its process, the errors and inaccuracies of the former should be minimised in advance. Otherwise, these errors and inaccuracies would be transferred to (and even magnified in) the annotations of the high-level annotation tool.
Therefore, it would be quite useful to find a way to
(i) correct or, at least, reduce the errors and the inaccuracies of lower-level linguistic tools;
(ii) unify the annotation schemas of different linguistic annotation tools or, more generally speaking, make these tools (as well as their annotations) interoperate.
Clearly, solving (i) and (ii) should ease the automatic annotation of web pages by means of linguistic tools, and their transformation into Semantic Web pages (Berners-Lee, Hendler and Lassila, 2001). Yet, as stated above, (ii) is a type of interoperability problem. There again, ontologies (Gruber, 1993; Borst, 1997) have been successfully applied thus far to solve several interoperability problems. Hence, ontologies should help solve also the problems and limitations of linguistic annotation tools aforementioned.
Thus, to summarise, the main aim of the present work was to combine somehow these separated approaches, mechanisms and tools for annotation from Linguistics and Ontological Engineering (and the Semantic Web) in a sort of hybrid (linguistic and ontological) annotation model, suitable for both areas. This hybrid (semantic) annotation model should (a) benefit from the advances, models, techniques, mechanisms and tools of these two areas; (b) minimise (and even solve, when possible) some of the problems found in each of them; and (c) be suitable for the Semantic Web. The concrete goals that helped attain this aim are presented in the following section.
2. GOALS OF THE PRESENT WORK
As mentioned above, the main goal of this work was to specify a hybrid (that is, linguistically-motivated and ontology-based) model of annotation suitable for the Semantic Web (i.e. it had to produce a semantic annotation of web page contents). This entailed that the tags included in the annotations of the model had to (1) represent linguistic concepts (or linguistic categories, as they are termed in ISO/DCR (2008)), in order for this model to be linguistically-motivated; (2) be ontological terms (i.e., use an ontological vocabulary), in order for the model to be ontology-based; and (3) be structured (linked) as a collection of ontology-based
Resumo:
We present two new algorithms which perform automatic parallelization via source-to-source transformations. The objective is to exploit goal-level, unrestricted independent and-parallelism. The proposed algorithms use as targets new parallel execution primitives which are simpler and more flexible than the well-known &/2 parallel operator. This makes it possible to genérate better parallel expressions by exposing more potential parallelism among the literals of a clause than is possible with &/2. The difference between the two algorithms stems from whether the order of the solutions obtained is preserved or not. We also report on a preliminary evaluation of an implementation of our approach. We compare the performance obtained to that of previous annotation algorithms and show that relevant improvements can be obtained.
Resumo:
Actualmente, la Web provee un inmenso conjunto de servicios (WS-*, RESTful, OGC WFS), los cuales están normalmente expuestos a través de diferentes estándares que permiten localizar e invocar a estos servicios. Estos servicios están, generalmente, descritos utilizando información textual, sin una descripción formal, es decir, la descripción de los servicios es únicamente sintáctica. Para facilitar el uso y entendimiento de estos servicios, es necesario anotarlos de manera formal a través de la descripción de los metadatos. El objetivo de esta tesis es proponer un enfoque para la anotación semántica de servicios Web en el dominio geoespacial. Este enfoque permite automatizar algunas de las etapas del proceso de anotación, mediante el uso combinado de recursos ontológicos y servicios externos. Este proceso ha sido evaluado satisfactoriamente con un conjunto de servicios en el dominio geoespacial. La contribución principal de este trabajo es la automatización parcial del proceso de anotación semántica de los servicios RESTful y WFS, lo cual mejora el estado del arte en esta área. Una lista detallada de las contribuciones son: • Un modelo para representar servicios Web desde el punto de vista sintáctico y semántico, teniendo en cuenta el esquema y las instancias. • Un método para anotar servicios Web utilizando ontologías y recursos externos. • Un sistema que implementa el proceso de anotación propuesto. • Un banco de pruebas para la anotación semántica de servicios RESTful y OGC WFS. Abstract The Web contains an immense collection of Web services (WS-*, RESTful, OGC WFS), normally exposed through standards that tell us how to locate and invocate them. These services are usually described using mostly textual information and without proper formal descriptions, that is, existing service descriptions mostly stay on a syntactic level. If we want to make such services potentially easier to understand and use, we may want to annotate them formally, by means of descriptive metadata. The objective of this thesis is to propose an approach for the semantic annotation of services in the geospatial domain. Our approach automates some stages of the annotation process, by using a combination of thirdparty resources and services. It has been successfully evaluated with a set of geospatial services. The main contribution of this work is the partial automation of the process of RESTful and WFS semantic annotation services, what improves the current state of the art in this area. The more detailed list of contributions are: • A model for representing Web services. • A method for annotating Web services using ontological and external resources. • A system that implements the proposed annotation process. • A gold standard for the semantic annotation of RESTful and OGC WFS services, and algorithms for evaluating the annotations.
Resumo:
This paper describes a framework for annotation on travel blogs based on subjectivity (FATS). The framework has the capability to auto-annotate -sentence by sentence- sections from blogs (posts) about travelling in the Spanish language. FATS is used in this experiment to annotate com- ponents from travel blogs in order to create a corpus of 300 annotated posts. Each subjective element in a sentence is annotated as positive or negative as appropriate. Currently correct annotations add up to about 95 per cent in our subset of the travel domain. By means of an iterative process of annotation we can create a subjectively annotated domain specific corpus.
Resumo:
Nowadays, there is a great amount of genomic and transcriptomic data available about forest species, including ambitious projects looking for complete sequencing and annotation of different gymnosperm genomes [1, 2]. Pinus canariensis is an endemic conifer of the Canary Islands with re-sprouting capability and resilience against fire and mechanical damage, as result of an adaptation to volcanic environments. Additionally, this species has a high proportion of axial parenchyma compared with other conifers, and this tissue connects with radial parenchyma allowing transport of reserves. The most internal tracheids stop accumulating water [3], and get filled of resins and polyphenols synthesized by the axial parenchyma; this is the so-called ?torch-heartwood? [4], which avoids decay. This wood achieves very high prices due to its particular resistance to rot. These features make P. canariensis an interesting model species for the analysis of these developmental processes in conifers. In this study we aim to perform a complete transcriptome annotation during xylogenesis in Pinus canariensis, using next-generation sequencing (NGS) -Roche 454 pyrosequencing-, in order to provide a genomic resource for further analysis, including expression profiling and the identification of candidate genes for important adaptive features.
Resumo:
Protein-coding gene families are sets of similar genes with a shared evolutionary origin and, generally, with similar biological functions. In plants, the size and role of gene families has been only partially addressed. However, suitable bioinformatics tools are being developed to cluster the enormous number of sequences currently available in databases. Specifically, comparative genomic databases promise to become powerful tools for gene family annotation in plant clades. In this review, I evaluate the data retrieved from various gene family databases, the ease with which they can be extracted and how useful the extracted information is.
Resumo:
Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to congure the annotations to their specic needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation condence. We compare our approach with the state of the art in disambiguation, and evaluate our results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of our system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.
Resumo:
Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence. We compare our approach with the state of the art in disambiguation, and evaluate our results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of our system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.
Resumo:
This paper describes a novel architecture to introduce automatic annotation and processing of semantic sensor data within context-aware applications. Based on the well-known state-charts technologies, and represented using W3C SCXML language combined with Semantic Web technologies, our architecture is able to provide enriched higher-level semantic representations of user’s context. This capability to detect and model relevant user situations allows a seamless modeling of the actual interaction situation, which can be integrated during the design of multimodal user interfaces (also based on SCXML) for them to be adequately adapted. Therefore, the final result of this contribution can be described as a flexible context-aware SCXML-based architecture, suitable for both designing a wide range of multimodal context-aware user interfaces, and implementing the automatic enrichment of sensor data, making it available to the entire Semantic Sensor Web
Resumo:
A framework for the automatic parallelization of (constraint) logic programs is proposed and proved correct. Intuitively, the parallelization process replaces conjunctions of literals with parallel expressions. Such expressions trigger at run-time the exploitation of restricted, goal-level, independent and-parallelism. The parallelization process performs two steps. The first one builds a conditional dependency graph (which can be implified using compile-time analysis information), while the second transforms the resulting graph into linear conditional expressions, the parallel expressions of the &-Prolog language. Several heuristic algorithms for the latter ("annotation") process are proposed and proved correct. Algorithms are also given which determine if there is any loss of parallelism in the linearization process with respect to a proposed notion of maximal parallelism. Finally, a system is presented which implements the proposed approach. The performance of the different annotation algorithms is compared experimentally in this system by studying the time spent in parallelization and the effectiveness of the results in terms of speedups.
Resumo:
Logic programming systems which exploit and-parallelism among non-deterministic goals rely on notions of independence among those goals in order to ensure certain efficiency properties. "Non-strict" independence (NSI) is a more relaxed notion than the traditional notion of "strict" independence (SI) which still ensures the relevant efficiency properties and can allow considerable more parallelism than SI. However, all compilation technology developed to date has been based on SI, because of the intrinsic complexity of exploiting NSI. This is related to the fact that NSI cannot be determined "a priori" as SI. This paper filis this gap by developing a technique for compile-time detection and annotation of NSI. It also proposes algorithms for combined compiletime/ run-time detection, presenting novel run-time checks for this type of parallelism. Also, a transformation procedure to eliminate shared variables among parallel goals is presented, aimed at performing as much work as possible at compile-time. The approach is based on the knowledge of certain properties regarding the run-time instantiations of program variables —sharing and freeness— for which compile-time technology is available, with new approaches being currently proposed. Thus, the paper does not deal with the analysis itself, but rather with how the analysis results can be used to parallelize programs.
Resumo:
There has been significant interest in parallel execution models for logic programs which exploit Independent And-Parallelism (IAP). In these models, it is necessary to determine which goals are independent and therefore eligible for parallel execution and which goals have to wait for which others during execution. Although this can be done at run-time, it can imply a very heavy overhead. In this paper, we present three algorithms for automatic compiletime parallelization of logic programs using IAP. This is done by converting a clause into a graph-based computational form and then transforming this graph into linear expressions based on &-Prolog, a language for IAP. We also present an algorithm which, given a clause, determines if there is any loss of parallelism due to linearization, for the case in which only unconditional parallelism is desired. Finally, the performance of these annotation algorithms is discussed for some benchmark programs.
Resumo:
Abstract Idea Management Systems are web applications that implement the notion of open innovation though crowdsourcing. Typically, organizations use those kind of systems to connect to large communities in order to gather ideas for improvement of products or services. Originating from simple suggestion boxes, Idea Management Systems advanced beyond collecting ideas and aspire to be a knowledge management solution capable to select best ideas via collaborative as well as expert assessment methods. In practice, however, the contemporary systems still face a number of problems usually related to information overflow and recognizing questionable quality of submissions with reasonable time and effort allocation. This thesis focuses on idea assessment problem area and contributes a number of solutions that allow to filter, compare and evaluate ideas submitted into an Idea Management System. With respect to Idea Management System interoperability the thesis proposes theoretical model of Idea Life Cycle and formalizes it as the Gi2MO ontology which enables to go beyond the boundaries of a single system to compare and assess innovation in an organization wide or market wide context. Furthermore, based on the ontology, the thesis builds a number of solutions for improving idea assessment via: community opinion analysis (MARL), annotation of idea characteristics (Gi2MO Types) and study of idea relationships (Gi2MO Links). The main achievements of the thesis are: application of theoretical innovation models for practice of Idea Management to successfully recognize the differentiation between communities, opinion metrics and their recognition as a new tool for idea assessment, discovery of new relationship types between ideas and their impact on idea clustering. Finally, the thesis outcome is establishment of Gi2MO Project that serves as an incubator for Idea Management solutions and mature open-source software alternatives for the widely available commercial suites. From the academic point of view the project delivers resources to undertake experiments in the Idea Management Systems area and managed to become a forum that gathered a number of academic and industrial partners. Resumen Los Sistemas de Gestión de Ideas son aplicaciones Web que implementan el concepto de innovación abierta con técnicas de crowdsourcing. Típicamente, las organizaciones utilizan ese tipo de sistemas para conectar con comunidades grandes y así recoger ideas sobre cómo mejorar productos o servicios. Los Sistemas de Gestión de Ideas lian avanzado más allá de recoger simplemente ideas de buzones de sugerencias y ahora aspiran ser una solución de gestión de conocimiento capaz de seleccionar las mejores ideas por medio de técnicas colaborativas, así como métodos de evaluación llevados a cabo por expertos. Sin embargo, en la práctica, los sistemas contemporáneos todavía se enfrentan a una serie de problemas, que, por lo general, están relacionados con la sobrecarga de información y el reconocimiento de las ideas de dudosa calidad con la asignación de un tiempo y un esfuerzo razonables. Esta tesis se centra en el área de la evaluación de ideas y aporta una serie de soluciones que permiten filtrar, comparar y evaluar las ideas publicadas en un Sistema de Gestión de Ideas. Con respecto a la interoperabilidad de los Sistemas de Gestión de Ideas, la tesis propone un modelo teórico del Ciclo de Vida de la Idea y lo formaliza como la ontología Gi2MO que permite ir más allá de los límites de un sistema único para comparar y evaluar la innovación en un contexto amplio dentro de cualquier organización o mercado. Por otra parte, basado en la ontología, la tesis desarrolla una serie de soluciones para mejorar la evaluación de las ideas a través de: análisis de las opiniones de la comunidad (MARL), la anotación de las características de las ideas (Gi2MO Types) y el estudio de las relaciones de las ideas (Gi2MO Links). Los logros principales de la tesis son: la aplicación de los modelos teóricos de innovación para la práctica de Sistemas de Gestión de Ideas para reconocer las diferenciasentre comu¬nidades, métricas de opiniones de comunidad y su reconocimiento como una nueva herramienta para la evaluación de ideas, el descubrimiento de nuevos tipos de relaciones entre ideas y su impacto en la agrupación de estas. Por último, el resultado de tesis es el establecimiento de proyecto Gi2MO que sirve como incubadora de soluciones para Gestión de Ideas y herramientas de código abierto ya maduras como alternativas a otros sistemas comerciales. Desde el punto de vista académico, el proyecto ha provisto de recursos a ciertos experimentos en el área de Sistemas de Gestión de Ideas y logró convertirse en un foro que reunión para un número de socios tanto académicos como industriales.
Resumo:
We present new algorithms which perform automatic parallelization via source-to-source transformations. The objective is to exploit goal-level, unrestricted independent andparallelism. The proposed algorithms use as targets new parallel execution primitives which are simpler and more flexible than the well-known &/2 parallel operator, which makes it possible to generate better parallel expressions by exposing more potential parallelism among the literals of a clause than is possible with &/2. The main differences between the algorithms stem from whether the order of the solutions obtained is preserved or not, and on the use of determinacy information. We briefly describe the environment where the algorithms have been implemented and the runtime platform in which the parallelized programs are executed. We also report on an evaluation of an implementation of our approach. We compare the performance obtained to that of previous annotation algorithms and show that relevant improvements can be obtained.
Resumo:
Logic programming systems which exploit and-parallelism among non-deterministic goals rely on notions of independence among those goals in order to ensure certain efficiency properties. "Non-strict" independence (NSI) is a more relaxed notion than the traditional notion of "strict" independence (SI) which still ensures the relevant efficiency properties and can allow considerable more parallelism than SI. However, all compilation technology developed to date has been based on SI, presumably because of the intrinsic complexity of exploiting NSI. This is related to the fact that NSI cannot be determined "a priori" as SI. This paper fills this gap by developing a technique for compile-time detection and annotation of NSI. It also proposes algorithms for combined compile- time/run-time detection, presenting novel run-time checks for this type of parallelism. Also, a transformation procedure to eliminate shared variables among parallel goals is presented, attempting to perform as much work as possible at compiletime. The approach is based on the knowledge of certain properties about run-time instantiations of program variables —sharing and freeness— for which compile-time technology is available, with new approaches being currently proposed.