28 resultados para cross-language information retrieval
Resumo:
La proliferación en todos los ámbitos de la producción multimedia está dando lugar a la aparición de nuevos paradigmas de recuperación de información visual. Dentro de éstos, uno de los más significativos es el de los sistemas de recuperación de información visual, VIRS (Visual Information Retrieval Systems), en los que una de las tareas más representativas es la ordenación de una población de imágenes según su similitud con un ejemplo dado. En este trabajo se presenta una propuesta original para la evaluación de la similitud entre dos imágenes, basándose en la extensión del concepto de saliencia desde el espacio de imágenes al de características para establecer la relevancia de cada componente de dicho vector. Para ello se introducen metodologías para la cuantificación de la saliencia de valores individuales de características, para la combinación de estas cuantificaciones en procesos de comparación entre dos imágenes, y para, finalmente, establecer la mencionada ponderación de cada característica en atención a esta combinación. Se presentan igualmente los resultados de evaluar esta propuesta en una tarea de recuperación de imágenes por contenido en comparación con los obtenidos con la distancia euclídea. Esta comparación se realiza mediante la evaluación de ambos resultados por voluntarios.
Resumo:
Over the last few decades, the ever-increasing output of scientific publications has led to new challenges to keep up to date with the literature. In the biomedical area, this growth has introduced new requirements for professionals, e.g., physicians, who have to locate the exact papers that they need for their clinical and research work amongst a huge number of publications. Against this backdrop, novel information retrieval methods are even more necessary. While web search engines are widespread in many areas, facilitating access to all kinds of information, additional tools are required to automatically link information retrieved from these engines to specific biomedical applications. In the case of clinical environments, this also means considering aspects such as patient data security and confidentiality or structured contents, e.g., electronic health records (EHRs). In this scenario, we have developed a new tool to facilitate query building to retrieve scientific literature related to EHRs. Results: We have developed CDAPubMed, an open-source web browser extension to integrate EHR features in biomedical literature retrieval approaches. Clinical users can use CDAPubMed to: (i) load patient clinical documents, i.e., EHRs based on the Health Level 7-Clinical Document Architecture Standard (HL7-CDA), (ii) identify relevant terms for scientific literature search in these documents, i.e., Medical Subject Headings (MeSH), automatically driven by the CDAPubMed configuration, which advanced users can optimize to adapt to each specific situation, and (iii) generate and launch literature search queries to a major search engine, i.e., PubMed, to retrieve citations related to the EHR under examination. Conclusions: CDAPubMed is a platform-independent tool designed to facilitate literature searching using keywords contained in specific EHRs. CDAPubMed is visually integrated, as an extension of a widespread web browser, within the standard PubMed interface. It has been tested on a public dataset of HL7-CDA documents, returning significantly fewer citations since queries are focused on characteristics identified within the EHR. For instance, compared with more than 200,000 citations retrieved by breast neoplasm, fewer than ten citations were retrieved when ten patient features were added using CDAPubMed. This is an open source tool that can be freely used for non-profit purposes and integrated with other existing systems.
Resumo:
El proyecto que he realizado ha consistido en la creación de un sistema de información geográfica para el Campus Sur UPM, que puede servir de referencia para su implantación en cualquier otro campus universitario. Esta idea surge de la necesidad por parte de los usuarios de un campus de disponer de una herramienta que les permita consultar la información de los distintos lugares y servicios del campus, haciendo especial hincapié en su localización geográfica. Para ello ha sido necesario estudiar las tecnologías actuales que permiten implementar un sistema de información geográfica, dando lugar al sistema propuesto, que consiste en un conjunto de medios informáticos (hardware y software), que van a permitir al personal del campus obtener la información y localización de los elementos del campus desde su móvil. Tras realizar un análisis de los requisitos y funcionalidades que debía tener el sistema, el proyecto ha consistido en el diseño e implementación de dicho sistema. La información a consultar estará almacenada y disponible para su consulta en un equipo servidor accesible para el personal del campus. Para ello, durante la realización del proyecto, ha sido necesario crear un modelo de datos basado en el campus y cargar los datos geográficos de utilidad en una base de datos. Todo esto ha sido realizado mediante el producto software Smallword Core 4.2. Además, ha sido también necesario desplegar un software servidor que permita a los usuarios consultar dichos datos desde sus móviles vía WIFI o Internet, el producto utilizado para este fin ha sido Smallworld Geospatial Server 4.2. Para la realización de las consultas se han utilizado los servicios WMS(Web Map Service) y WFS(Web Feature Service) definidos por el OGC(Open Geospatial Consortium). Estos servicios están adaptados para la consulta de información geográfica. El sistema también está compuesto por una aplicación para dispositivos móviles con sistema operativo Android, que permite a los usuarios del sistema consultar y visualizar la información geográfica del campus. Dicha aplicación ha sido diseñada y programada a lo largo de la realización del proyecto. Para la realización de este proyecto también ha sido necesario un estudio del presupuesto que supondría una implantación real del sistema y el mantenimiento que implicaría tener el sistema actualizado. Por último, el proyecto incluye una breve descripción de las tecnologías futuras que podrían mejorar las funcionalidades del sistema: la realidad aumentada y el posicionamiento en el interior de edificios. ABSTRACT. The project I've done has been to create a geographic information system for the Campus Sur UPM, which can serve as a reference for implementation in any other college campus. This idea arises from the need for the campus users to have a tool that allows them to view information from different places and services, with particular emphasis on their geographical location. It has been necessary to study the current technologies that allow implementing a geographic information system, leading to the proposed system, which consists of a set of computer resources (hardware and software) that will allow campus users to obtain information and location of campus components from their mobile phones. Following an analysis of the requirements and functionalities that the system should have, the project involved the design and implementation of the system . The information will be stored and available on a computer server accessible to campus users. Accordingly, during the project, it was necessary to create a data model based on campus data and load this data in a database. All this has been done by Smallword Core 4.2 software product. In addition, it has also been necessary to deploy a server software that allows users to query the data from their phones via WIFI or Internet, the product used for this purpose has been Smallworld Geospatial Server 4.2 . To carry out the consultations have used the services WMS (Web Map Service) and WFS (Web Feature Service) defined by the OGC (Open Geospatial Consortium). These services are tailored to the geographic information retrieval. The system also consists of an application for mobile devices with Android operating system, which allows users to query and display geographic information related to the campus. This application has been designed and programmed over the project. For the realization of this project has also been necessary to study the budget that would be a real system implementation and the maintenance that would have the system updated. Finally, the project includes a brief description of future technologies that could improve the system's functionality: augmented reality and positioning inside the buildings.
Resumo:
Parte de la investigación biomédica actual se encuentra centrada en el análisis de datos heterogéneos. Estos datos pueden tener distinto origen, estructura, y semántica. Gran cantidad de datos de interés para los investigadores se encuentran en bases de datos públicas, que recogen información de distintas fuentes y la ponen a disposición de la comunidad de forma gratuita. Para homogeneizar estas fuentes de datos públicas con otras de origen privado, existen diversas herramientas y técnicas que permiten automatizar los procesos de homogeneización de datos heterogéneos. El Grupo de Informática Biomédica (GIB) [1] de la Universidad Politécnica de Madrid colabora en el proyecto europeo P-medicine [2], cuya finalidad reside en el desarrollo de una infraestructura que facilite la evolución de los procedimientos médicos actuales hacia la medicina personalizada. Una de las tareas enmarcadas en el proyecto P-medicine que tiene asignado el grupo consiste en elaborar herramientas que ayuden a usuarios en el proceso de integración de datos contenidos en fuentes de información heterogéneas. Algunas de estas fuentes de información son bases de datos públicas de ámbito biomédico contenidas en la plataforma NCBI [3] (National Center for Biotechnology Information). Una de las herramientas que el grupo desarrolla para integrar fuentes de datos es Ontology Annotator. En una de sus fases, la labor del usuario consiste en recuperar información de una base de datos pública y seleccionar de forma manual los resultados relevantes. Para automatizar el proceso de búsqueda y selección de resultados relevantes, por un lado existe un gran interés en conseguir generar consultas que guíen hacia resultados lo más precisos y exactos como sea posible, por otro lado, existe un gran interés en extraer información relevante de elevadas cantidades de documentos, lo cual requiere de sistemas que analicen y ponderen los datos que caracterizan a los mismos. En el campo informático de la inteligencia artificial, dentro de la rama de la recuperación de la información, existen diversos estudios acerca de la expansión de consultas a partir de retroalimentación relevante que podrían ser de gran utilidad para dar solución a la cuestión. Estos estudios se centran en técnicas para reformular o expandir la consulta inicial utilizando como realimentación los resultados que en una primera instancia fueron relevantes para el usuario, de forma que el nuevo conjunto de resultados tenga mayor proximidad con los que el usuario realmente desea. El objetivo de este trabajo de fin de grado consiste en el estudio, implementación y experimentación de métodos que automaticen el proceso de extracción de información trascendente de documentos, utilizándola para expandir o reformular consultas. De esta forma se pretende mejorar la precisión y el ranking de los resultados asociados. Dichos métodos serán integrados en la herramienta Ontology Annotator y enfocados a la fuente de datos de PubMed [4].---ABSTRACT---Part of the current biomedical research is focused on the analysis of heterogeneous data. These data may have different origin, structure and semantics. A big quantity of interesting data is contained in public databases which gather information from different sources and make it open and free to be used by the community. In order to homogenize thise sources of public data with others which origin is private, there are some tools and techniques that allow automating the processes of integration heterogeneous data. The biomedical informatics group of the Universidad Politécnica de Madrid cooperates with the European project P-medicine which main purpose is to create an infrastructure and models to facilitate the transition from current medical practice to personalized medicine. One of the tasks of the project that the group is in charge of consists on the development of tools that will help users in the process of integrating data from diverse sources. Some of the sources are biomedical public data bases from the NCBI platform (National Center for Biotechnology Information). One of the tools in which the group is currently working on for the integration of data sources is called the Ontology Annotator. In this tool there is a phase in which the user has to retrieve information from a public data base and select the relevant data contained in it manually. For automating the process of searching and selecting data on the one hand, there is an interest in automatically generating queries that guide towards the more precise results as possible. On the other hand, there is an interest on retrieve relevant information from large quantities of documents. The solution requires systems that analyze and weigh the data allowing the localization of the relevant items. In the computer science field of the artificial intelligence, in the branch of information retrieval there are diverse studies about the query expansion from relevance feedback that could be used to solve the problem. The main purpose of this studies is to obtain a set of results that is the closer as possible to the information that the user really wants to retrieve. In order to reach this purpose different techniques are used to reformulate or expand the initial query using a feedback the results that where relevant for the user, with this method, the new set of results will have more proximity with the ones that the user really desires. The goal of this final dissertation project consists on the study, implementation and experimentation of methods that automate the process of extraction of relevant information from documents using this information to expand queries. This way, the precision and the ranking of the results associated will be improved. These methods will be integrated in the Ontology Annotator tool and will focus on the PubMed data source.
Resumo:
En esta tesis se estudia la representación, modelado y comparación de colecciones mediante el uso de ontologías en el ámbito de la Web Semántica. Las colecciones, entendidas como agrupaciones de objetos o elementos con entidad propia, son construcciones que aparecen frecuentemente en prácticamente todos los dominios del mundo real, y por tanto, es imprescindible disponer de conceptualizaciones de estas estructuras abstractas y de representaciones de estas conceptualizaciones en los sistemas informáticos, que definan adecuadamente su semántica. Mientras que en muchos ámbitos de la Informática y la Inteligencia Artificial, como por ejemplo la programación, las bases de datos o la recuperación de información, las colecciones han sido ampliamente estudiadas y se han desarrollado representaciones que responden a multitud de conceptualizaciones, en el ámbito de la Web Semántica, sin embargo, su estudio ha sido bastante limitado. De hecho hasta la fecha existen pocas propuestas de representación de colecciones mediante ontologías, y las que hay sólo cubren algunos tipos de colecciones y presentan importantes limitaciones. Esto impide la representación adecuada de colecciones y dificulta otras tareas comunes como la comparación de colecciones, algo crítico en operaciones habituales como las búsquedas semánticas o el enlazado de datos en la Web Semántica. Para solventar este problema esta tesis hace una propuesta de modelización de colecciones basada en una nueva clasificación de colecciones de acuerdo a sus características estructurales (homogeneidad, unicidad, orden y cardinalidad). Esta clasificación permite definir una taxonomía con hasta 16 tipos de colecciones distintas. Entre otras ventajas, esta nueva clasificación permite aprovechar la semántica de las propiedades estructurales de cada tipo de colección para realizar comparaciones utilizando las funciones de similitud y disimilitud más apropiadas. De este modo, la tesis desarrolla además un nuevo catálogo de funciones de similitud para las distintas colecciones, donde se han recogido las funciones de (di)similitud más conocidas y también algunas nuevas. Esta propuesta se ha implementado mediante dos ontologías paralelas, la ontología E-Collections, que representa los distintos tipos de colecciones de la taxonomía y su axiomática, y la ontología SIMEON (Similarity Measures Ontology) que representa los tipos de funciones de (di)similitud para cada tipo de colección. Gracias a estas ontologías, para comparar dos colecciones, una vez representadas como instancias de la clase más apropiada de la ontología E-Collections, automáticamente se sabe qué funciones de (di)similitud de la ontología SIMEON pueden utilizarse para su comparación. Abstract This thesis studies the representation, modeling and comparison of collections in the Semantic Web using ontologies. Collections, understood as groups of objects or elements with their own identities, are constructions that appear frequently in almost all areas of the real world. Therefore, it is essential to have conceptualizations of these abstract structures and representations of these conceptualizations in computer systems, that define their semantic properly. While in many areas of Computer Science and Artificial Intelligence, such as Programming, Databases or Information Retrieval, the collections have been extensively studied and there are representations that match many conceptualizations, in the field Semantic Web, however, their study has been quite limited. In fact, there are few representations of collections using ontologies so far, and they only cover some types of collections and have important limitations. This hinders a proper representation of collections and other common tasks like comparing collections, something critical in usual operations such as semantic search or linking data on the Semantic Web. To solve this problem this thesis makes a proposal for modelling collections based on a new classification of collections according to their structural characteristics (homogeneity, uniqueness, order and cardinality). This classification allows to define a taxonomy with up to 16 different types of collections. Among other advantages, this new classification can leverage the semantics of the structural properties of each type of collection to make comparisons using the most appropriate (dis)similarity functions. Thus, the thesis also develops a new catalog of similarity functions for the different types of collections. This catalog contains the most common (dis)similarity functions as well as new ones. This proposal is implemented through two parallel ontologies, the E-Collections ontology that represents the different types of collections in the taxonomy and their axiomatic, and the SIMEON ontology (Similarity Measures Ontology) that represents the types of (dis)similarity functions for each type of collection. Thanks to these ontologies, to compare two collections, once represented as instances of the appropriate class of E-Collections ontology, we can know automatically which (dis)similarity functions of the SIMEON ontology are suitable for the comparison. Finally, the feasibility and usefulness of this modeling and comparison of collections proposal is proved in the field of oenology, applying both E-Collections and SIMEON ontologies to the representation and comparison of wines with the E-Baco ontology.
Resumo:
We present an evaluation of a spoken language dialogue system with a module for the management of userrelated information, stored as user preferences and privileges. The flexibility of our dialogue management approach, based on Bayesian Networks (BN), together with a contextual information module, which performs different strategies for handling such information, allows us to include user information as a new level into the Context Manager hierarchy. We propose a set of objective and subjective metrics to measure the relevance of the different contextual information sources. The analysis of our evaluation scenarios shows that the relevance of the short-term information (i.e. the system status) remains pretty stable throughout the dialogue, whereas the dialogue history and the user profile (i.e. the middle-term and the long-term information, respectively) play a complementary role, evolving their usefulness as the dialogue evolves.
Resumo:
Recently, the Semantic Web has experienced significant advancements in standards and techniques, as well as in the amount of semantic information available online. Nevertheless, mechanisms are still needed to automatically reconcile information when it is expressed in different natural languages on the Web of Data, in order to improve the access to semantic information across language barriers. In this context several challenges arise [1], such as: (i) ontology translation/localization, (ii) cross-lingual ontology mappings, (iii) representation of multilingual lexical information, and (iv) cross-lingual access and querying of linked data. In the following we will focus on the second challenge, which is the necessity of establishing, representing and storing cross-lingual links among semantic information on the Web. In fact, in a “truly” multilingual Semantic Web, semantic data with lexical representations in one natural language would be mapped to equivalent or related information in other languages, thus making navigation across multilingual information possible for software agents.
Resumo:
Recently, the Semantic Web has experienced signi�cant advancements in standards and techniques, as well as in the amount of semantic information available online. Even so, mechanisms are still needed to automatically reconcile semantic information when it is expressed in di�erent natural languages, so that access to Web information across language barriers can be improved. That requires developing techniques for discovering and representing cross-lingual links on the Web of Data. In this paper we explore the different dimensions of such a problem and reflect on possible avenues of research on that topic.
Resumo:
We present two approaches to cluster dialogue-based information obtained by the speech understanding module and the dialogue manager of a spoken dialogue system. The purpose is to estimate a language model related to each cluster, and use them to dynamically modify the model of the speech recognizer at each dialogue turn. In the first approach we build the cluster tree using local decisions based on a Maximum Normalized Mutual Information criterion. In the second one we take global decisions, based on the optimization of the global perplexity of the combination of the cluster-related LMs. Our experiments show a relative reduction of the word error rate of 15.17%, which helps to improve the performance of the understanding and the dialogue manager modules.
Resumo:
We present two approaches to cluster dialogue-based information obtained by the speech understanding module and the dialogue manager of a spoken dialogue system. The purpose is to estimate a language model related to each cluster, and use them to dynamically modify the model of the speech recognizer at each dialogue turn. In the first approach we build the cluster tree using local decisions based on a Maximum Normalized Mutual Information criterion. In the second one we take global decisions, based on the optimization of the global perplexity of the combination of the cluster-related LMs. Our experiments show a relative reduction of the word error rate of 15.17%, which helps to improve the performance of the understanding and the dialogue manager modules.
Resumo:
We present an approach to adapt dynamically the language models (LMs) used by a speech recognizer that is part of a spoken dialogue system. We have developed a grammar generation strategy that automatically adapts the LMs using the semantic information that the user provides (represented as dialogue concepts), together with the information regarding the intentions of the speaker (inferred by the dialogue manager, and represented as dialogue goals). We carry out the adaptation as a linear interpolation between a background LM, and one or more of the LMs associated to the dialogue elements (concepts or goals) addressed by the user. The interpolation weights between those models are automatically estimated on each dialogue turn, using measures such as the posterior probabilities of concepts and goals, estimated as part of the inference procedure to determine the actions to be carried out. We propose two approaches to handle the LMs related to concepts and goals. Whereas in the first one we estimate a LM for each one of them, in the second one we apply several clustering strategies to group together those elements that share some common properties, and estimate a LM for each cluster. Our evaluation shows how the system can estimate a dynamic model adapted to each dialogue turn, which helps to improve the performance of the speech recognition (up to a 14.82% of relative improvement), which leads to an improvement in both the language understanding and the dialogue management tasks.
Resumo:
This paper describes the application of language translation technologies for generating bus information in Spanish Sign Language (LSE: Lengua de Signos Española). In this work, two main systems have been developed: the first for translating text messages from information panels and the second for translating spoken Spanish into natural conversations at the information point of the bus company. Both systems are made up of a natural language translator (for converting a word sentence into a sequence of LSE signs), and a 3D avatar animation module (for playing back the signs). For the natural language translator, two technological approaches have been analyzed and integrated: an example-based strategy and a statistical translator. When translating spoken utterances, it is also necessary to incorporate a speech recognizer for decoding the spoken utterance into a word sequence, prior to the language translation module. This paper includes a detailed description of the field evaluation carried out in this domain. This evaluation has been carried out at the customer information office in Madrid involving both real bus company employees and deaf people. The evaluation includes objective measurements from the system and information from questionnaires. In the field evaluation, the whole translation presents an SER (Sign Error Rate) of less than 10% and a BLEU greater than 90%.
Resumo:
En los últimos años han surgido nuevos campos de las tecnologías de la información que exploran el tratamiento de la gran cantidad de datos digitales existentes y cómo transformarlos en conocimiento explícito. Las técnicas de Procesamiento del Lenguaje Natural (NLP) son capaces de extraer información de los textos digitales presentados en forma narrativa. Además, las técnicas de machine learning clasifican instancias o ejemplos en función de sus atributos, en distintas categorías, aprendiendo de otros previamente clasificados. Los textos clínicos son una gran fuente de información no estructurada; en consecuencia, información no explotada en su totalidad. Algunos términos usados en textos clínicos se encuentran en una situación de afirmación, negación, hipótesis o histórica. La detección de esta situación es necesaria para la estructuración de información, pero a su vez tiene una gran complejidad. Extrayendo características lingüísticas de los elementos, o tokens, de los textos mediante NLP; transformando estos tokens en instancias y las características en atributos, podemos mediante técnicas de machine learning clasificarlos con el objetivo de detectar si se encuentran afirmados, negados, hipotéticos o históricos. La selección de los atributos que cada token debe tener para su clasificación, así como la selección del algoritmo de machine learning utilizado son elementos cruciales para la clasificación. Son, de hecho, los elementos que componen el modelo de clasificación. Consecuentemente, este trabajo aborda el proceso de extracción de características, selección de atributos y selección del algoritmo de machine learning para la detección de la negación en textos clínicos en español. Se expone un modelo para la clasificación que, mediante el algoritmo J48 y 35 atributos obtenidos de características lingüísticas (morfológicas y sintácticas) y disparadores de negación, detecta si un token está negado en 465 frases provenientes de textos clínicos con un F-Score del 73%, una exhaustividad del 66% y una precisión del 81% con una validación cruzada de 10 iteraciones. ---ABSTRACT--- New information technologies have emerged in the recent years which explore the processing of the huge amount of existing digital data and its transformation into knowledge. Natural Language Processing (NLP) techniques are able to extract certain features from digital texts. Additionally, through machine learning techniques it is feasible to classify instances according to different categories, learning from others previously classified. Clinical texts contain great amount of unstructured data, therefore information not fully exploited. Some terms (tokens) in clinical texts appear in different situations such as affirmed, negated, hypothetic or historic. Detecting this situation is necessary for the structuring of this data, however not simple. It is possible to detect whether if a token is negated, affirmed, hypothetic or historic by extracting its linguistic features by NLP; transforming these tokens into instances, the features into attributes, and classifying these instances through machine learning techniques. Selecting the attributes each instance must have, and choosing the machine learning algorithm are crucial issues for the classification. In fact, these elements set the classification model. Consequently, this work approaches the features retrieval as well as the attributes and algorithm selection process used by machine learning techniques for the detection of negation in clinical texts in Spanish. We present a classification model which, through J48 algorithm and 35 attributes from linguistic features (morphologic and syntactic) and negation triggers, detects whether if a token is negated in 465 sentences from historical records, with a result of 73% FScore, 66% recall and 81% precision using a 10-fold cross-validation.