19 results for Ontologies (Information Retrieval)
at Universidad Politécnica de Madrid
Abstract:
The main goal of the bilingual and monolingual participation of the MIRACLE team in CLEF 2004 was to test the effect of combination approaches on information retrieval. The starting point was a set of basic components: stemming, transformation, filtering, generation of n-grams, weighting, and relevance feedback. Some of these basic components were used in different combinations and orders of application for document indexing and for query processing. A second-order combination was also tested, mainly by averaging or selective combination of the documents retrieved by different approaches for a particular query.
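To make the second-order combination concrete, here is a minimal sketch, assuming a CombSUM-style average of min-max-normalized scores; the abstract does not give MIRACLE's exact fusion formulas.

```python
# Hypothetical sketch of second-order run combination: fuse several retrieval
# runs for the same query by averaging min-max-normalized scores (CombSUM-style).

def normalize(run):
    """Min-max normalize a run's scores to [0, 1]. `run` maps doc_id -> score."""
    lo, hi = min(run.values()), max(run.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in run.items()}

def combine_runs(runs):
    """Average normalized scores over all runs (docs absent from a run score 0)."""
    runs = [normalize(r) for r in runs]
    docs = set().union(*runs)
    return {doc: sum(r.get(doc, 0.0) for r in runs) / len(runs) for doc in docs}

# Example: two runs over the same query, fused and re-ranked.
run_a = {"d1": 12.0, "d2": 7.5, "d3": 3.1}
run_b = {"d2": 0.9, "d3": 0.4, "d4": 0.8}
fused = combine_runs([run_a, run_b])
print(sorted(fused, key=fused.get, reverse=True))
```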
Abstract:
This paper describes the first set of experiments defined by the MIRACLE (Multilingual Information RetrievAl for the CLEf campaign) research group for some of the cross-language tasks defined by CLEF. These experiments combine different basic techniques, both linguistics-oriented and statistics-oriented, applied to the indexing and retrieval processes.
Abstract:
ImageCLEF is a pilot experiment run at CLEF 2003 for cross-language image retrieval using textual captions related to image contents. In this paper, we describe the participation of the MIRACLE research team (Multilingual Information RetrievAl at CLEF), detailing the different experiments and discussing their preliminary results.
Abstract:
In the beginning of the 90s, ontology development was more of an art than an engineering activity: ontology developers did not have clear guidelines on how to build ontologies, only some design criteria to follow. Work on principles, methods, and methodologies, together with supporting technologies and languages, turned ontology development into an engineering discipline, the so-called Ontology Engineering. Ontology Engineering refers to the set of activities that concern the ontology development process and the ontology life cycle, the methods and methodologies for building ontologies, and the tool suites and languages that support them. Thanks to the work done in the Ontology Engineering field, the development of ontologies within and between teams has increased and improved, as has the possibility of reusing ontologies in other developments and in final applications. Currently, ontologies are widely used in (a) Knowledge Engineering, Artificial Intelligence, and Computer Science, (b) applications related to knowledge management, natural language processing, e-commerce, intelligent information integration, information retrieval, database design and integration, bio-informatics, and education, and (c) the Semantic Web, the Semantic Grid, and the Linked Data initiative. In this paper, we provide an overview of Ontology Engineering, covering the most prominent and widely used methodologies, languages, and tools for building ontologies. In addition, we briefly discuss how all these elements can be used in the Linked Data initiative.
Abstract:
This thesis studies the representation, modeling, and comparison of collections in the Semantic Web using ontologies. Collections, understood as groups of objects or elements with their own identity, are constructions that appear frequently in almost all areas of the real world. It is therefore essential to have conceptualizations of these abstract structures, and representations of those conceptualizations in computer systems, that properly define their semantics. While in many areas of Computer Science and Artificial Intelligence, such as programming, databases, or information retrieval, collections have been extensively studied and representations exist that match many conceptualizations, in the field of the Semantic Web their study has been quite limited. In fact, few ontology-based representations of collections exist so far, and those that do cover only some types of collections and have important limitations. This hinders the proper representation of collections and complicates other common tasks such as comparing collections, something critical in usual operations such as semantic search or data linking on the Semantic Web. To solve this problem, this thesis proposes a model of collections based on a new classification of collections according to their structural characteristics (homogeneity, uniqueness, order, and cardinality). This classification defines a taxonomy of up to 16 different types of collections. Among other advantages, the new classification makes it possible to exploit the semantics of the structural properties of each type of collection and to perform comparisons using the most appropriate (dis)similarity functions. The thesis therefore also develops a new catalog of similarity functions for the different types of collections, containing the best-known (dis)similarity functions as well as some new ones. The proposal is implemented through two parallel ontologies: the E-Collections ontology, which represents the different types of collections in the taxonomy and their axiomatization, and the SIMEON ontology (Similarity Measures Ontology), which represents the types of (dis)similarity functions applicable to each type of collection. Thanks to these ontologies, once two collections are represented as instances of the most appropriate class of the E-Collections ontology, the (dis)similarity functions of the SIMEON ontology that are suitable for their comparison are known automatically. Finally, the feasibility and usefulness of this approach to modeling and comparing collections is demonstrated in the field of oenology, applying both the E-Collections and SIMEON ontologies to the representation and comparison of wines with the E-Baco ontology.
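As an illustration of the core idea (a sketch, not code from the thesis), the following shows how a collection's structural properties can select the applicable (dis)similarity function, in the spirit of the E-Collections/SIMEON pairing; the function and variable names are invented.

```python
# Hypothetical sketch: the structural type of a collection (here just "ordered"
# vs. "unordered with unique elements") determines which (dis)similarity
# function applies, as SIMEON does via ontology classes. Names are illustrative.

def jaccard(a, b):
    """Similarity for sets (unordered, unique elements): |A∩B| / |A∪B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def positional_overlap(a, b):
    """A simple order-aware similarity for sequences: fraction of positions
    (up to the longer length) holding the same element."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b)) if a or b else 1.0

def similarity(a, b, ordered):
    # Dispatch on the structural property of the collections being compared.
    return positional_overlap(a, b) if ordered else jaccard(a, b)

print(similarity(["tempranillo", "garnacha"], ["garnacha", "tempranillo"], ordered=False))  # 1.0
print(similarity(["tempranillo", "garnacha"], ["garnacha", "tempranillo"], ordered=True))   # 0.0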
Abstract:
The purpose of this thesis is the automatic construction of ontologies from texts, within the area known as Ontology Learning. This discipline aims to automate the construction of domain models from structured or unstructured information sources. It originated at the turn of the millennium, as a result of the exponential growth in the volume of information accessible on the Internet. Since most information on the web is presented in the form of text, ontology learning has focused on the analysis of this type of source, drawing over the years on very diverse techniques from areas such as Information Retrieval, Information Extraction, Summarization and, in general, areas related to natural language processing. The main contribution of this thesis is that, in contrast with the majority of current techniques, the proposed method does not analyze the surface syntactic structure of the language but explores its deep semantic level. Its objective, therefore, is to infer the domain model from the way the meanings of sentences are articulated in natural language. Since the deep semantic level is language-independent, the method can operate in multilingual scenarios, where it is necessary to combine information from texts in different languages. To access this level of language, the method uses the interlingua model. These formalisms, which come from the area of machine translation, make it possible to represent the meaning of sentences independently of the language. In particular, UNL (Universal Networking Language) is used, considered to be the only standardized general-purpose interlingua. The approach continues previous work by the author and by the research group to which he belongs, which studied how to use the interlingua model in the areas of multilingual information extraction and retrieval. Essentially, the procedure defined by the method tries to identify, in the UNL representation of texts, certain regularities from which the pieces of the domain ontology can be deduced. Since UNL is a formalism based on semantic networks, these regularities take the form of graphs, generalized into structures called linguistic patterns. On the other hand, UNL still preserves certain discourse-cohesion mechanisms inherited from natural languages, such as anaphora. In order to improve the understanding of expressions, the method provides, as another significant contribution, an algorithm for resolving pronominal anaphora within the interlingua model, limited to third-person personal pronouns whose antecedent is a proper noun. The proposed method rests on a formal framework, built by adapting some definitions from graph theory and introducing new ones, in order to accommodate the notions of UNL expression and linguistic pattern, as well as the pattern-matching operations that are the basis of the method's processes. Both the formal framework and all the processes defined by the method have been implemented, and the experimentation was carried out by applying them to an article from the UNESCO "Encyclopedia of Life Support Systems" (EOLSS) collection.
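To illustrate the pattern-matching step (a sketch under assumed structures, not the thesis implementation), the following matches a toy linguistic pattern against a UNL-style semantic graph using networkx subgraph isomorphism; the graph and pattern are invented, while agt and obj are standard UNL relation labels.

```python
# Minimal sketch of matching a linguistic pattern against a UNL-style semantic
# graph, using networkx subgraph isomorphism. The tiny graph and pattern below
# are invented for illustration; "agt" and "obj" are UNL relation labels.
import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

# Toy UNL expression: "the enzyme catalyzes the reaction".
expr = nx.DiGraph()
expr.add_edge("catalyze", "enzyme", rel="agt")    # agent of the verb
expr.add_edge("catalyze", "reaction", rel="obj")  # object of the verb

# Pattern: any node with both an agent and an object; "?" nodes are variables.
pattern = nx.DiGraph()
pattern.add_edge("?v", "?a", rel="agt")
pattern.add_edge("?v", "?o", rel="obj")

matcher = DiGraphMatcher(
    expr, pattern,
    edge_match=lambda e1, e2: e1["rel"] == e2["rel"],  # relations must agree
)
for mapping in matcher.subgraph_isomorphisms_iter():
    # mapping: expression node -> pattern variable, yielding a candidate
    # relation triple (enzyme, catalyze, reaction) for the ontology.
    print(mapping)
```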
Abstract:
This paper presents the 2006 MIRACLE team's approaches to the Ad-Hoc and Geographical Information Retrieval tasks. A first set of runs was obtained using a set of basic components. Then, by putting together special combinations of these runs, an extended set was obtained. With respect to previous campaigns, some improvements have been introduced in our system: an entity recognition prototype is integrated in our tokenization scheme, and the performance of our indexing and retrieval engine has been improved. For GeoCLEF, we tested retrieval using geo-entity and textual references separately, and then combined them with different approaches.
Abstract:
This paper presents the 2005 MIRACLE team's approach to Cross-Language Geographical Retrieval (GeoCLEF). The main goal of the GeoCLEF participation of the MIRACLE team was to test the effect that geographical information retrieval techniques have on information retrieval. The baseline approach is based on the development of named entity recognition and geospatial information retrieval tools, and on their combination with linguistic techniques to carry out indexing and retrieval tasks.
Abstract:
This paper presents the 2005 MIRACLE team's approach to the Ad-Hoc Information Retrieval tasks. The goal for this year's experiments was twofold: to continue testing the effect of combination approaches on information retrieval tasks, and to improve our basic processing and indexing tools, adapting them to new languages with unusual encoding schemes. The starting point was a set of basic components: stemming, transforming, filtering, proper noun extraction, paragraph extraction, and pseudo-relevance feedback. Some of these basic components were used in different combinations and orders of application for document indexing and for query processing. Second-order combinations were also tested, by averaging or selective combination of the documents retrieved by different approaches for a particular query. In the multilingual track, we concentrated our work on the process of merging the results of monolingual runs to get the overall multilingual result, relying on available translations. In both cross-lingual tracks, we used available translation resources, and in some cases a combination approach.
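As an illustration of one simple merging strategy (the abstract does not specify MIRACLE's exact algorithm), the following sketch interleaves per-language rankings round-robin into a single multilingual result.

```python
# Hypothetical sketch of merging monolingual runs into one multilingual result
# by round-robin interleaving of the per-language rankings.

def round_robin_merge(runs, k=10):
    """Interleave ranked document lists (one per language) until k docs."""
    merged, seen = [], set()
    for rank in range(max(map(len, runs))):
        for run in runs:
            if rank < len(run) and run[rank] not in seen:
                merged.append(run[rank])
                seen.add(run[rank])
            if len(merged) == k:
                return merged
    return merged

english = ["en_12", "en_07", "en_33"]
spanish = ["es_04", "es_19"]
french  = ["fr_02", "fr_11", "fr_05"]
print(round_robin_merge([english, spanish, french], k=6))
```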
Abstract:
This paper describes an infrastructure for the automated evaluation of semantic technologies and, in particular, semantic search technologies. For this purpose, we present an evaluation framework which follows a service-oriented approach for evaluating semantic technologies and uses the Business Process Execution Language (BPEL) to define evaluation workflows that can be executed by process engines. This framework supports a variety of evaluations from different semantic areas, including search, and is extensible to new evaluations. We show how BPEL addresses this diversity as well as how it is used to solve specific challenges such as heterogeneity, error handling, and reuse.
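BPEL workflows are defined in XML; as a rough analogue (not the framework's actual code), the following Python sketch shows the shape of such an evaluation workflow: a reusable sequence of service invocations with a fault-handler-like error path. All step names are invented.

```python
# Rough Python analogue of a BPEL-style evaluation workflow: a reusable
# sequence of "service" steps sharing a context, with centralized error
# handling. Step names and data are invented for illustration.

def load_dataset(ctx):
    ctx["queries"] = ["q1", "q2"]

def run_search_tool(ctx):
    ctx["results"] = {q: [f"{q}_doc1"] for q in ctx["queries"]}

def compute_metrics(ctx):
    ctx["score"] = len(ctx["results"]) / len(ctx["queries"])

WORKFLOW = [load_dataset, run_search_tool, compute_metrics]  # reusable sequence

def execute(workflow):
    ctx = {}
    for step in workflow:
        try:
            step(ctx)                      # invoke the next "service"
        except Exception as err:           # fault-handler analogue
            print(f"step {step.__name__} failed: {err}")
            break
    return ctx

print(execute(WORKFLOW).get("score"))
```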
Abstract:
Social networks are highly relevant today: not only do they occupy a large part of people's daily lives, they also serve millions of businesses, for advertising among other purposes. The business world has joined the social network phenomenon. The release of public APIs by some social networks has enabled the development of applications of all kinds, with different objectives, such as this project. The project originated in Ericsson's interest in studying the Google+ API and in suggestions for adding value for telecommunications companies. It also complements the reference material available at Ericsson and two other projects on information retrieval from social networks, adding a number of user options to the application. To this end, we analyzed what can be obtained from social networks, mainly Twitter and Google+, and built a working example. The first part of the project was a theoretical study of the origin of social networks, their development, and their current state, analyzing the major existing social networks and giving an overview of each. A state of the art was also compiled on a number of websites dedicated to exploiting this information available online. Subsequently, of all the social networks with available APIs, Google+ was chosen because it is a new social network, still to be explored and improved, and Twitter was chosen for the range of options and data that can be obtained from it. The APIs of both were studied, and with the information obtained a prototype application was built that provides a set of useful functions based on the data from these social networks. Finally, a simple interface was built through which the user can access account data as if using Twitter or Google+ directly; in addition, with the Twitter data the application can perform an advanced search with alerts, run a sentiment analysis, show the most retweeted posts among the user's followers, and track and compare what is being said about two given topics. The project aims to provide a general idea of everything related to social networks, the applications available to work with them, the Twitter and Google+ APIs, and what can be obtained from them.
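The abstract mentions sentiment analysis of tweets but not its implementation; a minimal lexicon-based sketch, with an invented toy lexicon, could look like this.

```python
# Minimal lexicon-based sentiment sketch. The toy lexicon is invented; a real
# application would fetch tweets via the Twitter API and use a full lexicon.

POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(tweet):
    """Score a tweet as (#positive - #negative) words; sign gives the label."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweets = [
    "I love the new release, great job!",
    "Terrible outage again, I hate this service",
]
for t in tweets:
    print(sentiment(t), "-", t)
```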
Abstract:
Collaborative filtering recommender systems contribute to alleviating the problem of information overload that exists on the Internet as a result of the mass use of Web 2.0 applications. The use of an adequate similarity measure becomes a determining factor in the quality of the prediction and recommendation results of the recommender system, as well as in its performance. In this paper, we present a memory-based collaborative filtering similarity measure that provides extremely high-quality and balanced results, complemented by a low processing time (high performance), similar to that required by traditional similarity metrics. The experiments have been carried out on the MovieLens and Netflix databases, using a representative set of information retrieval quality measures.
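The abstract does not give the paper's similarity formula; the following sketch shows the standard memory-based pipeline such a measure plugs into, using user-user cosine similarity over co-rated items and a similarity-weighted rating prediction.

```python
# Standard memory-based collaborative filtering pipeline (not the paper's own
# measure): user-user cosine similarity over co-rated items, then a
# similarity-weighted prediction of an unseen rating. Data is a toy example.
import numpy as np

# Toy ratings matrix (users x items), 0 = not rated.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def cosine(u, v):
    """Cosine similarity over items both users rated."""
    mask = (u > 0) & (v > 0)
    if not mask.any():
        return 0.0
    a, b = u[mask], v[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(R, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    sims = np.array([cosine(R[user], R[v]) for v in range(len(R))])
    rated = R[:, item] > 0
    rated[user] = False          # exclude the target user
    denom = np.abs(sims[rated]).sum()
    return float(sims[rated] @ R[rated, item] / denom) if denom else 0.0

print(predict(R, user=0, item=2))  # predicted rating of item 2 for user 0
```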
Abstract:
Most empirical disciplines promote the reuse and sharing of datasets, as it leads to greater possibility of replication. While this is increasingly the case in Empirical Software Engineering, some of the most popular bug-fix datasets are now known to be biased. This raises two significant concerns: first, that sample bias may lead to underperforming prediction models, and second, that the external validity of studies based on biased datasets may be suspect. This issue has raised considerable consternation in the ESE literature in recent years. However, there is a confounding factor in these datasets that has not been examined carefully: size. Biased datasets sample only some of the data that could be sampled, and do so in a biased fashion; but biased samples could be smaller or larger. Smaller datasets in general provide less reliable bases for estimating models, and thus could lead to inferior model performance. In this setting, we ask: what affects performance more, bias or size? We conduct a detailed, large-scale meta-analysis, using simulated datasets sampled with bias from a high-quality dataset which is relatively free of bias. Our results suggest that size always matters just as much as bias direction, and in fact much more than bias direction when considering information retrieval measures such as AUC and F-score. This indicates that, at least for prediction models, even when dealing with sampling bias, simply finding larger samples can sometimes be sufficient. Our analysis also exposes the complexity of the bias issue and raises further issues to be explored in the future.
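The following is a hedged sketch of the kind of simulation described, not the paper's exact protocol: samples of varying size, with and without class-correlated bias, are drawn from a clean synthetic dataset, a classifier is trained, and AUC and F-score are compared.

```python
# Sketch of the experiment's shape: draw samples of varying size, with and
# without class-correlated bias, from a "clean" dataset, train a classifier,
# and compare AUC and F-score. Not the paper's exact protocol.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def sample(X, y, n, bias, rng):
    """Draw n examples; `bias` > 1 over-samples the positive class."""
    w = np.where(y == 1, bias, 1.0)
    idx = rng.choice(len(y), size=n, replace=False, p=w / w.sum())
    return X[idx], y[idx]

rng = np.random.default_rng(0)
for n in (200, 2000):
    for bias in (1.0, 5.0):
        Xs, ys = sample(X_pool, y_pool, n, bias, rng)
        clf = LogisticRegression(max_iter=1000).fit(Xs, ys)
        proba = clf.predict_proba(X_test)[:, 1]
        print(f"n={n:5d} bias={bias}: AUC={roc_auc_score(y_test, proba):.3f} "
              f"F1={f1_score(y_test, clf.predict(X_test)):.3f}")
```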
Abstract:
The proliferation of multimedia production in all areas is giving rise to new paradigms of visual information retrieval. Among these, one of the most significant is that of Visual Information Retrieval Systems (VIRS), in which one of the most representative tasks is ranking a population of images according to their similarity to a given example. This work presents an original proposal for evaluating the similarity between two images, based on extending the concept of saliency from the image space to the feature space in order to establish the relevance of each component of the feature vector. To this end, methodologies are introduced for quantifying the saliency of individual feature values, for combining these quantifications when comparing two images, and, finally, for establishing the aforementioned per-feature weighting based on this combination. The results of evaluating this proposal on a content-based image retrieval task are also presented, in comparison with those obtained with the Euclidean distance. This comparison is carried out by having volunteers assess both sets of results.
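The abstract does not give the saliency formulas; as an illustrative sketch, the following treats a feature value's saliency as its absolute z-score within the image population and uses it to weight a Euclidean distance.

```python
# Hedged sketch of feature-space saliency weighting: here the "saliency" of a
# feature value is simply how atypical it is relative to the population
# (|z-score|), and image comparison is a saliency-weighted Euclidean distance.
import numpy as np

def saliency(value, population):
    """Illustrative saliency: |z-score| of the value within the population."""
    mu, sigma = population.mean(), population.std() or 1.0
    return abs(value - mu) / sigma

def weighted_distance(x, y, features):
    """Euclidean distance with per-component saliency weights.
    `features[j]` holds the j-th feature's values over the whole image set."""
    w = np.array([(saliency(x[j], features[j]) + saliency(y[j], features[j])) / 2
                  for j in range(len(x))])
    w /= w.sum()
    return float(np.sqrt((w * (x - y) ** 2).sum()))

rng = np.random.default_rng(1)
feats = rng.normal(size=(4, 100))   # 4 features over 100 images
x, y = feats[:, 0], feats[:, 1]     # two images' feature vectors
print(weighted_distance(x, y, feats))
```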
Abstract:
Over the last few decades, the ever-increasing output of scientific publications has led to new challenges in keeping up to date with the literature. In the biomedical area, this growth has introduced new requirements for professionals, e.g., physicians, who have to locate the exact papers that they need for their clinical and research work amongst a huge number of publications. Against this backdrop, novel information retrieval methods are even more necessary. While web search engines are widespread in many areas, facilitating access to all kinds of information, additional tools are required to automatically link information retrieved from these engines to specific biomedical applications. In the case of clinical environments, this also means considering aspects such as patient data security and confidentiality or structured contents, e.g., electronic health records (EHRs). In this scenario, we have developed a new tool to facilitate query building to retrieve scientific literature related to EHRs. Results: We have developed CDAPubMed, an open-source web browser extension to integrate EHR features in biomedical literature retrieval approaches. Clinical users can use CDAPubMed to: (i) load patient clinical documents, i.e., EHRs based on the Health Level 7 Clinical Document Architecture Standard (HL7-CDA), (ii) identify relevant terms for scientific literature search in these documents, i.e., Medical Subject Headings (MeSH), automatically driven by the CDAPubMed configuration, which advanced users can optimize to adapt to each specific situation, and (iii) generate and launch literature search queries to a major search engine, i.e., PubMed, to retrieve citations related to the EHR under examination. Conclusions: CDAPubMed is a platform-independent tool designed to facilitate literature searching using keywords contained in specific EHRs. CDAPubMed is visually integrated, as an extension of a widespread web browser, within the standard PubMed interface. It has been tested on a public dataset of HL7-CDA documents, returning significantly fewer citations, since queries are focused on characteristics identified within the EHR. For instance, compared with the more than 200,000 citations retrieved by the query "breast neoplasm", fewer than ten citations were retrieved when ten patient features were added using CDAPubMed. This is an open-source tool that can be freely used for non-profit purposes and integrated with other existing systems.
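As an illustration of the kind of query CDAPubMed builds (this is not CDAPubMed's code), the following sketch ANDs MeSH terms into a PubMed search via the public NCBI E-utilities esearch endpoint; the example terms are invented.

```python
# Sketch of the query-building idea: MeSH terms extracted from an EHR are
# ANDed into one PubMed search. The NCBI E-utilities "esearch" endpoint is
# real and public; the MeSH terms below are invented examples.
import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(mesh_terms):
    """Return how many PubMed citations match all given MeSH terms."""
    term = " AND ".join(f'"{t}"[MeSH Terms]' for t in mesh_terms)
    url = ESEARCH + "?" + urllib.parse.urlencode(
        {"db": "pubmed", "term": term, "retmode": "json"})
    with urllib.request.urlopen(url) as resp:
        return int(json.load(resp)["esearchresult"]["count"])

# A broad query versus one narrowed with patient features from the EHR.
print(pubmed_count(["breast neoplasms"]))
print(pubmed_count(["breast neoplasms", "tamoxifen", "postmenopause"]))
```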