14 results for Open source information retrieval
at Universidad de Alicante
Abstract:
In this paper we explore the use of semantic classes in an existing information retrieval system in order to improve its results. We use two different ontologies of semantic classes (WordNet Domains and Basic Level Concepts) to re-rank the retrieved documents and obtain better recall and precision. Finally, we implement a new method for weighting the expanded terms that takes into account the weights of the original query terms and their WordNet relations to the new terms, which have been shown to improve the results. These approaches were evaluated in the CLEF Robust-WSD Task, obtaining an improvement of 1.8% in GMAP for the semantic classes approach and 10% in MAP for the WordNet term weighting approach.
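The term weighting idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption: the function name `weight_expanded_terms`, the relation labels, and the damping factors in `RELATION_FACTOR` are hypothetical. Each expanded term simply inherits the weight of the query term it came from, scaled by a factor that depends on the WordNet relation linking them.

```python
# Hypothetical damping factors per WordNet relation (not taken from the paper).
RELATION_FACTOR = {"synonym": 0.8, "hypernym": 0.5, "hyponym": 0.4}


def weight_expanded_terms(query_weights, expansions):
    """Return weights for the original query terms plus the expanded terms.

    query_weights: dict mapping each original query term to its weight.
    expansions: iterable of (original_term, new_term, relation) triples.
    """
    weighted = dict(query_weights)
    for original, new_term, relation in expansions:
        if new_term in query_weights:
            # Never override the weight of an original query term.
            continue
        candidate = query_weights[original] * RELATION_FACTOR[relation]
        # If several query terms lead to the same new term, keep the best weight.
        weighted[new_term] = max(weighted.get(new_term, 0.0), candidate)
    return weighted


query = {"car": 1.0, "engine": 0.6}
expansions = [
    ("car", "automobile", "synonym"),
    ("car", "vehicle", "hypernym"),
    ("engine", "vehicle", "hypernym"),
    ("engine", "car", "hypernym"),
]
weights = weight_expanded_terms(query, expansions)
```

Keeping the maximum over competing derivations is one possible design choice; summing the contributions would be another, and the abstract does not say which (if either) the authors used.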
Abstract:
Nowadays there is a large amount of biomedical literature that uses complex nouns and acronyms for biological entities, which complicates the task of retrieving specific information. The Genomics Track addresses this problem, and this paper describes the approach we used to take part in this track at TREC 2007. As this was our first participation in the track, we configured a new system consisting of the following distinct parts: preprocessing, passage generation, document retrieval, and extraction of the passage containing the answer. We want to call special attention to the textual retrieval system used, which was developed by the University of Alicante. After adapting the resources for this purpose, our system obtained precision results above the mean and median of the 66 official runs for Document, Aspect and Passage2 MAP; for Passage MAP, our results are close to the median and mean values. We want to emphasize that we obtained these results without incorporating domain-specific information about the track. In the future, we would like to further develop our system in this direction.
Abstract:
In this paper we present a complete system for the treatment of both geographical and temporal dimensions in text and its application to information retrieval. This system has been evaluated in the GeoTime task of both the 8th and 9th NTCIR workshops, held in 2010 and 2011 respectively, making it possible to compare the system to contemporary approaches to the topic. In order to participate in this task we have added the temporal dimension to our GIR system. The system proposed here has a modular architecture so that features can be added or modified easily. In developing this system, we have followed a QA-based approach and used multiple search engines to improve system performance.
Abstract:
Automatic Text Summarization has been shown to be useful for Natural Language Processing tasks such as Question Answering or Text Classification, and for related fields of computer science such as Information Retrieval. Since Geographical Information Retrieval can be considered an extension of the Information Retrieval field, the generation of summaries could be integrated into these systems as an intermediate stage, with the purpose of reducing document length. In this manner, the access time for information searching would be improved, while relevant documents would still be retrieved. Therefore, in this paper we propose the generation of two types of summaries (generic and geographical), applying several compression rates in order to evaluate their effectiveness in the Geographical Information Retrieval task. The evaluation has been carried out using GeoCLEF as the evaluation framework and following an Information Retrieval perspective, without considering the geo-reranking phase commonly used in these systems. Although single-document summarization has not performed well in general, the slight improvements obtained for some types of the proposed summaries, particularly those based on geographical information, lead us to believe that the integration of Text Summarization with Geographical Information Retrieval may be beneficial. Consequently, the experimental set-up developed in this research work serves as a basis for further investigation in this field.
Abstract:
Information Retrieval systems normally have to work with rather heterogeneous sources, such as Web sites or documents produced by Optical Character Recognition tools. The correct conversion of these sources into flat text files is not a trivial task, since noise may easily be introduced as a result of spelling or typesetting errors. Interestingly, this is not a great drawback when the corpus is sufficiently large, since redundancy helps to overcome noise problems. However, noise becomes a serious problem in restricted-domain Information Retrieval, especially when the corpus is small and has little or no redundancy. This paper devises an approach that adds noise tolerance to Information Retrieval systems. A set of experiments carried out in the agricultural domain demonstrates the effectiveness of the approach presented.
Abstract:
Machine vision is an important subject in computer science and engineering degrees. For laboratory experimentation, it is desirable to have a complete and easy-to-use tool. In this work we present a Java library oriented to teaching computer vision. We have designed and built the library from scratch, with an emphasis on readability and understanding rather than on efficiency. However, the library can also be used for research purposes. JavaVis is an open source Java library oriented to the teaching of Computer Vision. It consists of a framework with several features that meet the demands of teaching. It has been designed to be easy to use: the user does not have to deal with internal structures or the graphical interface, and a student who needs to add a new algorithm can do so easily. After sketching the library, we focus on the experience students gain from using it in several computer vision courses. Our main goal is to find out whether the students understand what they are doing, that is, how much the library helps them grasp the basic concepts of computer vision. Over the last four years we have conducted surveys to assess how much the students have improved their skills by using this library.
Abstract:
The field of natural language processing (NLP) has grown considerably in recent years; its research areas include information retrieval and extraction, data mining, machine translation, question answering systems, automatic summarization, and sentiment analysis, among others. This article presents concepts and some tools intended to contribute to the understanding of text processing with NLP techniques, with the aim of extracting relevant information that can be used in a wide range of applications. Automatic classifiers can be developed to categorize documents and recommend tags; these classifiers must be platform-independent, easily customizable so that they can be integrated into different projects, and able to learn from examples. This article introduces these classification algorithms, analyzes some currently available open source tools for carrying out these tasks, and compares several implementations using the F-measure to evaluate the classifiers.
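As a reminder of how the F-measure used in this comparison is computed, here is a small sketch based on the standard definition in terms of precision and recall. The function name `f_measure` and the example counts are illustrative assumptions; the abstract does not state which value of beta the authors used, so the balanced F1 (beta = 1) is taken as the default.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """Standard F-beta score computed from raw classification counts."""
    if true_pos == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts: 8 correct labels, 2 spurious labels, 8 missed labels,
# giving precision 0.8 and recall 0.5.
score = f_measure(true_pos=8, false_pos=2, false_neg=8)
```

With beta = 1 the score is the harmonic mean of precision and recall, so a classifier cannot score well by maximizing only one of the two.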
Abstract:
Evacuation route planning is a fundamental task in building engineering projects. Safety regulations are established so that, in an emergency, all occupants are led out of a building to a safe place in time. As an example, the Spanish building code requires the planning of evacuation routes for large and, usually, public buildings. Engineers often plan these routes on single building projects, repeatedly assigning clusters of rooms to each emergency exit in a trial-and-error process. But problems may arise for a building complex, where changes in layout and use make visual analysis cumbersome and sometimes unfeasible. This problem could be solved by using well-known spatial analysis techniques, implemented as specialized software able to partially emulate the engineer's reasoning. In this paper we propose and test an easily reproducible methodology that makes use of free and open source software components to solve a case study. We ran a complete test on a building floor at the University of Alicante (Spain). This institution offers a web service (WFS) that allows retrieval of 2D geometries for any building within its campus. We demonstrate how geospatial technologies and computational geometry algorithms can be used to automate the creation and optimization of evacuation routes. In our case study, the engineers’ task is to verify that the load on each emergency exit does not exceed the capacity specified by Spain’s current regulations. Using Dijkstra’s algorithm, we obtain the shortest paths from every room to the most appropriate emergency exit. Once these paths are calculated, engineers can run simulations and validate different cluster configurations based on path statistics. The techniques and tools applied in this research would be helpful in the design and risk management phases of any complex building project.
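The room-to-exit assignment described in this abstract can be sketched as follows. This is not the authors' implementation: it assumes that "most appropriate" simply means "nearest", represents the floor plan as a hypothetical weighted graph (node names and distances are invented), and runs Dijkstra's algorithm once from all exits simultaneously instead of once per room.

```python
import heapq
from collections import Counter


def nearest_exit(graph, exits):
    """Multi-source Dijkstra: distance to, and identity of, the closest exit.

    graph: dict mapping node -> {neighbour: walking distance}, undirected,
           so every edge must be listed in both directions.
    exits: iterable of nodes that are emergency exits.
    """
    dist = {e: 0.0 for e in exits}
    assigned = {e: e for e in exits}
    heap = [(0.0, e, e) for e in exits]
    heapq.heapify(heap)
    while heap:
        d, node, exit_node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, length in graph.get(node, {}).items():
            nd = d + length
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                assigned[neighbour] = exit_node
                heapq.heappush(heap, (nd, neighbour, exit_node))
    return dist, assigned


# Hypothetical floor: two exits (E1, E2) and three rooms (A, B, C).
floor = {
    "E1": {"A": 5.0},
    "A": {"E1": 5.0, "B": 4.0},
    "B": {"A": 4.0, "E2": 3.0, "C": 2.0},
    "C": {"B": 2.0},
    "E2": {"B": 3.0},
}
dist, assigned = nearest_exit(floor, ["E1", "E2"])

# Rooms per exit, the kind of path statistic an engineer could check
# against the capacity allowed for each exit.
rooms_per_exit = Counter(assigned[r] for r in ("A", "B", "C"))
```

Changing the edge lengths or the exit list and rerunning the function corresponds loosely to trying a different cluster configuration.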
Abstract:
This paper describes the first participation of the IR-n system in Spoken Document Retrieval, focusing on the experiments we ran before participating and presenting the results we obtained. IR-n is a passage-based Information Retrieval system that uses sentence recognition to define passages. The main goal of this experiment is therefore to adapt IR-n to the structure of spoken documents by means of an utterance splitter and the overlapping passage technique, which allows utterances to be matched with sentences.
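The overlapping passage technique mentioned in this abstract can be illustrated with a short sketch. It assumes that the document has already been split into units (sentences, or utterances for spoken documents) and that a passage is a fixed-size window of consecutive units. The function name and the `size` and `step` values are hypothetical and are not taken from IR-n.

```python
def overlapping_passages(units, size, step=1):
    """Split a sequence of units into fixed-size, overlapping passages.

    units: list of sentences or utterances, in document order.
    size: number of units per passage.
    step: how many units the window advances; step < size gives overlap.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not units:
        return []
    if len(units) <= size:
        return [list(units)]
    starts = list(range(0, len(units) - size + 1, step))
    # Add a final window if the stepping left the last units uncovered.
    if starts[-1] + size < len(units):
        starts.append(len(units) - size)
    return [list(units[i:i + size]) for i in starts]


utterances = ["u0", "u1", "u2", "u3", "u4"]
passages = overlapping_passages(utterances, size=3, step=1)
```

Because consecutive passages share units, relevant text that straddles a passage boundary still appears whole in at least one passage.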
Abstract:
In the last few years, research on textual information systems has developed considerably. The goal is to improve these systems so that information stored in digital format (digital databases, document databases, and so on) can be easily located, processed and accessed. Many applications focus on information access (for example, Web search systems such as Google or Altavista). However, these applications have problems when they must access cross-language information, or when they need to show information in a language different from that of the query. This paper explores the use of syntactic-semantic patterns as a method for accessing multilingual information and reviews, in the case of Information Retrieval, where it is possible and useful to employ patterns with regard to the multilingual and interactive aspects. On the one hand, the multilingual aspects studied are those related to accessing documents in a language different from that of the query, as well as the automatic translation of the document, i.e. a machine translation system based on patterns. On the other hand, this paper examines in depth the interactive aspects related to reformulating a query based on the syntactic-semantic pattern of the request.
Abstract:
The exponential increase of subjective, user-generated content since the birth of the Social Web has made it necessary to develop automatic text processing systems able to extract, process and present relevant knowledge. In this paper, we tackle the Opinion Retrieval, Mining and Summarization task by proposing a unified framework composed of three crucial components (information retrieval, opinion mining and text summarization) that allow the retrieval, classification and summarization of subjective information. An extensive analysis is conducted, in which different configurations of the framework are suggested and analyzed in order to determine which is the best one, and under which conditions. The evaluation carried out and the results obtained show the appropriateness of the individual components, as well as of the framework as a whole. Having achieved an improvement of over 10% compared to state-of-the-art approaches in the context of blogs, we can conclude that subjective text can be efficiently dealt with by means of our proposed framework.
Abstract:
The study of scientific disciplines is more appealing when accompanied by practical activities. This work proposes a workshop whose aim is to introduce students to the scientific work carried out by geologists and paleontologists, through the paleoenvironmental and biostratigraphic information provided by microfossils and its application to the Messinian Salinity Crisis. This period is considered one of the most significant events in the geological history of the Mediterranean and is characterized by a massive accumulation of evaporites on the basin floor, which is related to the desiccation and subsequent reflooding of the Mediterranean approximately five million years ago. The workshop consists of three sessions: a theoretical one, introducing the content needed to carry out the activity, for which a set of bibliographic and audiovisual resources freely available on the internet is proposed; a practical one, devoted to data collection; and a final one, devoted to interpreting the paleoenvironmental changes, which involves presenting the results in the form of a scientific article followed by a classroom discussion. All the data needed to carry out the activity are provided in this article, although the workshop proposal remains open to any modifications and improvements that teachers consider appropriate. To give the proposal structure, as an example of application, the workshop has been included in the syllabus of the subject Biology and Geology (4º ESO, the fourth year of Spanish compulsory secondary education). Fine-tuning this workshop shows that it is well suited to group work in the classroom, allowing students to feel involved in every phase of a scientific investigation.
Abstract:
The main goal of this paper is to present the initial version of a Textile Chemical Ontology, to be used by textile professionals for conceptualising and representing the harmful chemical substances that are banned in this domain. After analysing different methodologies and determining that “Methontology” is the most appropriate for our purposes, this methodology is explored and applied to the domain. In this manner, an initial set of concepts is defined, together with their hierarchy and the relationships between them. This paper shows the benefits of using the ontology through a real use case in the context of Information Retrieval. The potential shown by the proposed ontology in this preliminary evaluation encourages extending it with a larger number of concepts and relationships, and validating it within other Natural Language Processing applications.
Abstract:
Camera traps have become a widely used technique for conducting biological inventories, generating a large number of database records of great interest. The main aim of this paper is to describe a new free and open source software (FOSS) application, developed to facilitate the management of camera-trap data originating from a protected Mediterranean area (SE Spain). In the last decade, some other useful alternatives have been proposed, but ours focuses especially on collaborative work and on the importance of the spatial information underpinning common camera trap studies. This FOSS application, named “Camera Trap Manager” (CTM), has been designed to expedite the processing of pictures on the .NET platform. CTM offers a very intuitive user interface, automatic extraction of some image metadata (date, time, moon phase, location, temperature, atmospheric pressure, among others), analytical capabilities (Geographical Information Systems, statistics, charts, among others), and reporting capabilities (ESRI Shapefiles, Microsoft Excel spreadsheets, PDF reports, among others). Using this application, we have achieved very simple management, fast analysis, and a significant reduction in costs. While we were able to classify an average of 55 pictures per hour manually, CTM has made it possible to process over 1000 photographs per hour, consequently retrieving a greater amount of data.