868 results for traduzione automatica, machine translation, post-editing, pre-editing, workflow, LSP, TA, MT
Abstract:
This paper analyzes how machine translation has changed the way translation is conceived and practiced in the information age. Starting from a brief review of the early designs of machine translation programs, I discuss the changes made to these systems over the past decades to combine mechanical processing with the accessory work of the translator.
Abstract:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Abstract:
The classification of texts has become a major endeavor with so much electronic material available, for it is an essential task in several applications, including search engines and information retrieval. There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of quality of machine translated texts and authorship recognition. We shall show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Notwithstanding, the Katz similarity, which involves semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well.
Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies. (c) 2012 Elsevier B.V. All rights reserved.
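As a rough illustration of what representing a text as a complex network can mean, the sketch below builds a word adjacency network and computes two simple topological features. The adjacency construction and the function names (`build_adjacency_network`, `degree`, `clustering_coefficient`) are assumptions made for this example; the abstract does not specify how the networks were built or which metrics were used beyond the Katz similarity.

```python
from collections import defaultdict


def build_adjacency_network(text):
    """Link each word to the word that immediately follows it (undirected).

    Assumption: a plain word adjacency model with naive tokenisation.
    """
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    graph = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def degree(graph, node):
    """Number of distinct words that appear next to `node`."""
    return len(graph[node])


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))


sample = "the cat saw the dog and the dog saw the cat"
network = build_adjacency_network(sample)
print(degree(network, "the"), clustering_coefficient(network, "the"))
```

Feature vectors made of such per-word or averaged metrics could then be fed to any standard classifier, which is one plausible way of combining topological and semantic features.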
Abstract:
The realization that statistical physics methods can be applied to analyze written texts represented as complex networks has led to several developments in natural language processing, including automatic summarization and evaluation of machine translation. Most importantly, so far only a few metrics of complex networks have been used and therefore there is ample opportunity to enhance the statistics-based methods as new measures of network topology and dynamics are created. In this paper, we employ for the first time the metrics betweenness, vulnerability and diversity to analyze written texts in Brazilian Portuguese. Using strategies based on diversity metrics, a better performance in automatic summarization is achieved in comparison to previous work employing complex networks. With an optimized method the Rouge score (an automatic evaluation method used in summarization) was 0.5089, which is the best value ever achieved for an extractive summarizer with statistical methods based on complex networks for Brazilian Portuguese. Furthermore, the diversity metric can detect keywords with high precision, which is why we believe it is suitable to produce good summaries. It is also shown that incorporating linguistic knowledge through a syntactic parser does enhance the performance of the automatic summarizers, as expected, but the increase in the Rouge score is only minor. These results reinforce the suitability of complex network methods for improving automatic summarizers in particular, and treating text in general. (C) 2011 Elsevier B.V. All rights reserved.
Abstract:
The automatic disambiguation of word senses (i.e., the identification of which of the meanings is used in a given context for a word that has multiple meanings) is essential for such applications as machine translation and information retrieval, and represents a key step for developing the so-called Semantic Web. Humans disambiguate words in a straightforward fashion, but this does not apply to computers. In this paper we address the problem of Word Sense Disambiguation (WSD) by treating texts as complex networks, and show that word senses can be distinguished upon characterizing the local structure around ambiguous words. Our goal was not to obtain the best possible disambiguation system, but we nevertheless found that in half of the cases our approach outperforms traditional shallow methods. We show that the hierarchical connectivity and clustering of words are usually the most relevant features for WSD. The results reported here shed light on the relationship between semantic and structural parameters of complex networks. They also indicate that when combined with traditional techniques the complex network approach may be useful to enhance the discrimination of senses in large texts. Copyright (C) EPLA, 2012
Abstract:
Rice (Oryza sativa L.) is an important cash crop in Honduras because of the rice lobby’s size, willingness to protest, and ability to negotiate favorable price guarantees on a year-to-year basis. Despite the availability of inexpensive irrigation in the study area in Flores, La Villa de San Antonio, Comayagua, the rice farmers do not cultivate the crop using prescribed methods such as land leveling, puddling, and water conservation structures. Soil moisture (Volumetric Water Content) was measured using a soil moisture probe after the termination of the first irrigation within the tillering/vegetative, panicle emergence/flowering, post-flowering/pre-maturation and maturation stages. Yield data were obtained by harvesting 1 m² plots at each soil moisture testing site. Data were analyzed to find the influence of toposequential position along transects, slope, soil moisture, and farmers on yields. The results showed that toposequential position had more influence on yields than slope or soil moisture. Soil moisture was not a significant predictor of rice yields. Irrigation politics, precipitation, and land tenure were proposed as the major explanatory variables for this result.
Abstract:
We postulated that neuromuscular disuse results in deleteriously affected tissue-vascular fluid exchange processes and subsequently damages the important oxidative bioenergetic process of intramuscular lipid metabolism. The in-depth research reported in the literature is somewhat limited by the ex vivo nature and sporadic time-course characterization of disuse atrophy and recovery. Thus, an in vivo controlled, localized animal model of disuse atrophy was developed in one of the hindlimbs of laboratory rabbits (employing surgically implanted tetrodotoxin (TTX)-filled mini-osmotic pump-sciatic nerve superfusion system) and tested repeatedly with magnetic resonance (MR) throughout the 2-week period of temporarily induced disuse and during the recovery period (following explantation of the TTX-filled pump) for a period of 3 weeks. Controls consisted of saline/"sham"-implanted rabbit hindlimbs. The validity of this model was established with repeated electrophysiologic nerve conduction testing using a clinically appropriate protocol and percutaneously inserted small needle stimulating and recording electrodes. Evoked responses recorded from proximal (P) and distal (D) sites to the sciatic nerve cuff in the TTX-implanted group revealed significantly decreased (p < 0.001) proximal-to-distal (P/D) amplitude ratios (as much as 50-70% below Baseline/pre-implanted and sham-implanted group values) and significantly increased (p < 0.01) differential latency (PL-DL) values (as much as 1.5 times the pre- and sham-implanted groups). By Day 21 of recovery, observed P/D and PL-DL levels matched Baseline/sham-implanted levels.
MRI-determined cross-sectional area (CSA) values of Baseline/pre-implanted, sham- or TTX-implanted, and recovering/explanted and the corresponding contralateral hindlimb tibialis anterior (TA) muscles normalized to tibial bone (TB) CSA (in TA/TB ratios) revealed that there was a significant decline (indicative of atrophic response) from pre- and sham-implanted controls by as much as 20% (p < 0.01) at Day 7 and 50-55% (p < 0.001) at Day 13 of TTX-implantation. In the non-implanted contralaterals, a significant increase (indicative of hypertrophic response) by as much as 10% (p < 0.025) at Day 7 and 27% (p < 0.001) at Day 13 + TTX was found. The induced atrophic/hypertrophic TA muscles were observed to be fully recovered by Day 21 post-explantation as evidenced by image TA/TB ratios. End-point biopsy results from a small group of rabbits revealed comprehensive atrophy of both Type I and Type II fibers, although the heterogeneity of the response supports the use of image-guided, volume-localized proton magnetic resonance spectroscopy (MRS) to noninvasively assess tissue-level metabolic changes. MRS-determined results of a 0.25 cc volume of tissue within implanted limb TA muscles under resting/pre-ischemic, ischemic-stressed, and post-ischemic conditions at timepoints during and following disuse atrophy/recovery revealed significantly increased intramuscular spectral lipid levels, as much as 2-3 times (p < 0.01) the Baseline/pre-implanted values at Day 7 and 6-7 times (p < 0.001) at Day 13 + TTX, which approached normal levels (compared to pre- and sham-implanted groups) by Day 21 of post-explantation recovery. (Abstract shortened by UMI.)
Abstract:
Review of this book, which is the author's thesis.
Abstract:
OntoTag - A Linguistic and Ontological Annotation Model Suitable for the Semantic Web
1. INTRODUCTION. LINGUISTIC TOOLS AND ANNOTATIONS: THEIR LIGHTS AND SHADOWS
Computational Linguistics is already a consolidated research area. It builds upon the results of two other major areas, namely Linguistics and Computer Science and Engineering, and it aims at developing computational models of human language (or natural language, as it is termed in this area). Its best-known applications are possibly the different tools developed so far for processing human language, such as machine translation systems and speech recognizers or dictation programs.
These tools for processing human language are commonly referred to as linguistic tools. Apart from the examples mentioned above, there are also other types of linguistic tools that perhaps are not so well-known, but on which most of the other applications of Computational Linguistics are built. These other types of linguistic tools comprise POS taggers, natural language parsers and semantic taggers, amongst others. All of them can be termed linguistic annotation tools.
Linguistic annotation tools are important assets. In fact, POS and semantic taggers (and, to a lesser extent, also natural language parsers) have become critical resources for the computer applications that process natural language. Hence, any computer application that has to analyse a text automatically and ‘intelligently’ will include at least a module for POS tagging. The more an application needs to ‘understand’ the meaning of the text it processes, the more linguistic tools and/or modules it will incorporate and integrate.
However, linguistic annotation tools still have some limitations, which can be summarised as follows:
1. Normally, they perform annotations only at a certain linguistic level (that is, Morphology, Syntax, Semantics, etc.).
2. They usually introduce a certain rate of errors and ambiguities when tagging. This error rate ranges from 10 percent up to 50 percent of the units annotated for unrestricted, general texts.
3. Their annotations are most frequently formulated in terms of an annotation schema designed and implemented ad hoc.
A priori, it seems that the interoperation and integration of several linguistic tools within an appropriate software architecture could most likely overcome the limitation stated in (1). Besides, integrating several linguistic annotation tools and making them interoperate could also minimise the limitation stated in (2). Nevertheless, in the latter case, all these tools should produce annotations for a common level, which would have to be combined in order to correct their corresponding errors and inaccuracies. Yet the limitation stated in (3) prevents both types of integration and interoperation from being easily achieved.
In addition, most high-level annotation tools rely on lower-level annotation tools and their outputs to generate their own. For example, sense-tagging tools (operating at the semantic level) often use POS taggers (operating at a lower level, i.e., the morphosyntactic) to identify the grammatical category of the word or lexical unit they are annotating. Accordingly, if a faulty or inaccurate low-level annotation tool is to be used by a higher-level one, the errors and inaccuracies of the former should be minimised in advance. Otherwise, these errors and inaccuracies would be transferred to (and even magnified in) the annotations of the high-level annotation tool.
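One simple way to picture combining annotations produced for a common level is a per-token majority vote over the outputs of several POS taggers. The sketch below only illustrates that general idea; the voting rule, the function name `combine_tags` and the toy tag sequences are assumptions of this example, not the combination mechanism defined by OntoTag.

```python
from collections import Counter


def combine_tags(tagger_outputs):
    """Merge per-token tags from several taggers by majority vote.

    Hypothetical rule: when the top two tags tie, the token is marked
    None so that it can be reviewed instead of guessed.
    """
    lengths = {len(tags) for tags in tagger_outputs}
    if len(lengths) != 1:
        raise ValueError("all taggers must annotate the same tokens")
    merged = []
    for token_tags in zip(*tagger_outputs):
        ranked = Counter(token_tags).most_common()
        top_tag, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        merged.append(None if tied else top_tag)
    return merged


# Toy outputs of three hypothetical taggers for the same three tokens.
tagger_a = ["DET", "NOUN", "VERB"]
tagger_b = ["DET", "VERB", "VERB"]
tagger_c = ["DET", "NOUN", "NOUN"]
merged = combine_tags([tagger_a, tagger_b, tagger_c])
print(merged)
```

A vote like this only works once the taggers share a tagset, which is exactly the interoperability obstacle raised in limitation (3).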
Therefore, it would be quite useful to find a way to
(i) correct or, at least, reduce the errors and the inaccuracies of lower-level linguistic tools;
(ii) unify the annotation schemas of different linguistic annotation tools or, more generally speaking, make these tools (as well as their annotations) interoperate.
Clearly, solving (i) and (ii) should ease the automatic annotation of web pages by means of linguistic tools, and their transformation into Semantic Web pages (Berners-Lee, Hendler and Lassila, 2001). Yet, as stated above, (ii) is a type of interoperability problem, and ontologies (Gruber, 1993; Borst, 1997) have so far been applied successfully to several interoperability problems. Hence, ontologies should also help solve the aforementioned problems and limitations of linguistic annotation tools.
Thus, to summarise, the main aim of the present work was to combine these separate approaches, mechanisms and tools for annotation from Linguistics and Ontological Engineering (and the Semantic Web) in a hybrid (linguistic and ontological) annotation model, suitable for both areas. This hybrid (semantic) annotation model should (a) benefit from the advances, models, techniques, mechanisms and tools of these two areas; (b) minimise (and even solve, when possible) some of the problems found in each of them; and (c) be suitable for the Semantic Web. The concrete goals that helped attain this aim are presented in the following section.
2. GOALS OF THE PRESENT WORK
As mentioned above, the main goal of this work was to specify a hybrid (that is, linguistically-motivated and ontology-based) model of annotation suitable for the Semantic Web (i.e. it had to produce a semantic annotation of web page contents). This entailed that the tags included in the annotations of the model had to (1) represent linguistic concepts (or linguistic categories, as they are termed in ISO/DCR (2008)), in order for this model to be linguistically-motivated; (2) be ontological terms (i.e., use an ontological vocabulary), in order for the model to be ontology-based; and (3) be structured (linked) as a collection of ontology-based
Abstract:
This paper describes the UPM system for the Spanish-English translation task at the NAACL 2012 workshop on statistical machine translation. The system is based on Moses. We used all freely available corpora, cleaning them and deleting some repetitions. We also propose a technique for selecting the sentences used to tune the system, based on their similarity to the sentences to be translated. With this approach, the BLEU score improves from 28.37% to 28.57%. In the WMT12 challenge we obtained 31.80% BLEU on the 2012 test set. Finally, we describe several experiments carried out after the competition.
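The idea of choosing tuning sentences by their similarity to the sentences to be translated can be sketched as follows. The similarity measure used here (cosine similarity over bags of words) and the names `bag_of_words`, `cosine` and `select_tuning_sentences` are assumptions for illustration; the abstract does not say which similarity measure the UPM system uses.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Lowercased whitespace tokens with their counts."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def select_tuning_sentences(candidates, test_sentences, k):
    """Keep the k candidates closest to any sentence of the test set."""
    test_bags = [bag_of_words(s) for s in test_sentences]

    def best_match(sentence):
        bag = bag_of_words(sentence)
        return max((cosine(bag, t) for t in test_bags), default=0.0)

    return sorted(candidates, key=best_match, reverse=True)[:k]


candidates = ["el gato duerme", "la bolsa sube hoy", "el perro come"]
test_set = ["el gato come"]
selected = select_tuning_sentences(candidates, test_set, k=2)
print(selected)
```

In a real setup the source side of each candidate sentence pair would be scored and the selected pairs passed to the tuning step; the reported gain from this kind of selection is 0.20 BLEU points (28.37% to 28.57%).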
Abstract:
Today, in the post-genomic era, cancer clinical trials involve collaboration among many institutions. Multicentre, retrospective analysis requires advanced methods to guarantee semantic interoperability. In this scenario, the goal of the EURECA and INTEGRATE projects is to provide an infrastructure for sharing knowledge and data from post-genomic cancer clinical trials. Largely because of the complexity of the collaborative processes among institutions, managing such heterogeneous information is a challenge in the medical domain. Semantic technologies and related research focus on discovering knowledge in the extracted information, allowing greater flexibility and usability of the extracted data. Given the lack of standards adopted by these institutions and the complexity of clinical trial data, a semantic capability is essential to ensure the homogeneous integration of this information. Otherwise, end users would need to know every model and every data format of the institutions participating in each study. To provide a semantic interoperability layer, the first step is to propose a "Common Data Model" (CDM) that represents the information to be stored, and a "Core Dataset" that allows multiple terminologies to be used as a shared vocabulary. Once the "Core Dataset" and the CDM have been selected, mapping the concepts of a given terminology onto the CDM requires a dedicated mechanism. This mechanism must define which concepts from the different vocabularies may be stored in which fields of the data model, in order to create a common representation of the information.
This final-year undergraduate project presents the development of a service that implements this mechanism to bind elements of the medical terminologies SNOMED CT, LOINC and HGNC to objects of the "Health Level 7 Reference Information Model" (HL7 RIM). The proposed service, named TermBinding, follows the recommendations of the HL7 group's TermInfo project, while also taking into account important issues that arise when linking these terminologies to the proposed data model. In developing semantic interoperability for cancer clinical trials, data from heterogeneous sources have to be integrated, and a homogeneous access interface to all this information is required. To unify data coming from different applications and databases, it is essential to represent all these data in a canonical or normalised form. Standardising a given SNOMED CT concept simplifies the TermInfo recommendations used to store each concept in the data model. Following this approach, semantic interoperability is achieved for SNOMED CT concepts, whether pre- or post-coordinated, as well as for the LOINC and HGNC terminologies. Concepts are standardised into a normal form that can be used to bind the data to the "Common Data Model" based on the HL7 RIM. Although limitations remain because of the great heterogeneity of the data to be integrated, a first prototype of the proposed service is being used successfully in the context of the EURECA and INTEGRATE projects. Improving the semantic interoperability of cancer clinical trial data is intended to improve practice in oncology.
Abstract:
The work presented here develops a model to calculate the semantic distance between two sentences represented by their UNL graphs. This problem arises in the context of machine translation, where different translators can generate slightly different sentences from the same original. The proposed distance measure aims to provide an objective evaluation of the quality of the text generation process. The author explores the state of the art on this subject, bringing together in a single work the proposed models of semantic distance between concepts, the models for comparing graphs and the few proposals made to calculate distances between conceptual graphs. The work also assesses the few resources available for experimenting with the model and presents a methodology for generating the datasets that would be needed to apply the proposal with the required scientific rigor and to carry out the experimentation. Building on these pieces, a new model for comparing conceptual graphs is proposed; it allows different concept-distance algorithms to be used and tolerance thresholds to be set, so that sentences can be compared flexibly. The model is implemented in C++, fed with the resources referred to above, and tested on a set of sentences created by the author because no other resources were available.
The results show that the methodology and the implementation can lead to a measure of distance between UNL graphs with application in machine translation systems. However, the lack of resources and of labeled data with which to validate the algorithm means that substantial preliminary work is needed before conclusive results can be offered.
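A toy version of such a comparison, with a pluggable concept distance and a tolerance threshold, might look like the sketch below. It is written in Python rather than the C++ of the original work, represents each graph as a list of (head, relation, dependent) triples, and uses greedy matching. The representation, the matching strategy, the names `graph_distance` and `toy_concept_distance`, and the distance values are all assumptions of this example, not the model proposed in the thesis.

```python
def graph_distance(graph_a, graph_b, concept_distance, tolerance=0.0):
    """Fraction of relations left unmatched between two graphs, in [0, 1].

    Two triples match when their relation labels are equal and both
    pairs of concepts lie within `tolerance` of each other. Matching is
    greedy, which is a simplification.
    """
    unmatched_b = list(graph_b)
    matched = 0
    for head_a, rel_a, dep_a in graph_a:
        for triple in unmatched_b:
            head_b, rel_b, dep_b = triple
            if (
                rel_a == rel_b
                and concept_distance(head_a, head_b) <= tolerance
                and concept_distance(dep_a, dep_b) <= tolerance
            ):
                unmatched_b.remove(triple)
                matched += 1
                break
    total = max(len(graph_a), len(graph_b))
    return 1.0 - matched / total if total else 0.0


# Hypothetical concept distances; a real system would query a lexical resource.
TOY_DISTANCES = {frozenset(("dog", "puppy")): 0.2}


def toy_concept_distance(a, b):
    if a == b:
        return 0.0
    return TOY_DISTANCES.get(frozenset((a, b)), 1.0)


# Relation labels here are illustrative only.
reference = [("bark", "agt", "dog"), ("bark", "tim", "night")]
candidate = [("bark", "agt", "puppy"), ("bark", "tim", "night")]
strict = graph_distance(reference, candidate, toy_concept_distance, tolerance=0.0)
relaxed = graph_distance(reference, candidate, toy_concept_distance, tolerance=0.25)
print(strict, relaxed)
```

Raising the tolerance lets near-synonymous concepts count as equal, which is the kind of flexible comparison the abstract describes.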
Abstract:
The purpose of this thesis is the automatic construction of ontologies from texts, within the area known as Ontology Learning. This discipline aims to automate the building of domain models from structured or unstructured information sources. It originated at the beginning of the millennium, as a result of the exponential growth in the volume of information accessible on the Internet. Since most information on the web is presented as text, automatic ontology learning has focused on analysing this type of source, drawing over the years on very diverse techniques from areas such as Information Retrieval, Information Extraction, Summarization and, in general, areas related to natural language processing. The main contribution of this thesis is that, unlike most current techniques, the proposed method does not analyse the surface syntactic structure of the language but studies its deep semantic level. Its objective, therefore, is to infer the domain model from the way the meanings of sentences are articulated in natural language. Since the deep semantic level is independent of the language, the method can operate in multilingual scenarios, where information from texts in different languages has to be combined. To access this level of the language, the method uses the interlingua model. These formalisms, which come from the area of machine translation, allow the meaning of sentences to be represented independently of the language. Specifically, UNL (Universal Networking Language) is used, which is considered to be the only standardised general-purpose interlingua.
The approach used in this thesis continues previous work by both its author and the research group to which he belongs, which studied how to use the interlingua model in the areas of multilingual information extraction and retrieval. Basically, the procedure defined in the method tries to identify, in the UNL representation of the texts, certain regularities from which the pieces of the domain ontology can be deduced. Since UNL is a formalism based on semantic networks, these regularities take the form of graphs, generalised into structures called linguistic patterns. UNL also preserves certain mechanisms of discourse cohesion from natural languages, such as anaphora. To make the understanding of expressions more effective, the method provides, as another significant contribution, an algorithm for resolving pronominal anaphora within the interlingua model, limited to third-person personal pronouns whose antecedent is a proper noun. The proposed method rests on a formal framework, built by adapting some definitions from graph theory and adding new ones, in order to place the notions of UNL expression and linguistic pattern, as well as the pattern-matching operations that underlie the processes of the method. Both the formal framework and all the processes defined by the method have been implemented in order to carry out the experimentation, which was applied to an article from UNESCO's EOLSS collection, the “Encyclopedia of Life Support Systems”.
Abstract:
This paper deals with the recognition of temporal expressions and the resolution of their temporal reference. We present the units we have used to tackle these tasks over a restricted domain. We work with newspaper articles in Spanish, so every reference we use is in Spanish. To identify and recognise temporal expressions we rely on a temporal expression grammar; to resolve them we rely on a dictionary that holds the information needed to perform the date operation for each recognised expression. In the evaluation of our proposal we obtained successful results for the examples studied.
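The two-step scheme described above (a grammar to recognise expressions, a dictionary to resolve them against a reference date) can be illustrated with a deliberately tiny sketch. The regular expression stands in for the grammar and the `OFFSETS` table for the dictionary; both, together with the function name `resolve`, are assumptions of this example and cover only a handful of Spanish deictic expressions. A realistic grammar would also have to handle ambiguity, for instance "mañana" meaning "morning" rather than "tomorrow".

```python
import re
from datetime import date, timedelta

# Hypothetical dictionary: recognised expression -> day offset from the
# publication date of the article.
OFFSETS = {
    "anteayer": -2,
    "ayer": -1,
    "hoy": 0,
    "mañana": 1,
    "pasado mañana": 2,
}

# Toy "grammar": the longer expression is listed first so that
# "pasado mañana" is not read as plain "mañana".
PATTERN = re.compile(r"\b(pasado mañana|anteayer|ayer|hoy|mañana)\b", re.IGNORECASE)


def resolve(text, reference_date):
    """Return (expression, resolved date) pairs found in `text`."""
    results = []
    for match in PATTERN.finditer(text):
        expression = match.group(1).lower()
        resolved = reference_date + timedelta(days=OFFSETS[expression])
        results.append((expression, resolved))
    return results


article_date = date(2012, 3, 15)
found = resolve("El presidente llegó ayer y se marcha pasado mañana.", article_date)
print(found)
```

Keeping recognition and resolution separate, as the abstract describes, means new expressions can be added to the dictionary without touching the resolution code.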
Abstract:
In the last few years, research on textual information systems has developed considerably. The goal is to improve these systems so that information stored in digital format (digital databases, document databases, and so on) can be easily located, processed and accessed. Many applications focus on information access (for example, Web search systems such as Google or Altavista). However, these applications run into problems when they must access cross-language information, or when they need to show information in a language different from that of the query. This paper explores the use of syntactic-semantic patterns as a method for accessing multilingual information and reviews, for the case of Information Retrieval, where it is possible and useful to employ patterns in the multilingual and interactive aspects. On the one hand, the multilingual aspects studied are those related to accessing documents in languages different from that of the query, as well as the automatic translation of the document, i.e. a pattern-based machine translation system. On the other hand, the paper examines in depth the interactive aspects related to reformulating a query on the basis of the syntactic-semantic pattern of the request.