5 resultados para Parallel Corpus
em Universidad Politécnica de Madrid
Resumo:
This paper describes a preprocessing module for improving the performance of a Spanish into Spanish Sign Language (Lengua de Signos Espanola: LSE) translation system when dealing with sparse training data. This preprocessing module replaces Spanish words with associated tags. The list with Spanish words (vocabulary) and associated tags used by this module is computed automatically considering those signs that show the highest probability of being the translation of every Spanish word. This automatic tag extraction has been compared to a manual strategy achieving almost the same improvement. In this analysis, several alternatives for dealing with non-relevant words have been studied. Non-relevant words are Spanish words not assigned to any sign. The preprocessing module has been incorporated into two well-known statistical translation architectures: a phrase-based system and a Statistical Finite State Transducer (SFST). This system has been developed for a specific application domain: the renewal of Identity Documents and Driver's License. In order to evaluate the system a parallel corpus made up of 4080 Spanish sentences and their LSE translation has been used. The evaluation results revealed a significant performance improvement when including this preprocessing module. In the phrase-based system, the proposed module has given rise to an increase in BLEU (Bilingual Evaluation Understudy) from 73.8% to 81.0% and an increase in the human evaluation score from 0.64 to 0.83. In the case of SFST, BLEU increased from 70.6% to 78.4% and the human evaluation score from 0.65 to 0.82.
Resumo:
This paper proposes a methodology for developing a speech into sign language translation system considering a user-centered strategy. This method-ology consists of four main steps: analysis of technical and user requirements, data collection, technology adaptation to the new domain, and finally, evalua-tion of the system. The two most demanding tasks are the sign generation and the translation rules generation. Many other aspects can be updated automatical-ly from a parallel corpus that includes sentences (in Spanish and LSE: Lengua de Signos Española) related to the application domain. In this paper, we explain how to apply this methodology in order to develop two translation systems in two specific domains: bus transport information and hotel reception.
Resumo:
This paper presents a methodology for adapting an advanced communication system for deaf people in a new domain. This methodology is a user-centered design approach consisting of four main steps: requirement analysis, parallel corpus generation, technology adaptation to the new domain, and finally, system evaluation. In this paper, the new considered domain has been the dialogues in a hotel reception. With this methodology, it was possible to develop the system in a few months, obtaining very good performance: good speech recognition and translation rates (around 90%) with small processing times.
Resumo:
La principal aportación de esta tesis doctoral ha sido la propuesta y evaluación de un sistema de traducción automática que permite la comunicación entre personas oyentes y sordas. Este sistema está formado a su vez por dos sistemas: un traductor de habla en español a Lengua de Signos Española (LSE) escrita y que posteriormente se representa mediante un agente animado; y un generador de habla en español a partir de una secuencia de signos escritos mediante glosas. El primero de ellos consta de un reconocedor de habla, un módulo de traducción entre lenguas y un agente animado que representa los signos en LSE. El segundo sistema está formado por una interfaz gráfica donde se puede especificar una secuencia de signos mediante glosas (palabras en mayúscula que representan los signos), un módulo de traducción entre lenguas y un conversor texto-habla. Para el desarrollo del sistema de traducción, en primer lugar se ha generado un corpus paralelo de 7696 frases en español con sus correspondientes traducciones a LSE. Estas frases pertenecen a cuatro dominios de aplicación distintos: la renovación del Documento Nacional de Identidad, la renovación del permiso de conducir, un servicio de información de autobuses urbanos y la recepción de un hotel. Además, se ha generado una base de datos con más de 1000 signos almacenados en cuatro sistemas distintos de signo-escritura. En segundo lugar, se ha desarrollado un módulo de traducción automática que integra dos técnicas de traducción con una estructura jerárquica: la primera basada en memoria y la segunda estadística. Además, se ha implementado un módulo de pre-procesamiento de las frases en español que, mediante su incorporación al módulo de traducción estadística, permite mejorar significativamente la tasa de traducción. En esta tesis también se ha mejorado la versión de la interfaz de traducción de LSE a habla. Por un lado, se han incorporado nuevas características que mejoran su usabilidad y, por otro, se ha integrado un traductor de lenguaje SMS (Short Message Service – Servicio de Mensajes Cortos) a español, que permite especificar la secuencia a traducir en lenguaje SMS, además de mediante una secuencia de glosas. El sistema de traducción propuesto se ha evaluado con usuarios reales en dos dominios de aplicación: un servicio de información de autobuses de la Empresa Municipal de Transportes de Madrid y la recepción del Hotel Intur Palacio San Martín de Madrid. En la evaluación estuvieron implicadas personas sordas y empleados de los dos servicios. Se extrajeron medidas objetivas (obtenidas por el sistema automáticamente) y subjetivas (mediante cuestionarios a los usuarios). Los resultados fueron muy positivos gracias a la opinión de los usuarios de la evaluación, que validaron el funcionamiento del sistema de traducción y dieron información valiosa para futuras líneas de trabajo. Por otro lado, tras la integración de cada uno de los módulos de los dos sistemas de traducción (habla-LSE y LSE-habla), los resultados de la evaluación y la experiencia adquirida en todo el proceso, una aportación importante de esta tesis doctoral es la propuesta de metodología de desarrollo de sistemas de traducción de habla a lengua de signos en los dos sentidos de la comunicación. En esta metodología se detallan los pasos a seguir para desarrollar el sistema de traducción para un nuevo dominio de aplicación. Además, la metodología describe cómo diseñar cada uno de los módulos del sistema para mejorar su flexibilidad, de manera que resulte más sencillo adaptar el sistema desarrollado a un nuevo dominio de aplicación. Finalmente, en esta tesis se analizan algunas técnicas para seleccionar las frases de un corpus paralelo fuera de dominio para entrenar el modelo de traducción cuando se quieren traducir frases de un nuevo dominio de aplicación; así como técnicas para seleccionar qué frases del nuevo dominio resultan más interesantes que traduzcan los expertos en LSE para entrenar el modelo de traducción. El objetivo es conseguir una buena tasa de traducción con la menor cantidad posible de frases. ABSTRACT The main contribution of this thesis has been the proposal and evaluation of an automatic translation system for improving the communication between hearing and deaf people. This system is made up of two systems: a Spanish into Spanish Sign Language (LSE – Lengua de Signos Española) translator and a Spanish generator from LSE sign sequences. The first one consists of a speech recognizer, a language translation module and an avatar that represents the sign sequence. The second one is made up an interface for specifying the sign sequence, a language translation module and a text-to-speech conversor. For the translation system development, firstly, a parallel corpus has been generated with 7,696 Spanish sentences and their LSE translations. These sentences are related to four different application domains: the renewal of the Identity Document, the renewal of the driver license, a bus information service and a hotel reception. Moreover, a sign database has been generated with more than 1,000 signs described in four different signwriting systems. Secondly, it has been developed an automatic translation module that integrates two translation techniques in a hierarchical structure: the first one is a memory-based technique and the second one is statistical. Furthermore, a pre processing module for the Spanish sentences has been implemented. By incorporating this pre processing module into the statistical translation module, the accuracy of the translation module improves significantly. In this thesis, the LSE into speech translation interface has been improved. On the one hand, new characteristics that improve its usability have been incorporated and, on the other hand, a SMS language into Spanish translator has been integrated, that lets specifying in SMS language the sequence to translate, besides by specifying a sign sequence. The proposed translation system has been evaluated in two application domains: a bus information service of the Empresa Municipal de Transportes of Madrid and the Hotel Intur Palacio San Martín reception. This evaluation has involved both deaf people and services employees. Objective measurements (given automatically by the system) and subjective measurements (given by user questionnaires) were extracted during the evaluation. Results have been very positive, thanks to the user opinions during the evaluation that validated the system performance and gave important information for future work. Finally, after the integration of each module of the two translation systems (speech- LSE and LSE-speech), obtaining the evaluation results and considering the experience throughout the process, a methodology for developing speech into sign language (and vice versa) into a new domain has been proposed in this thesis. This methodology includes the steps to follow for developing the translation system in a new application domain. Moreover, this methodology proposes the way to improve the flexibility of each system module, so that the adaptation of the system to a new application domain can be easier. On the other hand, some techniques are analyzed for selecting the out-of-domain parallel corpus sentences in order to train the translation module in a new domain; as well as techniques for selecting which in-domain sentences are more interesting for translating them (by LSE experts) in order to train the translation model.
Resumo:
A methodology for developing an advanced communications system for the Deaf in a new domain is presented in this paper. This methodology is a user-centred design approach consisting of four main steps: requirement analysis, parallel corpus generation, technology adaptation to the new domain, and finally, system evaluation. During the requirement analysis, both the user and technical requirements are evaluated and defined. For generating the parallel corpus, it is necessary to collect Spanish sentences in the new domain and translate them into LSE (Lengua de Signos Española: Spanish Sign Language). LSE is represented by glosses and using video recordings. This corpus is used for training the two main modules of the advanced communications system to the new domain: the spoken Spanish into the LSE translation module and the Spanish generation from the LSE module. The main aspects to be generated are the vocabularies for both languages (Spanish words and signs), and the knowledge for translating in both directions. Finally, the field evaluation is carried out with deaf people using the advanced communications system to interact with hearing people in several scenarios. In this evaluation, the paper proposes several objective and subjective measurements for evaluating the performance. In this paper, the new considered domain is about dialogues in a hotel reception. Using this methodology, the system was developed in several months, obtaining very good performance: good translation rates (10% Sign Error Rate) with small processing times, allowing face-to-face dialogues.