21 results for Speech Dialog Systems
at Universidad Politécnica de Madrid
Abstract:
Detecting user affect automatically during real-time conversation is the main challenge towards our greater aim of infusing social intelligence into a natural-language mixed-initiative High-Fidelity (Hi-Fi) audio control spoken dialog agent. In recent years, studies on affect detection from voice have moved on to using realistic, non-acted data, in which emotions are subtler. However, subtler emotions are harder to perceive, as tasks such as labelling and machine prediction demonstrate. This paper addresses part of this challenge by considering the role of user satisfaction ratings and of conversational/dialog features in discriminating contentment and frustration, two emotions known to be prevalent in spoken human-computer interaction. However, given the laboratory constraints, users might be positively biased when rating the system, indirectly calling the reliability of the satisfaction data into question. Machine learning experiments were conducted on two datasets, from users and from annotators, which were then compared in order to assess their reliability. Our results indicated that standard classifiers were significantly more successful in discriminating the abovementioned emotions and their intensities (reflected by user satisfaction ratings) from annotator data than from user data. These results corroborated that, first, satisfaction data can be used directly as an alternative target variable for modelling affect, and can be predicted exclusively from dialog features; second, this holds only when predicting the abovementioned emotions from annotators' data, suggesting that user bias does exist in laboratory-led evaluations.
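By way of illustration, the following minimal sketch reproduces the shape of such an experiment: the same standard classifier is trained on dialog features against the two alternative targets, and the cross-validated accuracies are compared. All file names, feature columns, and the choice of classifier are assumptions for the example, not the paper's setup.

```python
# Sketch of the experiment described above: one classifier, two targets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def discrimination_accuracy(X, y, seed=0):
    """Mean 5-fold cross-validated accuracy of a standard classifier."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, y, cv=5).mean()

# X: one row per dialog (e.g. number of turns, errors, mean response delay).
# y_user / y_annot: contentment-vs-frustration labels derived from user
# ratings and annotator judgements, respectively (hypothetical files).
X = np.loadtxt("dialog_features.csv", delimiter=",", skiprows=1)
y_user = np.loadtxt("user_labels.csv", dtype=int)
y_annot = np.loadtxt("annotator_labels.csv", dtype=int)

print("user-target accuracy:     ", discrimination_accuracy(X, y_user))
print("annotator-target accuracy:", discrimination_accuracy(X, y_annot))
```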
Abstract:
This paper presents empirical evidence of user bias within a laboratory-oriented evaluation of a spoken dialog system. Specifically, we address user bias in satisfaction judgements and question the reliability of this data for modelling user emotion, focusing on contentment and frustration in a spoken dialog system. This bias is detected through machine learning experiments conducted on two datasets, from users and from annotators, which were then compared in order to assess their reliability. The target was the satisfaction rating and the predictors were conversational/dialog features. Our results indicated that standard classifiers were significantly more successful in discriminating frustration and contentment, and the intensities of these emotions (reflected by user satisfaction ratings), from annotator data than from user data. Indirectly, the results showed that conversational features are reliable predictors of the two abovementioned emotions.
Abstract:
We present a novel approach for the detection of severe obstructive sleep apnea (OSA) based on patients' voices, introducing nonlinear measures to describe sustained speech dynamics. Nonlinear features were combined with state-of-the-art speech recognition systems using statistical modeling techniques (Gaussian mixture models, GMMs) over cepstral parameterization (MFCC) for both continuous and sustained speech. Tests were performed on a database including speech records from both severe OSA and control speakers. A 10% relative reduction in classification error was obtained for sustained speech when combining MFCC-GMM and nonlinear features, and 33% when fusing nonlinear features with both sustained and continuous MFCC-GMM. Accuracy reached 88.5%, allowing the system to be used in early OSA detection. Tests showed that nonlinear features and MFCCs are weakly correlated on sustained speech, but uncorrelated on continuous speech. The results also suggest the existence of nonlinear effects in OSA patients' voices, which should also be present in continuous speech.
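As a reading of the modeling step named above, here is a minimal sketch of an MFCC-GMM detector with late score fusion of the nonlinear features; the interfaces, model sizes, and fusion weight are assumptions, not the authors' implementation.

```python
# Minimal sketch: GMMs over MFCC frames plus late score fusion.
from sklearn.mixture import GaussianMixture

def train_gmm(mfcc_frames, n_components=16, seed=0):
    """Fit a diagonal-covariance GMM on MFCC frames (n_frames x n_ceps)."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=seed).fit(mfcc_frames)

def llr_score(gmm_osa, gmm_control, mfcc_frames):
    """Average per-frame log-likelihood ratio: severe OSA vs control."""
    return gmm_osa.score(mfcc_frames) - gmm_control.score(mfcc_frames)

def fused_score(mfcc_score, nonlinear_score, alpha=0.7):
    """Late fusion of the two detectors; alpha would be tuned on held-out data."""
    return alpha * mfcc_score + (1.0 - alpha) * nonlinear_score
```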
Abstract:
Traditional text-to-speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. In order to develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music and noise), filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection which improve the precision and overall quality of the segmentation.
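One plausible shape for such a combined architecture, sketched below under assumed interfaces, is to filter out music and noise first and then let diarization group the remaining speech into mono-speaker clusters.

```python
# Illustrative combination: detector-driven filtering before clustering.
def select_training_clusters(segments, is_music_or_noise, speaker_of):
    """Drop noisy/music segments, then bucket clean speech by speaker."""
    clusters = {}
    for seg in segments:
        if is_music_or_noise(seg):
            continue  # filter out low-quality audio
        clusters.setdefault(speaker_of(seg), []).append(seg)
    return clusters  # mono-speaker clusters usable for voice training
```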
Abstract:
The last decade has witnessed major advances in speech recognition technology. Today's commercial systems are able to recognize continuous speech from numerous speakers, with acceptable error levels and without the need for an explicit adaptation procedure. Despite this progress, speech recognition is far from being a solved problem. Most of these systems are adjusted to a particular domain, and their efficacy depends significantly, among many other aspects, on the similarity between the language model used and the task being addressed. This dependence is even more important in scenarios where the statistical properties of the language fluctuate over time, for example, in application domains involving spontaneous and multi-topic speech. Over the last years there has been an increasing effort to enhance speech recognition systems for such domains. This has been done, among other approaches, by means of automatic adaptation techniques. These techniques are applied to existing systems, especially since exporting a system to a new task or domain may be both time-consuming and expensive. Adaptation techniques require additional sources of information, and spoken language can provide some of them.
Speech not only conveys a message; it also provides information about the context in which the spoken communication takes place (e.g. the subject being talked about). Therefore, when we communicate through speech, it is feasible to identify the elements of the language that characterize the context and, at the same time, to track the changes that occur in those elements over time. This information can be extracted and exploited through information retrieval and machine learning techniques. This allows us, within the development of more robust speech recognition systems, to enhance the adaptation of language models to the conditions of the context, thus strengthening the recognition system for domains under changing conditions (such as potential variations in vocabulary, style, and topic). In this sense, the main contribution of this Thesis is the proposal and evaluation of a framework of topic-motivated contextualization based on the dynamic and unsupervised adaptation of language models for the enhancement of an automatic speech recognition system. This adaptation is based on a combined approach (from the perspective of both the information retrieval and machine learning fields) whereby we identify the topics being discussed in an audio recording. Topic identification, therefore, enables the system to adapt the language model according to the contextual conditions. The proposed framework can be divided into two major systems: a topic identification system and a dynamic language model adaptation system. This Thesis can be outlined from the perspective of the particular contributions made in each of the fields that compose the proposed framework. Regarding the topic identification system, we have focused on enhancing document preprocessing techniques and on contributing to the definition of more robust criteria for the selection of index-terms. Within both information retrieval and machine learning based approaches, the efficiency of topic identification systems depends, to a large extent, on the preprocessing mechanisms applied to the documents. Among the many operations that make up the preprocessing procedure, an adequate selection of index-terms is critical to establish conceptual and semantic relationships between terms and documents. This process might also be weakened by a poor choice of stopwords or a lack of precision in defining stemming rules. In this regard we compare and evaluate different criteria for preprocessing the documents, as well as for improving the selection of the index-terms. This allows us not only to reduce the size of the indexing structure but also to strengthen the topic identification process. One of the most crucial aspects for the performance of topic identification systems is assigning different weights to different terms depending on their contribution to the content of the document. In this sense we evaluate and propose alternative approaches to traditional weighting schemes (such as tf-idf) that allow us to improve the specificity of terms and to better identify the topics related to the documents. Regarding the dynamic language model adaptation, we divide the contextualization process into different steps. We propose supervised and unsupervised approaches for the generation of topic-based language models.
The first of them generates topic-based language models by grouping the documents in the training set according to the original topic labels of the corpus. Nevertheless, a goal of this Thesis is to evaluate whether or not the use of these labels to generate language models is optimal in terms of recognition accuracy. For this reason, we propose a second, unsupervised approach, in which the objective is to group the data in the training set into automatic topic clusters based on the semantic similarity between the documents. By means of clustering approaches we expect to obtain a more cohesive association of the documents related by similar concepts, thus improving the coverage of the topic-based language models and enhancing the performance of the recognition system. We develop various strategies to create a context-dependent language model. Our aim is for this model to reflect the semantic context of the current utterance, i.e. the most relevant topics being discussed. This model is generated by means of a linear interpolation between the topic-based language models related to the most relevant topics. The estimation of the interpolation weights is based mainly on the outcome of the topic identification process. Finally, we propose a methodology for the dynamic adaptation of a background language model. The adaptation process takes into account the context-dependent model as well as the information provided by the topic identification process. The scheme used for the adaptation is a linear interpolation between the background model and the context-dependent one. We also study different approaches to determine the interpolation weights used in this adaptation scheme. Once we have defined the basis of our topic-motivated contextualization framework, we propose its application to an automatic speech recognition system. We focus on two aspects: the contextualization of the language models used by the system, and the incorporation of semantic-related information into the topic-based adaptation process. To achieve this, we propose an experimental framework based on a two-stage recognition architecture. In the first stage of the architecture, information retrieval and machine learning techniques are used to identify the topics in a transcription of an audio segment. This transcription is generated by the recognition system using a background language model. According to the confidence in the topics that have been identified, the dynamic language model adaptation is carried out. In the second stage of the recognition architecture, the adapted language model is used to re-decode the utterance. To test the benefits of the proposed framework, we carry out an evaluation of each of the major systems mentioned above. The evaluation is conducted on speeches in the political domain using the EPPS (European Parliamentary Plenary Sessions) database from the European TC-STAR project. We analyse several performance metrics that allow us to compare the improvements of the proposed systems against the baseline ones.
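The two interpolation steps can be summarized in a short sketch, assuming language-model objects with a prob(word, history) method and relevance scores produced by the topic identification system; the weight lam is likewise an assumption.

```python
# Sketch of the two-level linear interpolation described above.
def context_dependent_prob(word, history, topic_lms, relevance):
    """Mix topic-based LMs with weights derived from topic relevance."""
    z = sum(relevance.values())  # relevance assumed non-empty and positive
    return sum((relevance[t] / z) * lm.prob(word, history)
               for t, lm in topic_lms.items())

def adapted_prob(word, history, background_lm, topic_lms, relevance, lam=0.5):
    """Interpolate the background model with the context-dependent one;
    lam itself would be derived from the topic-identification confidence."""
    p_ctx = context_dependent_prob(word, history, topic_lms, relevance)
    return lam * background_lm.prob(word, history) + (1.0 - lam) * p_ctx
```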
Abstract:
It is easy to get frustrated with spoken conversational agents (SCAs), perhaps because they seem to be callous. By and large, the quality of human-computer interaction is degraded by the inability of SCAs to recognise and adapt to the user's emotional state. Now, with the mass appeal of artificially-mediated communication, there is an increasing need for SCAs to be socially and emotionally intelligent, that is, to infer and adapt to their human interlocutors' emotions on the fly, in order to ensure an affective, empathetic, and naturalistic interaction. An enhanced quality of interaction would reduce users' frustration and consequently increase their satisfaction. These reasons have motivated the development of SCAs that include socio-emotional elements, turning them into affective and socially-sensitive interfaces. One barrier to the creation of such interfaces has been the lack of methods for modelling emotions in a task-independent environment. Most emotion models for spoken dialog systems are task-dependent and thus cannot be used as-is in different applications. This Thesis focuses on improving this: the computational modelling of emotion, personality, and their interrelationship for task-independent autonomous SCAs. The generation of emotion is driven by needs, inspired by human motivational systems. The work in this Thesis is organised in three stages, each with its own contribution. The first stage involved defining, integrating, and quantifying the motivational and emotional models drawn from the psychological literature. These were then transformed into a computational model by implementing them as software entities. The computational model was incorporated into and put to the test with an existing SCA host, a HiFi-control agent. The second stage concerned the automatic prediction of affect, which has been the main challenge towards the greater aim of infusing social intelligence into the HiFi agent. In recent years, studies on affect detection from voice have moved on to using realistic, non-acted data, in which emotions are subtler. However, subtler emotions are harder to perceive, as tasks such as labelling and machine prediction demonstrate. In this stage, we attempted to address part of this challenge by considering user satisfaction ratings and conversational/dialog features as the respective target and predictors in discriminating contentment and frustration, two emotions known to be prevalent in spoken human-computer interaction. The final stage concerned the evaluation of the emotional model through the HiFi agent. A series of user studies with 70 subjects was conducted in a real-time environment, each in a different phase and with its own conditions. All the studies involved comparisons between the baseline, non-modified agent and the modified agent. The findings have gone some way towards enhancing our understanding of the utility of emotion in spoken dialog systems in several ways: first, an SCA should not express its emotions blindly, even positive ones; rather, it should adapt its emotions to the user's state. Second, low performance in an SCA may be compensated by the exploitation of emotion. Third, expressing emotion through prosody can improve users' perceptions of an SCA more than expressing it through lexical content alone.
Taken together, these findings not only support the success of the emotional model, but also provide substantial evidence of the benefits of adding emotion to an SCA, especially in mitigating users' frustration and ultimately improving their satisfaction.
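As a deliberately rough illustration of the need-driven generation of emotion described above, the sketch below maps need-satisfaction levels to an emotion label and intensity; the need names and the linear mapping are assumptions, and the model in the Thesis is considerably richer.

```python
# Toy need-driven emotion generation: satisfied needs push valence up,
# unmet needs push it down. All names and the mapping are assumptions.
def emotion_from_needs(needs):
    """needs: dict mapping need name -> satisfaction level in [0, 1]."""
    valence = sum(2 * level - 1 for level in needs.values()) / len(needs)
    label = "contentment" if valence >= 0 else "frustration"
    return label, abs(valence)  # emotion and its intensity

print(emotion_from_needs({"task_success": 0.9, "being_understood": 0.4}))
```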
Abstract:
We describe work on the infusion of emotion into limited-task autonomous spoken conversational agents (SCAs) situated in the domestic environment, using a Need-inspired task-independent Emotion model (NEMO). In order to demonstrate the generation of affect through the use of the model, we describe the work of integrating it with a natural-language mixed-initiative HiFi-control SCA. NEMO and the host system communicate externally, removing the need for the Dialog Manager to be modified in order to be adaptive, as is done in most existing dialog systems. We also summarize the work on automatic affect prediction, namely frustration and contentment, from dialog features, a non-conventional source, in an attempt to move towards a more user-centric approach.
Abstract:
Current text-to-speech systems are developed using studio-recorded speech in a neutral style or based on acted emotions. However, the proliferation of media sharing sites would allow developing a new generation of speech-based systems able to cope with spontaneous and styled speech. This paper proposes an architecture to deal with realistic recordings and carries out experiments on unsupervised speaker diarization. In order to maximize the speaker purity of the clusters while keeping high speaker coverage, the paper evaluates the F-measure of a diarization module, achieving high scores (>85%), especially when the clusters are longer than 30 seconds, even for the more spontaneous and expressive styles (such as talk shows or sports).
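For reference, the F-measure used above is conventionally the harmonic mean of speaker purity and coverage; the snippet below shows that standard formulation, which may differ in detail from the paper's exact scoring.

```python
# Standard F-measure over cluster purity and coverage.
def f_measure(purity, coverage):
    if purity + coverage == 0:
        return 0.0
    return 2 * purity * coverage / (purity + coverage)

print(f_measure(0.92, 0.81))  # ~0.86
```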
Abstract:
The objective of the present project is to provide a pronunciation and vocabulary review activity in English within the Moodle platform hosted on the Integrated Language Learning Lab (ILLLab) website. The ILLLab website aims to provide students at the EUIT de Telecomunicación of the UPM with activities to develop their A2 level according to the Common European Framework of Reference for Languages (CEFR). On the platform, students can work independently to advance towards a B2 level in English. The UPM requires this level of English proficiency for enrolling in the compulsory subject English for Professional and Academic Communication (EPAC), taught in the seventh semester of the Degree in Telecommunications Engineering. Likewise, this project tries to address the problem of the scarce speaking activities included in learning platforms that offer language courses, and specifically English courses. For this purpose, it provides a tool based on speech recognition systems so that the user can practice the pronunciation of English words. The first chapter of the project introduces the application Traffic Lights, explaining its origins and what it consists of. The second chapter deals with theoretical aspects related to speech recognition and comments on its main features and the current applications for which it is generally used. The third chapter provides a detailed explanation of the different programming languages used for the implementation of the project and reviews its code development.
The fourth chapter presents a user manual for the application, showing the user how the application works together with an example of use. In addition, several sections addressed to the application administrator are included, specifying how to add new words to the database and how to change the default settings, such as the estimated time the user has to finish a round of the game.
Abstract:
We describe the work on infusion of emotion into a limited-task autonomous spoken conversational agent situated in the domestic environment, using a need-inspired task-independent emotion model (NEMO). In order to demonstrate the generation of affect through the use of the model, we describe the work of integrating it with a natural-language mixed-initiative HiFi-control spoken conversational agent (SCA). NEMO and the host system communicate externally, removing the need for the Dialog Manager to be modified in order to be adaptive, as is done in most existing dialog systems. The first part of the paper concerns the integration between NEMO and the host agent. The second part summarizes the work on automatic affect prediction, namely frustration and contentment, from dialog features, a non-conventional source, in an attempt to move towards a more user-centric approach. The final part reports the evaluation results obtained from a user study in which both versions of the agent (non-adaptive and emotionally-adaptive) were compared. The results provide substantial evidence of the benefits of adding emotion to a spoken conversational agent, especially in mitigating users' frustration and, ultimately, improving their satisfaction.
Abstract:
Current development platforms for designing spoken dialog services feature different kinds of strategies to help designers build, test, and deploy their applications. In general, these platforms are made up of several assistants that handle the different design stages (e.g. definition of the dialog flow, prompt and grammar definition, database connection, or debugging and testing the running application). In spite of all the advances in this area, designing speech-based dialog services generally remains a time-consuming task that needs to be accelerated. In this paper we describe a complete development platform that reduces the design time by using different types of acceleration strategies based on information from the data model structure and database contents, as well as cumulative information obtained throughout the successive steps of the design. Thanks to these accelerations, the interaction with the platform is simplified and the design is reduced, in most cases, to simple confirmations of the proposals that the platform automatically provides at each stage. Different kinds of proposals are available to complete the application flow, such as the possibility of selecting which information slots should be requested from the user together, predefined templates for common dialogs, the most probable actions that make up each state defined in the flow, and different solutions to specific speech-modality problems such as the presentation of the lists of results retrieved after querying the backend database. The platform also includes accelerations for creating speech grammars and prompts, and the SQL queries for accessing the database at runtime. Finally, we describe the setup and results of simultaneous summative, subjective, and objective evaluations with different designers, used to test the usability of the proposed accelerations as well as their contribution to reducing design time and interaction effort.
Abstract:
Although there has been a lot of interest in recognizing and understanding air traffic control (ATC) speech, none of the published works have reported detailed field-data results. We have developed a system able to identify the language spoken and to recognize and understand sentences in both Spanish and English, and we present field results for several in-tower controller positions. To the best of our knowledge, this is the first time that field ATC speech (not simulated) has been captured, processed, and analyzed. The use of stochastic grammars allows for the variations on the standard phraseology that appear in field data. The robust understanding algorithm developed achieves 95% concept accuracy from ATC text input. It also handles changes in the presentation order of the concepts and corrects errors created by the speech recognition engine, improving the percentage of fully correctly understood sentences by 17% and 25% absolute for English and Spanish, respectively, relative to the percentage of fully correctly recognized sentences. We also analyze the errors due to the spontaneity of the speech and compare them with read speech. A 96% word accuracy for read speech drops to 86% word accuracy on field ATC data in Spanish for the "clearances" task, confirming that field data is needed to estimate the performance of a system. A literature review and a critical discussion of the possibilities of speech recognition and understanding technology applied to ATC speech are also given.
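Concept accuracy of the kind reported above is conventionally computed from a concept-level edit distance, analogous to word accuracy; the sketch below shows that standard formulation, not necessarily the authors' exact scoring script, and the concept sequences are hypothetical.

```python
# Standard edit-distance-based concept accuracy.
def concept_accuracy(ref, hyp):
    """1 - Levenshtein(ref, hyp) / len(ref), over concept sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical concept sequences for one ATC utterance:
print(concept_accuracy(["CLEAR", "RUNWAY_32"], ["CLEAR", "RUNWAY_32", "HOLD"]))
```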
Abstract:
Speech technologies can provide important benefits for the development of more usable and safe in-vehicle human-machine interactive systems (HMIs). However, mainly due to robustness issues, the use of spoken interaction can entail important distractions for the driver. In this challenging scenario, while speech technologies are evolving, further research is necessary to explore how they can be complemented both with other modalities (multimodality) and with information from the increasing number of available sensors (context-awareness). The perceived quality of speech technologies can be significantly increased by implementing such policies, which simply try to make the best use of all the available resources; and the in-vehicle scenario is an excellent test-bed for this kind of initiative. In this contribution we propose an event-based HMI design framework which combines context modelling and multimodal interaction using a W3C XML language known as SCXML. SCXML provides a general process control mechanism that is being considered by W3C to improve both voice interaction (VoiceXML) and multimodal interaction (MMI). In our approach we try to anticipate and extend these initiatives, presenting a flexible SCXML-based approach for the design of a wide range of multimodal, context-aware in-vehicle HMI interfaces. The proposed framework for HMI design and specification has been implemented on an automotive OSGi service platform, and it is being used and tested in the Spanish research project MARTA for the development of several in-vehicle interactive applications.
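To make the event-based control idea concrete, here is a toy state machine of the kind SCXML specifies, written in Python rather than SCXML for brevity; the states, events, and the noise-triggered modality switch are invented for the example.

```python
# Toy event-driven state machine combining a context event with a
# modality switch; all states, events, and actions are invented.
class HMIStateMachine:
    def __init__(self):
        self.state = "idle"
        # (state, event) -> (next_state, action)
        self.transitions = {
            ("idle", "user_speaks"): ("listening", None),
            ("listening", "high_cabin_noise"): ("visual_mode",
                                                "route output to screen"),
            ("listening", "utterance_done"): ("idle", "execute command"),
            ("visual_mode", "noise_cleared"): ("listening",
                                               "switch back to voice"),
        }

    def handle(self, event):
        if (self.state, event) in self.transitions:
            self.state, action = self.transitions[(self.state, event)]
            if action:
                print(action)

hmi = HMIStateMachine()
for ev in ["user_speaks", "high_cabin_noise", "noise_cleared"]:
    hmi.handle(ev)  # context events reconfigure the interaction modality
```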
Abstract:
In this paper, we describe a complete development platform featuring different innovative acceleration strategies, not included in any other current platform, that simplify and speed up the definition of the different elements required to design a spoken dialog service. The proposed accelerations are mainly based on using information from the backend database schema and contents, as well as cumulative information produced throughout the different steps of the design. Thanks to these accelerations, the interaction between the designer and the platform is improved, and in most cases the design is reduced to simple confirmations of the proposals the platform dynamically provides at each step. In addition, the platform provides several other accelerations, such as configurable templates that can be used to define the different tasks in the service or the dialogs to obtain information from or show information to the user, automatic proposals for the best way to request slot contents from the user (i.e. using mixed-initiative or directed forms), an assistant that offers the set of most probable actions required to complete the definition of the different tasks in the application, and another assistant for solving specific modality details such as confirming user answers or presenting the lists of results retrieved after querying the backend database. Additionally, the platform allows the creation of speech grammars and prompts and of database access functions, and supports mixed-initiative and over-answering dialogs. In the paper we also describe each assistant in the platform in detail, emphasizing the methodologies followed to facilitate the design process in each one. Finally, we describe the results obtained in both a subjective and an objective evaluation with different designers, which confirm the viability, usefulness, and functionality of the proposed accelerations. Thanks to the accelerations, the design time is reduced by more than 56% and the number of keystrokes by 84%.
Abstract:
The design and development of spoken interaction systems has been a thoroughly studied research area over the last decades. The aim is to obtain systems able to interact with human agents with a high degree of naturalness and efficiency, allowing them to carry out the actions they desire using speech, as it is the most natural means of communication between humans. To achieve that degree of naturalness, it is not enough to endow systems with the ability to accurately understand users' utterances and to react to them properly, even considering the information provided by the user in his or her previous interactions. The system also has to be aware of the evolution of the conditions under which the interaction takes place, in order to act in the most coherent way possible at each moment. Consequently, one of the most important features of the system is that it has to be context-aware. This context awareness can be reflected in the modification of the system's behaviour taking into account the current situation of the interaction. For instance, the system should decide which action to carry out, or how to perform it, depending on the user who requests it, on the way the user addresses the system, on the characteristics of the environment in which the interaction takes place, and so on. In other words, the system has to adapt its behaviour to these evolving elements of the interaction. Moreover, that adaptation has to be carried out, if possible, in such a way that the user: i) does not perceive that the system has to make any additional effort, or devote interaction time to tasks other than carrying out the requested actions; and ii) does not have to provide the system with any additional information to carry out the adaptation, which would imply a less efficient interaction, since users would have to devote several interactions solely to allowing the system to adapt. In state-of-the-art spoken dialogue systems, researchers have proposed several disparate strategies to adapt the elements of the system to different conditions of the interaction (such as the acoustic characteristics of a specific user's speech, the actions previously requested, and so on). Nevertheless, to our knowledge there is no consensus on the procedures for carrying out this adaptation. The approaches are to an extent unrelated to one another, in the sense that each one considers different pieces of information and treats that information differently according to the adaptation carried out. In this regard, the main contributions of this Thesis are the following: Definition of a contextualization framework. We propose a unified approach that can cover any strategy for adapting the behaviour of a dialogue system to the conditions of the interaction (i.e. the context). In our theoretical definition of the contextualization framework we consider the system's context to be all the sources of variability present at any time of the interaction, whether related to the environment in which the interaction takes place or to the human agent who addresses the system at each moment. Our proposal relies on three aspects that any contextualization approach should fulfil: plasticity (i.e. the system has to be able to modify its behaviour in the most proactive way, taking into account the conditions under which the interaction takes place), adaptivity (i.e.
the system also has to be able to consider the most appropriate sources of information at each moment, both environmental and user- and dialogue-dependent, to effectively adapt to the aforementioned conditions), and transparency (i.e. the system has to carry out the contextualization-related tasks in such a way that the user neither perceives them nor has to make any effort to provide the system with the information it needs to perform that contextualization). Additionally, we include a generality aspect in our proposed framework: its main features should be easy to adopt in any dialogue system, regardless of the solution used to manage the dialogue. Once we define the theoretical basis of our contextualization framework, we propose two case studies of its application in a spoken dialogue system. We focus on two aspects of the interaction: the contextualization of the speech recognition models, and the incorporation of user-specific information into the dialogue flow. One of the modules of a dialogue system most amenable to contextualization is the speech recognizer. This module makes use of several models to emit a recognition hypothesis from the user's speech signal. Generally speaking, a recognition system considers two types of models: an acoustic one (modelling each of the phonemes the recognizer has to consider) and a linguistic one (modelling the sequences of words that make sense for the system). In this work we contextualize the language model of the recognition system so that it takes into account the information provided by the user both in his or her current utterance and in the previous ones. These utterances convey information useful for helping the system recognize the next utterance. The contextualization approach we propose consists of a dynamic adaptation of the language model used by the recognition system. We carry out this adaptation by means of a linear interpolation between several models. Instead of training the best interpolation weights, we make them dependent on the conditions of the dialogue. In our approach, the system itself obtains these weights as a function of the reliability of the different elements of information available, such as the semantic concepts extracted from the user's utterance, the actions he or she wants to carry out, the information provided in previous interactions, and so on. One of the aspects most frequently addressed in Human-Computer Interaction research is the inclusion of user-specific characteristics in the information structures managed by the system. The idea is to take into account the features that make each user different from the others in order to offer each particular user different services (or the same service, but in a different way). We can consider this approach a user-dependent contextualization of the system. In our work we propose the definition of a user model containing all the information about each user that could be potentially useful to the system at a given moment of the interaction. In particular, we analyze the actions each user carries out throughout his or her interactions. The objective is to determine which of these actions become the preferences of that user. We represent the specific information of each user as a feature vector, where each characteristic the system takes into account has an associated confidence score.
With these elements, we propose a probabilistic definition of a user preference, as the action whose likelihood of being addressed by the user is greater than that of the rest of the actions. To include the user-dependent information in the dialogue flow, we modify the information structures on which the dialogue manager relies to retrieve information that may be needed to resolve the actions addressed by the user. Usage preferences become another source of contextual information considered by the system towards a more efficient interaction, since the new information source helps decrease the system's need to ask users for additional information, thus reducing the number of turns needed to carry out a specific action. To test the benefits of the proposed contextualization framework, we carry out an evaluation of the two aforementioned strategies. We gather several performance metrics, both objective and subjective, that allow us to compare the improvements of a contextualized system against the baseline one. We also gather the users' opinions regarding their perception of the system's behaviour and its degree of adaptation to the specific features of each interaction.
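A compact sketch of this probabilistic preference definition, under assumed action names and confidence scores, is given below: each observed action accumulates confidence-weighted evidence, and the preference is the action with the highest normalized likelihood.

```python
# Sketch of the confidence-weighted user-preference model described above.
from collections import defaultdict

class UserModel:
    def __init__(self):
        self.evidence = defaultdict(float)

    def observe(self, action, confidence=1.0):
        """Accumulate confidence-weighted evidence for a requested action."""
        self.evidence[action] += confidence

    def preference(self):
        """The action whose likelihood exceeds that of every other action."""
        total = sum(self.evidence.values())
        if total == 0:
            return None
        return max(self.evidence, key=lambda a: self.evidence[a] / total)

user = UserModel()
for action, conf in [("play_radio", 0.9), ("play_radio", 0.8),
                     ("raise_volume", 0.7)]:
    user.observe(action, conf)
print(user.preference())  # play_radio
```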