978 results for Speech Dialog Systems
Abstract:
In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. These can be used either to represent different genres or tasks found in diverse text sources, or to capture stochastic properties of different linguistic symbol sequences, for example, syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, minimal change to decoding tools is needed. A wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history-dependently adapted multi-level LM modelling both syllable and word sequences. © 2010 IEEE.
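The WFST formulation above generalizes conventional combination schemes such as linear interpolation of component LMs. As a point of reference, here is a minimal Python sketch of plain linear interpolation between two bigram tables; the tables, weights, and the `interpolate` helper are illustrative and are not the paper's WFST implementation:

```python
# Illustrative linear interpolation of two n-gram LMs (not the paper's
# WFST-based method). Each LM maps a bigram (history, word) to P(word|history).
lm_broadcast = {("the", "news"): 0.20, ("the", "game"): 0.05}
lm_sports    = {("the", "news"): 0.02, ("the", "game"): 0.30}

def interpolate(lms, weights, history, word):
    """P(word|history) as a weighted sum over component LMs.

    Weights must sum to 1; missing n-grams contribute probability 0
    (a real system would back off to lower-order estimates instead).
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * lm.get((history, word), 0.0)
               for lm, w in zip(lms, weights))

# Genre weights could be tuned per show or adapted on the fly, which is
# what unsupervised adaptation automates.
p = interpolate([lm_broadcast, lm_sports], [0.7, 0.3], "the", "game")
print(f"P(game | the) = {p:.3f}")  # 0.7*0.05 + 0.3*0.30 = 0.125
```

In the WFST approach, the same combination is expressed through well-defined transducer operations, which is why the decoder itself needs little modification.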
Abstract:
We present a system for keyword search on Cantonese conversational telephony audio, collected for the IARPA Babel program, that achieves good performance by combining postings lists produced by diverse speech recognition systems from three different research groups. We describe the keyword search task, the data on which the work was done, four different speech recognition systems, and our approach to system combination for keyword search. We show that the combination of four systems outperforms the best single system by 7%, achieving an actual term-weighted value of 0.517. © 2013 IEEE.
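The core idea of combining systems at the posting-list level can be sketched compactly. Below is an illustrative, deliberately simplified merge that treats detections from different systems as the same hit when they overlap in time and rescores merged hits by their mean score; the actual combination and weighting used in the paper may differ:

```python
# Illustrative merge of keyword-search postings lists from several ASR
# systems (not the paper's exact combination method). Each posting is
# (utterance_id, start_time, score); detections from different systems
# count as the same hit when they fall close together in time.

def combine_postings(system_postings, tol=0.5):
    """Average scores of co-located detections across systems.

    Hits within `tol` seconds of each other in the same utterance are
    merged, so keywords found by several systems keep strong scores.
    """
    merged = []  # list of [utt, time, [scores]]
    for postings in system_postings:
        for utt, t, score in postings:
            for hit in merged:
                if hit[0] == utt and abs(hit[1] - t) <= tol:
                    hit[2].append(score)
                    break
            else:
                merged.append([utt, t, [score]])
    return [(utt, t, sum(s) / len(s)) for utt, t, s in merged]

sys_a = [("utt1", 3.2, 0.9), ("utt2", 7.0, 0.4)]
sys_b = [("utt1", 3.4, 0.7)]
print(combine_postings([sys_a, sys_b]))
# approx. [('utt1', 3.2, 0.8), ('utt2', 7.0, 0.4)]
```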
Abstract:
The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (10 hours/language), with 4 languages used for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training-set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language-independent acoustic model test on the target language showed that at least some retraining or adaptation of the acoustic models to the target language is currently needed to achieve reasonable performance. © 2013 IEEE.
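In a Tandem configuration, the activations of a narrow bottleneck layer from a (here multilingual) neural network are appended to the standard acoustic features, and the GMM-HMM is trained on the concatenated vectors. A minimal numpy sketch of that feature construction follows, with a mock stand-in for the bottleneck network and assumed dimensionalities:

```python
import numpy as np

# Illustrative Tandem feature construction (shapes and the bottleneck
# network itself are hypothetical). A multilingual DNN is trained with a
# narrow "bottleneck" hidden layer; its activations are appended to the
# standard acoustic features before GMM-HMM training.

rng = np.random.default_rng(0)
n_frames = 100
plp = rng.normal(size=(n_frames, 39))        # e.g. PLP + deltas, 39-dim

def bottleneck_features(frames, dim=26):
    """Stand-in for the multilingual DNN's bottleneck activations."""
    W = rng.normal(size=(frames.shape[1], dim))
    return np.tanh(frames @ W)               # one linear+tanh layer as a mock

tandem = np.hstack([plp, bottleneck_features(plp)])
print(tandem.shape)                           # (100, 65): 39 + 26 dims
```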
Abstract:
Continuous advances in the development of information technologies have now made it possible to access educational content from anywhere, at any time, and almost instantaneously. However, accessibility is not always considered a primary criterion in the design of educational applications, especially with regard to facilitating their use by people with disabilities. Different technologies have recently emerged to promote the accessibility of new technologies and mobile devices, favouring more natural communication with educational systems. This article describes the innovative use of multimodal dialog systems in the field of education, with special emphasis on describing the advantages they offer for creating educational applications that are inclusive and adapted to students' progress.
Abstract:
The applications of Automatic Vowel Recognition (AVR), a sub-task of fundamental importance in most speech processing systems, range from automatic interpretation of spoken language to biometrics. State-of-the-art systems for AVR are based on traditional machine learning models such as Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs); however, such classifiers cannot offer both efficiency and effectiveness at the same time, leaving a gap to be explored when real-time processing is required. In this work, we present an algorithm for AVR based on the Optimum-Path Forest (OPF), an emergent pattern recognition technique recently introduced in the literature. Adopting a supervised training procedure and using speech tags from two public datasets, we observed that OPF outperformed ANNs, SVMs, and other classifiers in terms of training time and accuracy. © 2010 IEEE.
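For readers unfamiliar with OPF, its classification step is compact: each training sample s carries an optimal path cost C(s) learned from a set of prototypes, and a test sample t receives the label of the training sample minimizing max(C(s), d(s, t)) under the f_max path-cost function. A minimal sketch with toy data and hypothetical trained costs (prototype selection and the training pass are omitted):

```python
import numpy as np

# Illustrative OPF classification step. Training assigns each training
# sample s an optimal path cost C(s) from the prototype set (omitted
# here); a test sample t then takes the label of the training sample
# minimizing max(C(s), d(s, t)) under the f_max path-cost function.

train_x = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([0, 0, 1, 1])
# Hypothetical trained costs; prototypes (cost 0) sit on the class boundary.
train_cost = np.array([0.5, 0.0, 0.0, 0.4])

def opf_classify(t):
    d = np.linalg.norm(train_x - t, axis=1)   # Euclidean distances
    path_cost = np.maximum(train_cost, d)     # f_max path extension
    return train_y[np.argmin(path_cost)]

print(opf_classify(np.array([0.8, 0.2])))  # 0 (conquered by class 0)
print(opf_classify(np.array([5.5, 4.8])))  # 1
```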
Abstract:
This work presents DialogBuilder, an open-source tool written in Java that provides the user with an interface for designing dialog systems and exporting them for deployment on Asterisk, the most popular VoIP framework. DialogBuilder offers a wizard so that a lay user can design a system without needing to learn to program for Asterisk. The software separates the dialog design phase from its coding and is positioned to make it technically and economically feasible, even for small companies, to build and maintain dialog systems for telephony applications.
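The kind of dialog-to-dialplan translation such a tool performs can be illustrated in a few lines. The sketch below generates an Asterisk extensions.conf fragment for a simple IVR menu; the rendering function and output layout are hypothetical and are not DialogBuilder's actual export code:

```python
# Hypothetical illustration of translating a dialog design into an
# Asterisk dialplan (DialogBuilder's real export format is not shown in
# the abstract). Generates an extensions.conf fragment for one menu.

def menu_to_dialplan(context, prompt, options):
    """Render an IVR menu as Asterisk dialplan lines.

    `options` maps a DTMF digit to the context the call should jump to.
    """
    lines = [f"[{context}]",
             "exten => s,1,Answer()",
             f"exten => s,n,Background({prompt})",
             "exten => s,n,WaitExten(5)"]
    for digit, target in sorted(options.items()):
        lines.append(f"exten => {digit},1,Goto({target},s,1)")
    return "\n".join(lines)

print(menu_to_dialplan("main-menu", "welcome-prompt",
                       {"1": "sales", "2": "support"}))
```

This mirrors the design goal stated above: the dialog is specified as data (context, prompt, options), and the Asterisk-specific coding is generated rather than written by hand.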
Abstract:
In many science fiction movies, machines are capable of speaking with humans; however, mankind is still far from having machines of that kind, like the famous character C3PO from Star Wars. Over the last six decades, automatic speech recognition systems have been the target of many studies, and throughout these years many techniques were developed for use in both software and hardware applications. Among the many types of automatic speech recognition system, the one used in this work is a speaker-independent, isolated-word system using Hidden Markov Models for recognition. The goal of this work is to design and synthesize the first two steps of the speech recognition system: speech signal acquisition and pre-processing of the signal. Both steps were developed on a reprogrammable component, an FPGA, using the VHDL hardware description language, owing to the high performance of this component and the flexibility of the language. This work presents the underlying theory of digital signal processing, such as Fast Fourier Transforms and digital filters, as well as the theory of speech recognition using Hidden Markov Models and the LPC processor. It also presents the results obtained for each of the blocks synthesized and verified in hardware.
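The synthesized pre-processing step corresponds, in software terms, to the classic front-end chain of pre-emphasis, framing, windowing, and FFT. A numpy sketch of that chain is given below; the parameter values are common defaults, not necessarily the thesis's hardware settings:

```python
import numpy as np

# Software sketch of the classic pre-processing chain the thesis maps to
# hardware: pre-emphasis filter, framing, Hamming window, FFT magnitude.
# Parameter values here are common defaults, not the thesis's settings.

fs = 8000                                    # sample rate (Hz), assumed
signal = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s test tone

emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

frame_len, hop = 200, 80                     # 25 ms frames, 10 ms hop
n_frames = 1 + (len(emphasized) - frame_len) // hop
frames = np.stack([emphasized[i*hop : i*hop + frame_len]
                   for i in range(n_frames)])

windowed = frames * np.hamming(frame_len)
spectrum = np.abs(np.fft.rfft(windowed, n=256))  # per-frame magnitude FFT
print(spectrum.shape)                             # (n_frames, 129)
```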
Abstract:
It is easy to get frustrated at spoken conversational agents (SCAs), perhaps because they seem to be callous. By and large, the quality of human-computer interaction suffers from the inability of SCAs to recognise and adapt to the user's emotional state. With the mass appeal of artificially-mediated communication, there is an increasing need for SCAs to be socially and emotionally intelligent, that is, to infer and adapt to their human interlocutors' emotions on the fly, in order to achieve an affective, empathetic and naturalistic interaction. An enhanced quality of interaction would reduce users' frustration and consequently increase their satisfaction. These reasons have motivated the development of SCAs that include socio-emotional elements, turning them into affective and socially-sensitive interfaces. One barrier to the creation of such interfaces has been the lack of methods for modelling emotions in a task-independent environment. Most emotion models for spoken dialog systems are task-dependent and thus cannot be used as-is in different applications. This thesis focuses on improving this situation, concerning the computational modelling of emotion, personality and their interrelationship for task-independent autonomous SCAs. The generation of emotion is driven by needs, inspired by human motivational systems. The work in this thesis is organised in three stages, each with its own contribution. The first stage involved defining, integrating and quantifying the psychologically-based motivational and emotional models on which the work draws. These were then turned into a computational model by implementing them as software entities. The computational model was then incorporated into, and put to the test with, an existing SCA host, a HiFi-control agent. The second stage concerned automatic prediction of affect, which has been the main challenge towards the greater aim of infusing social intelligence into the HiFi agent. In recent years, studies on affect detection from voice have moved on to using realistic, non-acted data, in which emotions are subtler. Subtler emotions are more challenging to perceive, as demonstrated in tasks such as labelling and machine prediction. In this stage, we addressed part of this challenge by considering the roles of user satisfaction ratings and conversational/dialog features as the respective target and predictors in discriminating between contentment and frustration, two types of emotions known to be prevalent within spoken human-computer interaction. The final stage concerned the evaluation of the emotional model through the HiFi agent. A series of user studies with 70 subjects was conducted in a real-time environment, each in a different phase and with its own conditions. All the studies involved comparisons between the baseline non-modified agent and the modified agent. The findings have gone some way towards enhancing our understanding of the utility of emotion in spoken dialog systems in several ways: first, an SCA should not express its emotions blindly, however positive; rather, it should adapt its emotions to user states. Second, low performance in an SCA may be compensated for by the exploitation of emotion. Third, expressing emotion through prosody can improve users' perceptions of an SCA more than expressing it through lexical content alone.
Taken together, these findings not only support the success of the emotional model, but also provide substantial evidence of the benefits of adding emotion to an SCA, especially in mitigating users' frustration and ultimately improving their satisfaction.
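The affect-prediction stage described above, discriminating contentment from frustration using dialog features as predictors and user satisfaction ratings as the target, can be illustrated with a small classifier. The sketch below uses scikit-learn with invented feature names and toy data; the thesis's actual feature set comes from real HiFi-agent interactions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative predictor of frustration vs. contentment from dialog-level
# features (feature set and data are invented for this sketch).
# Features per dialog: [n_turns, n_reprompts, n_asr_rejections,
#                       mean_response_delay_s]
X = np.array([[ 4, 0, 0, 0.8],   # smooth dialogs -> contentment (0)
              [ 5, 1, 0, 1.0],
              [12, 4, 3, 2.5],   # repair-heavy dialogs -> frustration (1)
              [10, 3, 2, 2.1],
              [ 3, 0, 0, 0.7],
              [14, 5, 4, 3.0]])
y = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[11, 4, 2, 2.4]]))  # likely [1]: frustration
```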
Abstract:
We describe the work on infusion of emotion into limited-task autonomous spoken conversational agents (SCAs) situated in the domestic environment, using a Need-inspired task-independent Emotion model (NEMO). In order to demonstrate the generation of affect through the use of the model, we describe the work of integrating it with a natural-language mixed-initiative HiFi-control SCA. NEMO and the host system communicate externally, removing the need for the Dialog Manager to be modified, as is done in most existing dialog systems, in order to be adaptive. We also summarize the work on automatic affect prediction, namely frustration and contentment, from dialog features, a non-conventional source, in an attempt to move towards a more user-centric approach.
Abstract:
Current text-to-speech systems are developed using studio-recorded speech in a neutral style or based on acted emotions. However, the proliferation of media-sharing sites makes it possible to develop a new generation of speech-based systems that can cope with spontaneous and styled speech. This paper proposes an architecture to deal with realistic recordings and carries out some experiments on unsupervised speaker diarization. In order to maximize the speaker purity of the clusters while keeping high speaker coverage, the paper evaluates the F-measure of a diarization module, achieving high scores (>85%), especially when the clusters are longer than 30 seconds, even for the more spontaneous and expressive styles (such as talk shows or sports).
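A purity/coverage F-measure of the kind evaluated here can be computed from frame-level labels as follows. This is a minimal sketch; the paper's exact definitions of purity and coverage may differ in detail:

```python
from collections import Counter

# Illustrative purity/coverage F-measure for speaker diarization.
# ref[i] is the true speaker of frame i; hyp[i] is the cluster the
# diarizer assigned it to.
ref = ["A", "A", "A", "B", "B", "C", "C", "C"]
hyp = [ 1,   1,   2,   2,   2,   3,   3,   3 ]

def purity(labels, clusters):
    """Fraction of frames matching their cluster's dominant label."""
    total = 0
    for c in set(clusters):
        members = [l for l, h in zip(labels, clusters) if h == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)

p = purity(ref, hyp)          # cluster purity
c = purity(hyp, ref)          # speaker coverage (roles swapped)
f = 2 * p * c / (p + c)       # harmonic mean of the two
print(f"purity={p:.2f} coverage={c:.2f} F={f:.2f}")
```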
Abstract:
The objective of the present project is to provide a pronunciation and vocabulary-review activity in English for the Moodle platform hosted on the Integrated Language Learning Lab (ILLLab) website. The ILLLab website aims to provide students at the EUIT of Telecommunication in the UPM with activities to develop their A2 level according to the Common European Framework of Reference for Languages (CEFR). On the platform, students can work independently to advance towards a B2 level in English. The UPM requires this level of English proficiency for enrolling in the compulsory subject English for Professional and Academic Communication (EPAC), taught in the seventh semester of the Degree in Telecommunications Engineering. Likewise, this project tries to address the problem of the scarce speaking activities included in learning platforms that offer language courses, specifically English courses. For this purpose, it provides a tool based on speech recognition systems so that the user can practice the pronunciation of English words. The first chapter of the project introduces the application Traffic Lights, explaining its origins and what it consists of. The second chapter deals with theoretical aspects of speech recognition and describes its main functions and the current applications for which it is used. The third chapter provides a detailed explanation of the different programming languages used in the implementation of the project, as well as of the code developed.
The fourth chapter presents an application user manual, showing the user how the application works along with an example of use. In addition, several sections are included for the application administrator, specifying how to add new words to the database and how to change the original settings, such as the estimated time the user has to finish a round of the game.
Abstract:
We describe the work on infusion of emotion into a limited-task autonomous spoken conversational agent situated in the domestic environment, using a need-inspired task-independent emotion model (NEMO). In order to demonstrate the generation of affect through the use of the model, we describe the work of integrating it with a natural-language mixed-initiative HiFi-control spoken conversational agent (SCA). NEMO and the host system communicate externally, removing the need for the Dialog Manager to be modified, as is done in most existing dialog systems, in order to be adaptive. The first part of the paper concerns the integration between NEMO and the host agent. The second part summarizes the work on automatic affect prediction, namely frustration and contentment, from dialog features, a non-conventional source, in an attempt to move towards a more user-centric approach. The final part reports the evaluation results obtained from a user study in which both versions of the agent (non-adaptive and emotionally-adaptive) were compared. The results provide substantial evidence of the benefits of adding emotion to a spoken conversational agent, especially in mitigating users' frustration and, ultimately, improving their satisfaction.
Abstract:
Thesis--University of Illinois at Urbana-Champaign.
Abstract:
Thesis (M.A.)--University of Illinois at Urbana-Champaign.
Abstract:
Primary objective: The aims of this preliminary study were to explore the suitability for and benefits of commencing dysarthria treatment for people with traumatic brain injury (TBI) while in post-traumatic amnesia (PTA). It was hypothesized that behaviours in PTA do not preclude participation and that dysarthria characteristics would improve post-treatment. Research design: A series of comprehensive case analyses. Methods and procedures: Two participants with severe TBI received dysarthria treatment focused on motor speech deficits until emergence from PTA. A checklist of neurobehavioural sequelae of TBI was rated during therapy, and perceptual and motor speech assessments were administered before and after therapy. Main outcomes and results: Results revealed that certain behaviours affected the quality of therapy but did not preclude its provision. Treatment resulted in physiological improvements in some speech sub-systems for both participants, with varying functional speech outcomes. Conclusions: These findings suggest that dysarthria treatment can begin, and provide short-term benefits to speech production, during the late stages of PTA post-TBI.