961 results for Automatic speech recognition (ASR)
Abstract:
The purpose of this paper is to survey and assess the state-of-the-art in automatic target recognition for synthetic aperture radar imagery (SAR-ATR). The aim is not to develop an exhaustive survey of the voluminous literature, but rather to capture in one place the various approaches for implementing the SAR-ATR system. This paper is meant to be as self-contained as possible, and it approaches the SAR-ATR problem from a holistic end-to-end perspective. A brief overview of the breadth of the SAR-ATR challenges is provided. This is couched in terms of a single-channel SAR, and it is extendable to multi-channel SAR systems. Stages pertinent to the basic SAR-ATR system structure are defined, and the motivations of the requirements and constraints on the system constituents are addressed. For each stage in the SAR-ATR processing chain, a taxonomization methodology for surveying the numerous methods published in the open literature is proposed. Carefully selected works from the literature are presented under the taxa proposed. Novel comparisons, discussions, and comments are pinpointed throughout this paper. A two-fold benchmarking scheme for evaluating existing SAR-ATR systems and motivating new system designs is proposed. The scheme is applied to the works surveyed in this paper. Finally, a discussion is presented in which various interrelated issues, such as standard operating conditions, extended operating conditions, and target-model design, are addressed. This paper is a contribution toward fulfilling an objective of end-to-end SAR-ATR system design.
Abstract:
OBJECTIVE: Cochlear implantation (CI) is a standard treatment for severe-profound sensorineural hearing loss (SNHL). However, consensus has yet to be reached on its effectiveness for hearing loss caused by auditory neuropathy spectrum disorder (ANSD). This review aims to summarize and synthesize current evidence of the effectiveness of CI in improving speech recognition in children with ANSD. DESIGN: Systematic review. STUDY SAMPLE: A total of 27 studies from an initial selection of 237. RESULTS: All selected studies were observational in design, including case studies, cohort studies, and comparisons between children with ANSD and SNHL. Most children with ANSD achieved open-set speech recognition with their CI. Speech recognition ability was found to be equivalent in CI users (who previously performed poorly with hearing aids) and hearing-aid users. Outcomes following CI generally appeared similar in children with ANSD and SNHL. Assessment of study quality, however, suggested substantial methodological concerns, particularly in relation to issues of bias and confounding, limiting the robustness of any conclusions around effectiveness. CONCLUSIONS: Currently available evidence is compatible with favourable outcomes from CI in children with ANSD. However, this evidence is weak. Stronger evidence is needed to support cost-effective clinical policy and practice in this area.
Abstract:
Automatic speech recognition by machine has been a target of researchers for the past five decades. In this period there have been numerous advances, such as in the recognition of isolated words (commands), which currently achieves very high recognition rates. However, we are still far from developing a system with performance similar to a human being's (automatic continuous speech recognition). One of the great challenges of research on continuous speech recognition is the large number of patterns. Modern languages such as English, French, Spanish, and Portuguese have approximately 500,000 words, or patterns, to be identified. The purpose of this study is to use units smaller than the word, such as phonemes, syllables, and diphones, as the basis for speech recognition, aiming to recognize any word without necessarily having trained on it. The main goal is to reduce the restriction imposed by the excessive number of patterns. In order to validate this proposal, the system was tested on speaker-dependent isolated word recognition. The phonemic characteristics of Brazilian Portuguese were used to develop the hierarchical decision system. These decisions are made through the use of SVMs (Support Vector Machines). The main speech features used were obtained from the Wavelet Packet Transform; MFCC (Mel-Frequency Cepstral Coefficient) descriptors are also used in this work. It was concluded that the method proposed in this work showed good results in the recognition of vowels, consonants (syllables), and words when compared with other existing methods in the literature.
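As a rough illustration of one node in such a hierarchical decision system (a simplified assumption, not the thesis's exact design; the wavelet-packet features are omitted), a binary SVM can be trained on summarized MFCC features to route a segment down, say, the vowel or consonant branch:

```python
# Minimal sketch of one node of a hierarchical phoneme classifier:
# MFCC features feed a binary SVM. Data and labels are synthetic stand-ins.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def mfcc_vector(y, sr=16000, n_mfcc=13):
    """Summarize one segment as its mean MFCCs (a fixed-length vector)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Synthetic stand-ins for labeled segments (1 = vowel-like, 0 = noise-like).
rng = np.random.default_rng(0)
t = np.linspace(0, 0.2, 3200, endpoint=False)
segments = [np.sin(2 * np.pi * 220 * t), np.sin(2 * np.pi * 300 * t),
            rng.normal(size=3200), rng.normal(size=3200)]
labels = [1, 1, 0, 0]

X = np.stack([mfcc_vector(s.astype(np.float32)) for s in segments])
node = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
node.fit(X, labels)
print(node.predict(X))  # routes each segment down the vowel or consonant branch
```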
Abstract:
Speech recognition and synthesis systems are composed of language-dependent modules and, while many public resources exist for some languages (e.g., English and Japanese), resources for Brazilian Portuguese (BP) are still scarce. Another issue is that, for a large number of tasks, the error rate of current speech recognition systems is still high when compared to that achieved by humans. Thus, despite the success of hidden Markov models (HMMs), research into new methods is needed. This work is motivated by these two facts and is divided into two parts. The first describes the development of free resources and tools for speech recognition and synthesis in BP, consisting of audio and text corpora, a phonetic dictionary, a grapheme-to-phoneme converter, a syllabifier, and acoustic and language models. All the resources built are publicly available and, together with a proposed programming interface, have been used to develop several new real-time applications, including a speech recognition module for the OpenOffice.org office suite. Performance tests of the developed systems are presented. The resources produced and released here make it easier for other research groups, developers, and industry to adopt speech technology for BP. The second part of the work presents a new method for rescoring the recognition output of HMM-based systems, which is organized in a lattice data structure. More specifically, the system uses discriminative classifiers that aim to reduce the confusion between pairs of phones. For each of these binary problems, automatic feature selection techniques are used to choose the most suitable parametric representation for the problem at hand.
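The rescoring idea in the second part can be sketched as follows, with invented data and interfaces (not the thesis's actual system): a discriminative classifier trained for one confusable phone pair blends its posterior with the HMM scores of competing lattice arcs.

```python
# Sketch of pairwise discriminative rescoring: an HMM proposes competing
# phone labels; a per-pair classifier arbitrates. All data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic acoustic feature vectors for one confusable pair, e.g. /p/ vs /b/.
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 12)),   # /p/ examples
                     rng.normal(0.8, 1.0, (50, 12))])  # /b/ examples
y_train = np.array([0] * 50 + [1] * 50)
clf_p_b = SVC(probability=True).fit(X_train, y_train)

def rescore(hmm_score_p, hmm_score_b, features, weight=0.5):
    """Blend HMM arc scores with the pair classifier's posterior (log domain)."""
    post_b = clf_p_b.predict_proba(features.reshape(1, -1))[0, 1]
    post_p = 1.0 - post_b
    return (hmm_score_p + weight * np.log(post_p + 1e-10),
            hmm_score_b + weight * np.log(post_b + 1e-10))

print(rescore(-42.0, -41.5, rng.normal(0.8, 1.0, 12)))
```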
Abstract:
Automatic speech recognition has become increasingly useful and feasible. For languages such as English, excellent recognizers are available on the market. The situation is not the same for Brazilian Portuguese, however, where the main desktop dictation recognizers that once existed have been discontinued. This dissertation is aligned with the goals of the Signal Processing Laboratory (LaPS) of the Federal University of Pará: the development of an automatic speech recognizer for Brazilian Portuguese. More specifically, the main contributions of this dissertation are the development of several resources needed to build a recognizer, such as transcribed audio corpora and an API for application development, and the development of two applications: one for desktop dictation and another for automated call-center service. Coruja is the system developed at LaPS for speech recognition in Brazilian Portuguese; besides containing all the resources needed to provide speech recognition in Brazilian Portuguese, it offers an API for application development. The application developed for desktop dictation and text editing is SpeechOO, which enables dictation in the Writer tool of the LibreOffice suite, as well as text editing and formatting through voice commands. Another contribution of this work is the use of automatic speech recognition in call centers: Coruja was integrated with the Asterisk software, and the main application developed was a voice-enabled interactive voice response unit for a national call center that handles more than three thousand calls daily.
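As a hedged illustration of what an application-facing API of this kind enables (the class and method names below are invented for the sketch and are not Coruja's actual interface), a callback-based recognizer might be consumed like this:

```python
# Hypothetical shape of an application-facing ASR API; a dictation plug-in or
# an IVR registers a callback and feeds audio. Not Coruja's real interface.
class Recognizer:
    """Toy stand-in for an engine wrapping acoustic and language models."""
    def __init__(self, acoustic_model, language_model):
        self.am, self.lm = acoustic_model, language_model
        self.listeners = []

    def on_result(self, callback):
        # Applications register callbacks for recognition hypotheses.
        self.listeners.append(callback)

    def feed_audio(self, frames):
        # A real engine would decode here; we emit a canned hypothesis.
        hypothesis = "texto reconhecido"
        for cb in self.listeners:
            cb(hypothesis)

rec = Recognizer("pt_BR.am", "pt_BR.lm")
rec.on_result(lambda text: print("dictated:", text))
rec.feed_audio(b"\x00\x00")
```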
Abstract:
This work proposes a solution comprising a cloud-based automatic speech recognition system. With it, there is no need for a recognizer running on the client machine itself, since the recognizer is available over the Internet. Besides cloud-based speech recognition, the other focus of this work is high availability. This topic matters because the server environment where the cloud recognition runs cannot become unavailable to the user. Of the many aspects that require robustness, such as the Internet connection itself, the scope of this work was defined as the free software that allows companies to increase the availability of their services. Among the results achieved, and for the simulated conditions, the cloud speech recognizer developed by the group reached performance close to Google's.
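A rough sketch of the high-availability idea from the client's perspective, assuming two hypothetical replicated endpoints (the thesis evaluates server-side free software for availability; this client-side fallback is only an illustration):

```python
# Illustrative failover client for a replicated cloud ASR service: try each
# replica in order and fall back on failure. Endpoint URLs are hypothetical.
import urllib.request

REPLICAS = ["http://asr-primary.example/recognize",
            "http://asr-backup.example/recognize"]

def recognize(audio_bytes, timeout=2.0):
    last_error = None
    for url in REPLICAS:
        try:
            req = urllib.request.Request(url, data=audio_bytes,
                                         headers={"Content-Type": "audio/wav"})
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.read().decode("utf-8")
        except OSError as e:  # connection refused, timeout, HTTP errors
            last_error = e
    raise RuntimeError(f"all ASR replicas failed: {last_error}")

# Usage (would need live endpoints): text = recognize(wav_bytes)
```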
Abstract:
In many science fiction movies, machines were capable of speaking with humans. However, mankind is still far from building such machines, like the famous character C-3PO of Star Wars. During the last six decades, automatic speech recognition systems have been the target of many studies. Throughout these years, many techniques were developed for use in both software and hardware applications. There are many types of automatic speech recognition systems; the one used in this work is a speaker-independent isolated word system, using Hidden Markov Models for recognition. The goal of this work is to design and synthesize the first two stages of the speech recognition system: speech signal acquisition and pre-processing of the signal. Both stages were developed on a reprogrammable component called an FPGA, using the VHDL hardware description language, owing to the high performance of this component and the flexibility of the language. This work presents the relevant theory of digital signal processing, such as Fast Fourier Transforms and digital filters, as well as the theory of speech recognition using Hidden Markov Models and an LPC processor. It also presents the results obtained for each of the blocks synthesized and verified in hardware.
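For orientation, the pre-processing stage targeted for hardware can be mirrored in software; the sketch below shows typical pre-emphasis, framing, windowing, and FFT steps with illustrative parameters, not the exact configuration synthesized on the FPGA:

```python
# Software mirror of a speech front end (pre-emphasis, framing, Hamming
# window, FFT); parameters are common textbook choices, assumed here.
import numpy as np

def preprocess(signal, frame_len=256, hop=128, alpha=0.95):
    # Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       * np.hamming(frame_len) for i in range(n_frames)])
    # Magnitude spectrum per frame, as a hardware FFT block would compute.
    return np.abs(np.fft.rfft(frames, axis=1))

spectra = preprocess(np.random.randn(16000))  # one second at 16 kHz
print(spectra.shape)                          # (frames, frequency bins)
```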
Abstract:
Crowdsourcing linguistic phenomena with smartphone applications is relatively new. Apps have been used to train acoustic models for automatic speech recognition (de Vries et al. 2014) and to archive endangered languages (Iwaidja Inyaman Team 2012). Leemann and Kolly (2013) developed a free iOS app, Dialäkt Äpp (DÄ, >78k downloads), to document language change in Swiss German. Here, we present results on sound change based on DÄ data. DÄ predicts the user's dialect: for 16 variables, users select their dialectal variant. DÄ then tells users which dialect they speak. Underlying this prediction are maps from the Linguistic Atlas of German-speaking Switzerland (SDS, 1962-2003), which documents the linguistic situation around 1950. If the prediction is wrong, users indicate their actual dialect. With this information, the 16 variables can be assessed for language change. Results revealed robustness of phonetic variables; lexical and morphological variables were more prone to change. Phonetic variables like 'to lift' (variants /lupf, lüpf, lipf/) revealed SDS agreement scores of nearly 85%, i.e., little sound change. Not all phonetic variables are equally robust: 'ladle' (five variants beginning with /x/) exhibited significant sound change. We will illustrate the results using maps that show details of the sound changes at hand.
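The prediction step lends itself to a compact sketch: score each SDS dialect region by its agreement with the user's 16 chosen variants and return the best match. The map data below is invented for illustration; only the 'to lift' variants come from the abstract:

```python
# Toy dialect prediction by agreement with SDS map variants. Regions and the
# 'ladle' variants are invented; a real app would cover all 16 variables.
SDS_MAPS = {
    "Bern":   {"to_lift": "lupf", "ladle": "xelle"},
    "Zurich": {"to_lift": "lupf", "ladle": "xalle"},
    "Valais": {"to_lift": "lipf", "ladle": "xu"},
}

def predict_dialect(user_choices):
    """Return the region whose SDS variants agree with most user choices."""
    def agreement(region):
        return sum(SDS_MAPS[region].get(var) == choice
                   for var, choice in user_choices.items())
    return max(SDS_MAPS, key=agreement)

print(predict_dialect({"to_lift": "lipf", "ladle": "xu"}))  # -> Valais
```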
Abstract:
This thesis has focused on making Voice Activity Detection robust in adverse acoustic environments, in order to improve the behavior of many vocal applications, for example telephony applications based on automatic speech recognition, automatic transcription systems, multichannel systems, and so on. In particular, although all types of noise have been taken into account, this research takes special interest in the study of background voices, i.e., pronunciations coming from far-field speakers, the main error source of most activity detectors today. The work takes as its starting point a Voice Activity Detector based on Hidden Markov Models whose feature vector contains two components: normalized energy and delta energy. The key contributions of this thesis are the following: 1) extension of the initial feature vector to provide spectral information, 2) adjustment of the Hidden Markov Models to the environment and study of different Hidden Markov Model topologies, and, finally, 3) study and inclusion of new features, distinct from those in point 1, to reject pronunciation pulses coming from far-field speakers. Detection results, taking into account the three points above, clearly show the progress made and are significantly better than the results obtained, under the same conditions, by other well-known reference voice activity detectors.
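A minimal sketch of the per-frame feature vector described above, with a spectral centroid standing in for the thesis's (unspecified here) spectral extension:

```python
# Per-frame VAD features: normalized log energy, its delta, and one simple
# spectral component. Illustrative only; not the thesis's exact features.
import numpy as np

def vad_features(frames, sr=8000):
    eps = 1e-10
    energy = np.log(np.sum(frames ** 2, axis=1) + eps)
    energy_norm = energy - energy.max()  # normalize to the utterance maximum
    delta = np.diff(energy_norm, prepend=energy_norm[0])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    centroid = (spectrum * freqs).sum(axis=1) / (spectrum.sum(axis=1) + eps)
    return np.column_stack([energy_norm, delta, centroid])  # feeds the HMM VAD

frames = np.random.randn(100, 200)  # 100 frames of 25 ms at 8 kHz
print(vad_features(frames).shape)   # (100, 3)
```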
Abstract:
As the telecommunications industry evolves over the next decade to provide the products and services that people will desire, several key technologies will become commonplace. Two of these, automatic speech recognition and text-to-speech synthesis, will provide users with more freedom on when, where, and how they access information. While these technologies are currently in their infancy, their capabilities are rapidly increasing and their deployment in today's telephone network is expanding. The economic impact of just one application, the automation of operator services, is well over $100 million per year. Yet there still are many technical challenges that must be resolved before these technologies can be deployed ubiquitously in products and services throughout the worldwide telephone network. These challenges include: (i) High level of accuracy. The technology must be perceived by the user as highly accurate, robust, and reliable. (ii) Easy to use. Speech is only one of several possible input/output modalities for conveying information between a human and a machine, much like a computer terminal or Touch-Tone pad on a telephone. It is not the final product. Therefore, speech technologies must be hidden from the user. That is, the burden of using the technology must be on the technology itself. (iii) Quick prototyping and development of new products and services. The technology must support the creation of new products and services based on speech in an efficient and timely fashion. In this paper I present a vision of the voice-processing industry with a focus on the areas with the broadest base of user penetration: speech recognition, text-to-speech synthesis, natural language processing, and speaker recognition technologies. The current and future applications of these technologies in the telecommunications industry will be examined in terms of their strengths, limitations, and the degree to which user needs have been or have yet to be met. Although noteworthy gains have been made in areas with potentially small user bases and in the more mature speech-coding technologies, these subjects are outside the scope of this paper.
Abstract:
This work focuses on Machine Translation (MT) and Speech-to-Speech Translation, two emerging technologies that allow users to automatically translate written and spoken texts. The first part of this work provides a theoretical framework for the evaluation of Google Translate and Microsoft Translator, which is at the core of this study. Chapter one focuses on Machine Translation, providing a definition of this technology and glimpses of its history. In this chapter we will also learn how MT works, who uses it, for what purpose, what its pros and cons are, and how machine translation quality can be defined and assessed. Chapter two deals with Speech-to-Speech Translation by focusing on its history, characteristics and operation, potential uses and limits deriving from the intrinsic difficulty of translating spoken language. After describing the future prospects for SST, the final part of this chapter focuses on the quality assessment of Speech-to-Speech Translation applications. The last part of this dissertation describes the evaluation test carried out on Google Translate and Microsoft Translator, two mobile translation apps also providing a Speech-to-Speech Translation service. Chapter three illustrates the objectives, the research questions, the participants, the methodology and the elaboration of the questionnaires used to collect data. The collected data and the results of the evaluation of the automatic speech recognition subsystem and the language translation subsystem are presented in chapter four and finally analysed and compared in chapter five, which provides a general description of the performance of the evaluated apps and possible explanations for each set of results. In the final part of this work suggestions are made for future research and reflections on the usability and usefulness of the evaluated translation apps are provided.
Abstract:
In recent years, the incredibly rapid advance of technology has led to the development and spread of portable electronic devices with extremely small dimensions and, at the same time, very considerable computational capabilities. More specifically, one particular category of devices, currently undergoing strong development and already present on the world market, is certainly that of wearable devices. As the name suggests, these are designed to be literally worn, intended to provide continuous support, in various contexts, to those who use them. When the user is not required to use their hands to interact with them, they are called hands-free wearable devices. These are generally able to perceive and capture user input through various non-touch-based techniques and methodologies. One of these is certainly modeling the user's input through their voice, relying on the discipline of ASR (Automatic Speech Recognition), which deals with the translation of spoken language into text by means of computerized devices. This leads to the goal of this thesis: to develop a framework, usable in the context of wearable devices, that provides a speech recognition service by relying on an existing one, in such a way that it offers a certain level of efficiency and ease of use. More generally, this document aims to provide an in-depth description of wearable and hands-free wearable devices, defining their characteristics, critical aspects, and areas of use. It also aims to illustrate the operating principles of Automatic Speech Recognition and then move on to the analysis, design, and development of the aforementioned framework.
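A hypothetical sketch of the delegation pattern such a framework implies: a stable wearable-facing API that hands the actual recognition to a pluggable existing engine (all names invented for illustration, not the thesis's code):

```python
# Toy adapter: the wearable-facing API stays stable while recognition is
# delegated to any backend implementing the Engine protocol.
from typing import Callable, Protocol

class Engine(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class WearableASR:
    def __init__(self, engine: Engine):
        self.engine = engine  # the existing recognizer the framework wraps

    def listen(self, audio: bytes, on_text: Callable[[str], None]) -> None:
        on_text(self.engine.transcribe(audio))

class EchoEngine:  # stand-in for a real backend service
    def transcribe(self, audio: bytes) -> str:
        return f"<{len(audio)} bytes recognized>"

WearableASR(EchoEngine()).listen(b"\x01\x02", print)
```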
Abstract:
This presentation summarizes experience with the automated speech recognition and translation approach realised in the context of the European project EMMA.
Abstract:
This thesis examines the state of audiovisual translation (AVT) in the aftermath of the COVID-19 emergency, highlighting new trends with regard to the implementation of AI technologies as well as their strengths, constraints, and ethical implications. It starts with an overview of the current AVT landscape, focusing on future projections about its evolution and its critical aspects, such as the worsening working conditions lamented by AVT professionals, especially freelancers, in recent years, and how these might be affected by the advent of AI technologies in the industry. The second chapter delves into the history and development of three AI technologies which are used in combination with neural machine translation in automatic AVT tools: automatic speech recognition, speech synthesis, and deepfakes (voice cloning and visual deepfakes for lip syncing), including real examples of start-up companies that utilize them, or are planning to do so, to localize audiovisual content automatically or semi-automatically. The third chapter explores the many ethical concerns around these innovative technologies, which extend far beyond the field of translation; at the same time, it attempts to vindicate their potential to bring about immense progress in terms of accessibility and international cooperation, provided that their use is properly regulated. Lastly, the fourth chapter describes two experiments testing the efficacy of the currently available tools for automatic subtitling and automatic dubbing, respectively, in order to take a closer look at their perks and limitations compared to more traditional approaches. This analysis aims to help discern legitimate concerns from unfounded speculations with regard to the AI technologies that are entering the field of AVT; the intention behind it is to humbly suggest a constructive and optimistic view of the technological transformations that appear to be underway, while also acknowledging their potential risks.
Abstract:
Throughout the years, technology has had an undeniable impact on the AVT field. It has revolutionized the way audiovisual content is consumed by allowing audiences to easily access it at any time and on any device. Especially after the introduction of OTT streaming platforms such as Netflix, Amazon Prime Video, Disney+, Apple TV+, and HBO Max, which offer a vast catalog of national and international products, the consumption of audiovisual products has been on a constant rise and, consequently, so has the demand for localized content. In turn, the AVT industry resorts to new technologies and practices to handle the ever-growing workload and the faster turnaround times. Due to the numerous implications that it has for the industry, technological advancement can be considered an area of research of particular interest for AVT studies. However, in the case of dubbing, research and discussion regarding the topic are lagging behind because of the more limited impact that technology has had on the very conservative dubbing industry. Therefore, the aim of the dissertation is to offer an overview of some of the latest technological innovations and practices that have already been implemented (i.e., cloud dubbing and DeepDub technology) or that are still under development and research (i.e., automatic speech recognition and respeaking, machine translation and post-editing, audio-based and visual-based dubbing techniques, text-based editing of talking-head videos, and automatic dubbing), to discuss their reception by industry professionals, and to offer projections about their future implementation in the dubbing field.