914 results for Speech Recognition System using LPC
Abstract:
Describes the implementation of speech recognition software for Brazilian Portuguese. Among the goals of this work is the construction of a large-vocabulary continuous speech recognition system suitable for real-time applications. The main concepts and characteristics of such systems are presented, along with all the steps required to build them. As part of this work, several resources were produced and made available: acoustic and language models, plus new speech and text corpora. The text corpus has been built through automatic extraction and formatting of newspaper texts from the Internet. In addition, two speech corpora were produced: one based on audiobooks and another recorded specifically to simulate real-time tests. The work also proposes the use of speaker adaptation techniques to address the acoustic mismatch between speech corpora. Finally, an application programming interface is presented that aims to ease the use of the Julius decoder. Performance tests are reported, comparing the developed systems against a commercial software package.
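For readers curious how an application might talk to Julius, the decoder named above: Julius ships a "module mode" that serves recognition results over TCP (by default on port 10500) as XML-like fragments, each terminated by a line holding a single ".". The sketch below is a minimal, hypothetical Python client; it is not the API developed in this work, and the protocol details should be verified against the Julius version in use.

```python
# Minimal sketch: reading recognition results from a Julius decoder
# started in module mode ("julius ... -module"). Port number, message
# framing, and tag names follow Julius's documented module protocol but
# should be treated as assumptions to verify against your version.
import re
import socket

HOST, PORT = "localhost", 10500  # Julius module-mode defaults

def read_messages(sock):
    """Yield one XML-like message at a time (Julius ends each with a '.' line)."""
    buf = ""
    while True:
        data = sock.recv(4096).decode("utf-8", errors="replace")
        if not data:
            return
        buf += data
        while "\n.\n" in buf:
            msg, buf = buf.split("\n.\n", 1)
            yield msg

with socket.create_connection((HOST, PORT)) as sock:
    for msg in read_messages(sock):
        if "<RECOGOUT>" in msg:
            # Extract the recognized words from the WHYPO elements.
            words = re.findall(r'WORD="([^"]*)"', msg)
            print("hypothesis:", " ".join(words))
```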
Abstract:
An Automatic Speech Recognition system can include a task called phonetic classification, in which, given a speech sample, the system decides which phoneme was uttered by a speaker. To ease classification and highlight the most distinctive characteristics of the phonemes, speech samples are usually pre-processed by a front-end. A front-end typically extracts a set of parameters from each speech sample. After this processing, these parameters are fed into a (previously trained) classification algorithm that attempts to decide which phoneme was uttered. There is a tendency for classification accuracy to improve as more parameters are used, at the price of a higher computational cost. Feature selection techniques identify the most relevant (or most used) parameters in a classification task, making it possible to discover redundant parameters that contribute little (or nothing) to classification. This work applies the SVM classifier to phonetic classification on the TIMIT database and identifies the most relevant parameters by applying a Boosting-based feature selection technique.
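A minimal sketch of the pipeline this abstract describes, built with scikit-learn: a boosting model ranks the relevance of each front-end parameter, the low-ranked ones are dropped, and an SVM is trained on the remainder. The data here are random placeholders standing in for TIMIT frames; the actual features, thresholds, and boosting variant used in the work may differ.

```python
# Boosting-based feature selection followed by SVM phonetic classification.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 39))     # placeholder: 39 front-end parameters per frame
y = rng.integers(0, 10, size=2000)  # placeholder: 10 phoneme classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Boosting assigns an importance score to every input parameter.
booster = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
selector = SelectFromModel(booster, prefit=True, threshold="median")

# 2) The SVM is trained only on the parameters the booster found relevant.
svm = SVC(kernel="rbf").fit(selector.transform(X_tr), y_tr)
print("kept parameters:", selector.get_support().sum())
print("accuracy:", svm.score(selector.transform(X_te), y_te))
```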
Abstract:
Automatic speech recognition has become increasingly useful and feasible. For languages such as English, excellent recognizers are available on the market. The situation is different for Brazilian Portuguese, for which the main desktop dictation recognizers that once existed have been discontinued. This dissertation is aligned with the goals of the Signal Processing Laboratory (LaPS) at the Federal University of Pará: the development of an automatic speech recognizer for Brazilian Portuguese. More specifically, its main contributions are the development of several resources needed to build a recognizer, such as transcribed audio corpora and an API for application development, and the development of two applications: one for desktop dictation and another for automated call center service. Coruja, the system developed at LaPS for Brazilian Portuguese speech recognition, contains all the resources needed to provide speech recognition for Brazilian Portuguese and also offers an API for application development. The desktop dictation and text editing application is SpeechOO, which enables dictation in the Writer tool of the LibreOffice suite and also supports editing and formatting text with voice commands. Another contribution of this work is the use of automatic speech recognition in call centers: Coruja was integrated with the Asterisk software, and the main application developed was a voice-enabled interactive voice response unit for a national call center that handles more than three thousand calls a day.
Abstract:
The aim of this study was to assess the cleaning capacity of the ProTaper system using motor-driven or manual instrumentation. Materials and Methods: Ten mandibular molars were randomly separated into 2 groups (n = 5) according to the type of instrumentation performed, as follows: Group 1 - instrumentation with rotary nickel-titanium (Ni-Ti) files using the ProTaper Universal System (Dentsply/Maillefer); and Group 2 - instrumentation with Ni-Ti hand files using ProTaper Universal (Dentsply/Maillefer). Afterwards, the teeth were sectioned transversely and submitted to histotechnical processing to obtain histological sections for microscopic evaluation. The images were analyzed with the Corel Photo-Paint X5 program (Corel Corporation) using an integration grid superimposed on the image. Results: Statistical analysis (Mann-Whitney U test, P < 0.05) demonstrated that Group 1 presented a higher cleaning capacity than Group 2. Conclusions: The rotary technique presented better cleaning results in the apical third of the root canal system than the manual technique.
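The reported comparison boils down to a two-sample Mann-Whitney U test at P < 0.05; a sketch with invented placeholder scores, since the study's actual debris measurements are not given in the abstract:

```python
# Mann-Whitney U test between the cleanliness scores of the two groups
# (n = 5 per group, as in the study); all values are hypothetical.
from scipy.stats import mannwhitneyu

group1_rotary = [12.1, 10.4, 9.8, 11.3, 10.9]   # hypothetical % remaining debris
group2_manual = [18.7, 16.2, 19.5, 17.1, 20.3]  # hypothetical % remaining debris

stat, p = mannwhitneyu(group1_rotary, group2_manual, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")  # p < 0.05 would match the reported difference
```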
Abstract:
To report the audiological outcomes of cochlear implantation in two patients with severe to profound sensorineural hearing loss secondary to superficial siderosis of the CNS, and to discuss some programming peculiarities found in these cases. Retrospective review. Data concerning clinical presentation, diagnosis, and pre- and post-implantation audiological assessment were collected for two patients with superficial siderosis of the CNS. Both patients showed good hearing thresholds but variable speech perception outcomes. One patient did not achieve open-set speech recognition, but the other achieved 70% speech recognition in quiet. Electrical compound action potentials could not be elicited in either patient. Map parameters showed the need for increased charge. Electrode impedances showed high longitudinal variability. The implants were fairly beneficial in restoring hearing and improving communication abilities, although many reprogramming sessions were required. The hurdle in programming was the need for frequent adjustments due to physiologic variations in electrical discharges and neural conduction, besides the changes in impedances. Patients diagnosed with superficial siderosis may achieve limited speech perception scores for both cochlear and retrocochlear reasons. Careful counseling about the expected results must be given to patients and their families before cochlear implantation is indicated.
Abstract:
The human eye is sensitive to visible light. Increasing illumination on the eye causes the pupil to contract, while decreasing illumination causes the pupil to dilate. Visible light causes specular reflections inside the iris ring. On the other hand, the human retina is less sensitive to near-infrared (NIR) radiation in the wavelength range from 800 nm to 1400 nm, but iris detail can still be imaged under NIR illumination. In order to measure the dynamic movement of the human pupil and iris while keeping light-induced reflexes from affecting the quality of the digitized image, this paper describes a device based on the consensual reflex, the biological phenomenon by which the two pupils contract and dilate synchronously when one of the eyes is illuminated by visible light. We propose to capture images of the pupil of one eye under NIR illumination while illuminating the other eye with a visible-light pulse. This new approach extracts iris features called "dynamic features (DFs)." The methodology extracts information about the way the human eye reacts to light and uses that information for biometric recognition purposes. The results demonstrate that these features are discriminative: even using the Euclidean distance measure, an average recognition accuracy of 99.1% was obtained. The proposed methodology has the potential to be "fraud-proof," because DFs can only be extracted from living irises.
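The matching step mentioned at the end (Euclidean distance over dynamic-feature vectors) is simple enough to sketch; the vectors below are invented placeholders, since the actual DF extraction is not described in the abstract:

```python
# Nearest-template identification with plain Euclidean distance, as used
# for the "dynamic features" above. Feature extraction is omitted.
import numpy as np

def identify(probe, templates):
    """Return the enrolled identity whose template is nearest to the probe."""
    distances = {who: np.linalg.norm(probe - t) for who, t in templates.items()}
    return min(distances, key=distances.get)

templates = {
    "subject_a": np.array([0.82, 0.31, 1.45, 0.12]),  # hypothetical DF vectors
    "subject_b": np.array([0.40, 0.77, 0.98, 0.55]),
}
probe = np.array([0.80, 0.35, 1.40, 0.15])
print(identify(probe, templates))  # -> subject_a
```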
Abstract:
Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, usual text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, "video words" based on low-level color features (color moments, color correlogram, and color wavelet), and "audio words" based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification, and classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94%, corresponding to 50%-84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio, and video.
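A minimal sketch of the uniform bag-of-words representation this abstract describes: token streams (syllables, video words, audio words) become count vectors, and a linear SVM is trained one-vs-rest. The token streams and categories below are invented placeholders.

```python
# Bag-of-words over "word analogues" plus a one-vs-rest linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Each document: its syllable/video/audio "words" joined into one string.
docs = [
    "syl_ta syl_ge vid_17 vid_03 aud_21",  # hypothetical token streams
    "syl_po syl_li vid_44 vid_03 aud_09",
    "syl_ta syl_po vid_17 vid_44 aud_21",
]
labels = ["politics", "sports", "politics"]  # IPTC-style categories

X = CountVectorizer().fit_transform(docs)   # word-analogue frequencies
clf = OneVsRestClassifier(LinearSVC()).fit(X, labels)
print(clf.predict(X[:1]))
```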
Abstract:
(31)P MRS magnetization transfer ((31)P-MT) experiments allow the estimation of exchange rates of biochemical reactions, such as the creatine kinase equilibrium and adenosine triphosphate (ATP) synthesis. Although various (31)P-MT methods have been successfully used on isolated organs or animals, their application to humans in clinical scanners poses specific challenges. This study compared two major (31)P-MT methods on a clinical MR system using heteronuclear surface coils. Although saturation transfer (ST) is the most commonly used (31)P-MT method, sequences such as inversion transfer (IT) with short pulses might be better suited to the specific hardware and software limitations of a clinical scanner. In addition, small NMR-undetectable metabolite pools can transfer magnetization to NMR-visible pools during long saturation pulses, which is prevented with short pulses. (31)P-MT sequences were adapted for limited pulse length, for heteronuclear transmit-receive surface coils with inhomogeneous B1, for the need for volume selection, and for the inherently low signal-to-noise ratio (SNR) of a clinical 3-T MR system. The ST and IT sequences were applied to skeletal muscle and liver in 10 healthy volunteers. Monte Carlo simulations were used to evaluate the behavior of the IT measurements with increasing imperfections. In skeletal muscle of the thigh, ATP synthesis resulted in forward reaction constants (k) of 0.074 ± 0.022 s(-1) (ST) and 0.137 ± 0.042 s(-1) (IT), whereas the creatine kinase reaction yielded 0.459 ± 0.089 s(-1) (IT). In the liver, ATP synthesis resulted in k = 0.267 ± 0.106 s(-1) (ST), whereas the IT experiment yielded no consistent results. ST results were close to literature values; however, the IT results were either much larger than the corresponding ST values or widely scattered. To summarize, both ST and IT experiments can be implemented on a clinical body scanner with heteronuclear transmit-receive surface coils; however, ST results are much more robust against experimental imperfections than the current implementation of IT.
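For context, forward rate constants such as those quoted above are conventionally derived from the Forsén-Hoffman saturation-transfer relation; the following is the standard formulation, stated as an assumption since the abstract does not spell out the analysis used:

```latex
% Conventional saturation-transfer analysis (Forsén--Hoffman), assumed here:
% saturating the $\gamma$-ATP resonance drives the P$_i$ magnetization to a
% steady state $M_{ss}$, from which the forward rate constant follows as
\[
  k \;=\; \frac{1}{T_{1,\mathrm{app}}}\left(1 - \frac{M_{ss}}{M_0}\right),
\]
% where $M_0$ is the equilibrium P$_i$ magnetization and $T_{1,\mathrm{app}}$
% is the apparent longitudinal relaxation time measured during saturation.
```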
Abstract:
OBJECTIVE To evaluate speech intelligibility in noise with a new cochlear implant (CI) processor that uses a directional microphone system imitating the pinna effect. STUDY DESIGN Prospective experimental study. SETTING Tertiary referral center. PATIENTS Ten experienced, unilateral CI recipients with bilateral severe-to-profound hearing loss. INTERVENTION All participants performed speech-in-noise tests with the Opus 2 processor (omnidirectional microphone mode only) and the newer Sonnet processor (omnidirectional and directional microphone modes). MAIN OUTCOME MEASURE The speech reception threshold (SRT) in noise was measured in four spatial settings. The test sentences were always presented from the front. The noise arrived either from the front (S0N0), the ipsilateral side of the CI (S0NIL), the contralateral side of the CI (S0NCL), or the back (S0N180). RESULTS The directional mode improved the SRTs by 3.6 dB (p < 0.01), 2.2 dB (p < 0.01), and 1.3 dB (p < 0.05) in the S0N180, S0NIL, and S0NCL situations, respectively, compared with the Sonnet in omnidirectional mode. There was no statistically significant difference in the S0N0 situation. No differences between the Opus 2 and the Sonnet in omnidirectional mode were observed. CONCLUSION Speech intelligibility with the Sonnet system was statistically different from speech recognition with the Opus 2 system, suggesting that CI users might profit from the pinna-imitating directional mode in noisy environments.
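A sketch of how one per-condition comparison could be run: paired SRTs for the ten users, omnidirectional versus directional mode, compared with a Wilcoxon signed-rank test. The abstract does not state which test the study actually used, and all values below are invented placeholders.

```python
# Paired comparison of SRTs (dB SNR; lower is better) in one spatial setting.
import statistics
from scipy.stats import wilcoxon

srt_omni = [2.1, 1.4, 3.0, 2.6, 1.9, 2.8, 3.3, 2.2, 1.7, 2.5]
srt_dir = [-1.3, -2.0, -0.4, -1.1, -1.8, -0.6, 0.1, -1.5, -2.2, -0.9]

improvement = [o - d for o, d in zip(srt_omni, srt_dir)]  # dB gained by directional mode
stat, p = wilcoxon(srt_omni, srt_dir)
print(f"median improvement: {statistics.median(improvement):.1f} dB, p = {p:.4f}")
```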
Abstract:
Although there has been a lot of interest in recognizing and understanding air traffic control (ATC) speech, none of the published works have reported detailed field-data results. We have developed a system able to identify the language spoken and to recognize and understand sentences in both Spanish and English, and we present field results for several in-tower controller positions. To the best of our knowledge, this is the first time that field ATC speech (not simulated) has been captured, processed, and analyzed. The use of stochastic grammars allows for the variations on standard phraseology that appear in field data. The robust understanding algorithm developed achieves 95% concept accuracy from ATC text input. It also allows changes in the presentation order of the concepts and corrects errors created by the speech recognition engine, improving the percentage of fully correctly understood sentences by 17% (English) and 25% (Spanish) absolute, relative to the percentage of fully correctly recognized sentences. An analysis of errors due to the spontaneity of the speech, and a comparison with read speech, is also carried out. A 96% word accuracy for read speech drops to 86% word accuracy on field ATC data for Spanish on the "clearances" task, confirming that field data are needed to estimate the performance of such a system. A literature review and a critical discussion of the possibilities of speech recognition and understanding technology applied to ATC speech are also given.
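The "concept accuracy" figure quoted above is conventionally computed like word accuracy, but over concept sequences: one minus the Levenshtein error rate against the reference concepts. The sketch below follows that standard formulation, which is an assumption; the paper may differ in details.

```python
# Concept accuracy: 1 - (substitutions + deletions + insertions) / |reference|.
def concept_accuracy(reference, hypothesis):
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 1.0 - d[n][m] / n

ref = ["CALLSIGN", "CLEARED_TO_LAND", "RUNWAY_32L"]  # hypothetical concepts
hyp = ["CALLSIGN", "CLEARED_TO_LAND", "RUNWAY_14R"]
print(f"concept accuracy: {concept_accuracy(ref, hyp):.2f}")  # -> 0.67
```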
Abstract:
This article evaluates a gesture-based authentication technique for mobile devices. Users create a memorable identifying gesture to serve as their in-air signature. This work analyzes a database of 120 gestures of varying vulnerability, obtaining an Equal Error Rate (EER) of 9.19% when the robustness of gestures is not verified. Most of the errors behind this EER come from very simple and easily forgeable gestures that should be discarded at the enrollment phase. Therefore, an in-air signature robustness verification system using Linear Discriminant Analysis is proposed to infer automatically whether a gesture is secure or not. Different configurations were tested, obtaining a lowest EER of 4.01% when 45.02% of gestures were discarded, and an optimal compromise EER of 4.82% when 19.19% of gestures were automatically rejected.
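The Equal Error Rates reported here can be computed from genuine and impostor matching scores as the operating point where the false-accept and false-reject rates meet; a sketch with synthetic scores (in the paper they would come from comparing in-air signature attempts against enrolled ones):

```python
# Equal Error Rate from score distributions.
import numpy as np

def eer(genuine, impostor):
    """Operating point where false-accept and false-reject rates meet."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([np.mean(impostor >= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(genuine < t) for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(1)
genuine = rng.normal(2.0, 1.0, 500)    # higher score = better match (synthetic)
impostor = rng.normal(0.0, 1.0, 5000)
print(f"EER ~ {eer(genuine, impostor):.2%}")
```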
Abstract:
This paper proposes an architecture, based on statistical machine translation, for developing the text normalization module of a text-to-speech conversion system. The main target is a language-independent, data-driven text normalization module flexible enough to deal with all the situations presented in this task. The proposed architecture is composed of three main modules: a tokenizer module for splitting the text input into a token graph (tokenization), a phrase-based translation module (token translation), and a post-processing module for removing some tokens. This paper presents initial experiments for numbers and abbreviations. The very good results obtained validate the proposed architecture.
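A toy sketch of the three-module pipeline (tokenization, token translation, post-processing). A real system would use a phrase-based SMT model learned from data; the lookup table below merely stands in for it, and all tokens are invented examples.

```python
# Toy text-normalization pipeline: tokenize -> translate -> post-process.
import re

def tokenize(text):
    """Split input into number, word/abbreviation, and punctuation tokens."""
    return re.findall(r"\d+|\w+\.?|[^\w\s]", text)

TRANSLATION_TABLE = {  # stand-in for the learned phrase-based translation model
    "Dr.": "doctor",
    "3": "three",
    "kg": "kilograms",
    "-": "",  # tokens translated to "" are dropped in post-processing
}

def translate(tokens):
    return [TRANSLATION_TABLE.get(t, t) for t in tokens]

def postprocess(tokens):
    """The post-processing module removes some tokens (here, the empty ones)."""
    return " ".join(t for t in tokens if t)

print(postprocess(translate(tokenize("Dr. Smith - bought 3 kg"))))
# -> doctor Smith bought three kilograms
```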
Abstract:
We present an approach to dynamically adapt the language models (LMs) used by a speech recognizer that is part of a spoken dialogue system. We have developed a grammar generation strategy that automatically adapts the LMs using the semantic information that the user provides (represented as dialogue concepts), together with information regarding the intentions of the speaker (inferred by the dialogue manager and represented as dialogue goals). We carry out the adaptation as a linear interpolation between a background LM and one or more of the LMs associated with the dialogue elements (concepts or goals) addressed by the user. The interpolation weights between those models are automatically estimated on each dialogue turn, using measures such as the posterior probabilities of concepts and goals, estimated as part of the inference procedure that determines the actions to be carried out. We propose two approaches to handle the LMs related to concepts and goals: in the first, we estimate one LM for each of them; in the second, we apply several clustering strategies to group together those elements that share common properties and estimate one LM per cluster. Our evaluation shows how the system can estimate a dynamic model adapted to each dialogue turn, which improves the performance of speech recognition (up to 14.82% relative improvement), which in turn improves both the language understanding and dialogue management tasks.
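A sketch of the per-turn linear interpolation described above, with a fixed background weight for simplicity (the paper estimates the weights dynamically from concept/goal posteriors); all probability tables and posteriors below are invented placeholders.

```python
# Linear interpolation of a background LM with concept/goal LMs,
# with weights proportional to the elements' posterior probabilities.
def interpolate(word, background_lm, element_lms, posteriors, bg_weight=0.5):
    """P(w) = bg_weight * P_bg(w) + sum_i w_i * P_i(w), weights from posteriors."""
    total = sum(posteriors.values()) or 1.0
    prob = bg_weight * background_lm.get(word, 1e-6)
    for element, post in posteriors.items():
        weight = (1.0 - bg_weight) * post / total  # posterior-proportional share
        prob += weight * element_lms[element].get(word, 1e-6)
    return prob

background = {"the": 0.05, "flight": 0.001, "ticket": 0.001}
element_lms = {
    "concept_DEPARTURE": {"flight": 0.03, "from": 0.02},  # hypothetical unigram LMs
    "goal_BOOKING": {"ticket": 0.04, "book": 0.03},
}
posteriors = {"concept_DEPARTURE": 0.6, "goal_BOOKING": 0.3}  # from the dialogue manager
print(interpolate("flight", background, element_lms, posteriors))
```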
Abstract:
MFCC coefficients, extracted from the power spectral density of speech as a whole, seem to have become the de facto standard in the area of speaker recognition, as demonstrated by their use in almost all systems submitted to the 2013 Speaker Recognition Evaluation (SRE) in Mobile Environment [1], thus relegating this component of recognition systems to the background. However, in this article we show that selecting an adequate speaker characterization system is as important as the selection of the classifier. To demonstrate this, we compare the recognition rates achieved by different recognition systems that rely on the same classifier (GMM-UBM) but are connected to different feature extraction systems (based on both classical and biometric parameters). As a result, we show that a gender-dependent biometric parameterization with a simple recognition system based on the GMM-UBM paradigm provides very competitive or even better recognition rates when compared with more complex classification systems based on classical features.
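A minimal sketch of the GMM-UBM scoring scheme named above: a universal background model is fit on pooled feature frames, a per-speaker model stands in for the MAP-adapted GMM of a real system (which would adapt the UBM rather than train from scratch), and a trial is scored by the average log-likelihood ratio. All data are random placeholders standing in for MFCC frames.

```python
# GMM-UBM speaker verification scoring (simplified).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm_frames = rng.normal(0.0, 1.0, (5000, 13))   # placeholder pooled MFCC frames
speaker_frames = rng.normal(0.5, 1.0, (800, 13))  # placeholder enrollment frames
test_frames = rng.normal(0.5, 1.0, (200, 13))     # placeholder trial frames

ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(ubm_frames)
speaker_gmm = GaussianMixture(n_components=8, covariance_type="diag",
                              random_state=0).fit(speaker_frames)

# Average per-frame log-likelihood ratio: > 0 favours the claimed speaker.
llr = speaker_gmm.score(test_frames) - ubm.score(test_frames)
print(f"log-likelihood ratio: {llr:.3f}")
```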