951 results for Speech Recognition Systems
Abstract:
This paper predicts speech synthesis, speech recognition, and speaker recognition technology for the year 2001, and it describes the most important research problems to be solved in order to arrive at these ultimate synthesis and recognition systems. The problems for speech synthesis include natural and intelligible voice production, prosody control based on meaning, capability of controlling synthesized voice quality and choosing individual speaking style, multilingual and multidialectal synthesis, choice of application-oriented speaking styles, capability of adding emotion, and synthesis from concepts. The problems for speech recognition include robust recognition against speech variations, adaptation/normalization to variations due to environmental conditions and speakers, automatic knowledge acquisition for acoustic and linguistic modeling, spontaneous speech recognition, naturalness and ease of human-machine interaction, and recognition of emotion. The problems for speaker recognition are similar to those for speech recognition. The research topics related to all these techniques include the use of articulatory and perceptual constraints and evaluation methods for measuring the quality of technology and systems.
Abstract:
Research in speech recognition and synthesis over the past several decades has brought speech technology to a point where it is being used in "real-world" applications. However, despite the progress, the perception remains that the current technology is not flexible enough to allow easy voice communication with machines. The focus of speech research is now on producing systems that are accurate and robust but that do not impose unnecessary constraints on the user. This chapter takes a critical look at the shortcomings of the current speech recognition and synthesis algorithms, discusses the technical challenges facing research, and examines the new directions that research in speech recognition and synthesis must take in order to form the basis of new solutions suitable for supporting a wide range of applications.
Abstract:
These are the full proceedings of the conference.
Abstract:
In this report we summarize the state of the art of speech emotion recognition from the signal processing point of view. On the basis of multi-corpus experiments with machine-learning classifiers, we observe that existing approaches to supervised machine learning lead to database-dependent classifiers, which cannot be applied to multi-language speech emotion recognition without additional training because they discriminate the emotion classes according to the training language used. As there are experimental results showing that humans can perform language-independent categorisation, we drew a parallel between machine recognition and the cognitive process and tried to discover the sources of these divergent results. The analysis suggests that the main difference is that speech perception allows extraction of language-independent features, although language-dependent features are incorporated at all levels of the speech signal and play a strong discriminative role in human perception. Based on several results in related domains, we suggest that, in addition, the cognitive process of emotion recognition is based on categorisation, assisted by some hierarchical structure of the emotional categories that exists in the cognitive space of all humans. We propose a strategy for developing language-independent machine emotion recognition, related to the identification of language-independent speech features and the use of additional information from visual (expression) features.
Abstract:
We propose a study of the mathematical properties of voice as an audio signal. This work includes signals in which the channel conditions are not ideal for emotion recognition. Multiresolution analysis (the discrete wavelet transform) was performed using the Daubechies wavelet family (Db1-Haar, Db6, Db8, Db10), decomposing the initial audio signal into sets of coefficients from which a set of features was extracted and analyzed statistically in order to differentiate emotional states. Artificial neural networks (ANNs) proved able to classify these states appropriately. This study shows that the features extracted using wavelet decomposition are enough to analyze and extract emotional content in audio signals, giving a high accuracy rate in the classification of emotional states without the need for other kinds of classical time-frequency features. Accordingly, this paper seeks to characterize mathematically the six basic emotions in humans (boredom, disgust, happiness, anxiety, anger and sadness), plus neutrality, for a total of seven states to identify.
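The wavelet feature extraction described in this abstract can be illustrated with a minimal sketch. It assumes the simplest member of the family (Db1-Haar), three decomposition levels, and three statistics per sub-band (mean, standard deviation, energy). The function names `haar_dwt` and `wavelet_features` and the choice of statistics are illustrative assumptions, not the authors' actual feature set.

```python
import math
import statistics


def haar_dwt(signal):
    """One level of the Haar (Db1) DWT: returns (approximation, detail)."""
    samples = list(signal)
    if len(samples) % 2:
        samples.append(samples[-1])  # pad odd-length input by repeating the last sample
    s = math.sqrt(2.0)
    approx = [(samples[i] + samples[i + 1]) / s for i in range(0, len(samples), 2)]
    detail = [(samples[i] - samples[i + 1]) / s for i in range(0, len(samples), 2)]
    return approx, detail


def wavelet_features(signal, levels=3):
    """Mean, standard deviation and energy of each sub-band (assumed statistics).

    Sub-bands are ordered: detail level 1, ..., detail level `levels`,
    then the final approximation.
    """
    bands = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_dwt(current)
        bands.append(detail)
    bands.append(current)

    features = []
    for band in bands:
        energy = sum(x * x for x in band)
        features.extend([statistics.fmean(band), statistics.pstdev(band), energy])
    return features


if __name__ == "__main__":
    print(wavelet_features([1, 2, 3, 4, 5, 6, 7, 8]))
```

Because the Haar transform is orthonormal, the sub-band energies of an even-length signal sum to the energy of the original signal, which gives a quick sanity check. A feature vector of this kind would then be passed to a classifier such as the ANN mentioned in the abstract.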
Abstract:
We propose a novel analysis alternative, based on two Fourier transforms, for emotion recognition from speech. Fourier analysis allows different signals to be displayed and synthesized in terms of power spectral density distributions. A spectrogram of the voice signal is obtained by performing a short-time Fourier transform with Gaussian windows; this spectrogram portrays frequency-related features, such as vocal tract resonances and quasi-periodic excitations during voiced sounds. Emotions induce such characteristics in speech, which become apparent in the spectrogram's time-frequency distributions. The time-frequency representation from the spectrogram is then treated as an image and processed through a two-dimensional Fourier transform in order to perform spatial Fourier analysis on it. Finally, features related to emotions in voiced speech are extracted and presented.
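The two-stage Fourier analysis in this abstract can be sketched as follows: a Gaussian-windowed short-time Fourier transform produces a power spectrogram, and a two-dimensional DFT is then applied to that spectrogram as if it were an image. The frame length, hop size, window width (`sigma_ratio`) and all function names are assumptions chosen for illustration. A naive DFT is used to keep the sketch dependency-free, so it only suits short signals.

```python
import cmath
import math


def gaussian_window(n, sigma_ratio=0.4):
    """Gaussian window of length n >= 2; sigma_ratio is an assumed width parameter."""
    centre = (n - 1) / 2.0
    sigma = sigma_ratio * centre
    return [math.exp(-0.5 * ((i - centre) / sigma) ** 2) for i in range(n)]


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def spectrogram(signal, frame_len=16, hop=8):
    """Power spectrogram: rows are time frames, columns are frequency bins."""
    window = gaussian_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = dft(frame)
        # Keep the non-negative frequencies only (real input signal).
        frames.append([abs(c) ** 2 for c in spectrum[: frame_len // 2 + 1]])
    return frames


def dft2(image):
    """2-D DFT of a rectangular list-of-lists, applied row-wise then column-wise."""
    rows = [dft(r) for r in image]
    n_rows, n_cols = len(rows), len(rows[0])
    cols = [dft([rows[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    return [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 4 * t / 16) for t in range(64)]
    spec = spectrogram(tone)
    spatial = dft2(spec)
    print(len(spec), len(spec[0]), abs(spatial[0][0]))
```

Features could then be read from the magnitude of the 2-D transform; which coefficients carry emotional information is not stated in the abstract, so that step is left out here.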
Abstract:
The differences in spectral shape resolution abilities among cochlear implant (CI) listeners, and between CI and normal-hearing (NH) listeners, when listening with the same number of channels (12), were investigated. In addition, the effect of the number of channels on spectral shape resolution was examined. The stimuli were rippled noise signals with various ripple frequency spacings. An adaptive 4IFC procedure was used to determine the threshold for resolvable ripple spacing, which was the spacing at which an interchange in peak and valley positions could be discriminated. The results showed poorer spectral shape resolution in CI compared to NH listeners (average thresholds of approximately 3000 and 400 Hz, respectively), and wide variability among CI listeners (range of approximately 800 to 8000 Hz). There was a significant relationship between spectral shape resolution and vowel recognition. The spectral shape resolution thresholds of NH listeners increased as the number of channels increased from 1 to 16, while the CI listeners showed a performance plateau at 4–6 channels, which is consistent with previous results using speech recognition measures. These results indicate that this test may provide a measure of CI performance which is time efficient and non-linguistic, and therefore, if verified, may provide a useful contribution to the prediction of speech perception in adults and children who use CIs.
Abstract:
The evolution of a positive genetic correlation between male and female components of mate recognition systems will result as a consequence of assortative mating and, in particular, is central to a number of theories of sexual selection. Although the existence of such genetic correlations has been investigated in a number of taxa, it has yet to be shown that such correlations evolve and whether they may evolve as rapidly as suggested by sexual selection models. In this study, I used a hybridization experiment to disrupt natural mate recognition systems and then observed the subsequent evolutionary dynamics of the genetic correlation between male and female components for 56 generations in hybrids between Drosophila serrata and Drosophila birchii. The genetic correlation between male and female components evolved from 0.388 at generation 5 to 1.017 at generation 37 and then declined to -0.040 after a further 19 generations. These results indicated that the genetic basis of the mate recognition system in the hybrid populations evolved rapidly. The initial rapid increase in the genetic correlation was consistent with the classic assumption that male and female components will coevolve under sexual selection. The subsequent decline in genetic correlation may be attributable to the fixation of major genes or, alternatively, may be a result of a cyclic evolutionary change in mate recognition.
Abstract:
Audiometry is the main means of evaluating hearing, because it is a universal and standardized test. Speech tests are difficult to standardize because of the variables involved, but their performance in the presence of competitive noise is of great importance. Aim: To characterize speech intelligibility in silence and in competitive noise in individuals exposed to electronically amplified music. Material and Method: The study was performed with 20 university students who presented normal hearing thresholds. The speech recognition rate (SRR) was measured after fourteen hours of sound rest after the exposure to electronically amplified music and once again after sound rest, and was studied in three stages: without competitive noise, in the presence of babble-type competitive noise, in monotic listening, at a signal/noise ratio of +5 dB and at a signal/noise ratio of 5 dB. Results: The SRR was more impaired after exposure to the music and with competitive noise, and as the signal/noise ratio decreased, the performance of individuals in the test also decreased. Conclusion: The inclusion of competitive noise in the speech tests of the audiological routine is important, because it represents the real disadvantage experienced by individuals in daily listening.
Abstract:
As a natural form of human-machine interaction, gesture recognition involves a strong research component in areas such as computer vision and machine learning. Gesture recognition is an area with very diverse applications, giving users a more natural and simpler way to communicate with computer-based systems without the need for extra devices. The main goal of research in gesture recognition applied to human-machine interaction is therefore the creation of systems that can identify specific gestures and use them to convey information or to control devices. To this end, vision-based interfaces for gesture recognition need to detect the hand quickly and robustly and be able to recognize gestures in real time. Today, vision-based gesture recognition systems are able to work with specific solutions, built to solve a particular problem and configured to work in a particular way. This research project studied and implemented sufficiently generic solutions, using machine learning algorithms, that can be applied to a wide range of human-machine interface systems for real-time gesture recognition. The proposed solution, Gesture Learning Module Architecture (GeLMA), makes it simple to define a set of commands that can be based on static and dynamic gestures and that can be easily integrated and configured for use in a range of applications. It is a low-cost system that is easy to train and use, and it is built solely with code libraries.
The experiments showed that the system achieved an accuracy of 99.2% for static gesture recognition and an average accuracy of 93.7% for dynamic gesture recognition. To validate the proposed solution, two complete systems were implemented. The first is a real-time system able to help a referee officiate a robotic soccer game. The proposed solution combines a vision-based gesture recognition system with the definition of a formal language, the CommLang Referee, which we named the Referee Command Language Interface System (ReCLIS). The system identifies commands based on a set of static and dynamic gestures performed by the referee, which are then sent to a computer interface that transmits the corresponding information to the robots. The second is a real-time system able to interpret a subset of Portuguese Sign Language (Linguagem Gestual Portuguesa). The experiments demonstrated that the system was able to recognize the vowels reliably in real time. Although the implemented solution was trained to recognize only the five vowels, the system is easily extended to recognize the rest of the alphabet. The experiments also showed that the core of vision-based interaction systems can be the same for all applications, which makes them easier to implement. The proposed solution also has the advantage of being sufficiently generic and a solid foundation for the development of gesture recognition systems that can be easily integrated with any human-machine interface application. The formal interface definition language can be redefined, and the system can be easily configured and trained with a different set of gestures for integration into the final solution.
Abstract:
Master's degree in Computer Engineering, specialization in Knowledge and Decision Technologies
Abstract:
Human Activity Recognition systems require objective and reliable methods that can be used in the daily routine and must offer results consistent with the activities performed. These systems are under development and offer objective and personalized support for several applications, such as healthcare. This thesis aims to create a framework for human activity recognition based on accelerometry signals. Some new features and techniques inspired by audio recognition methodology are introduced in this work, namely Log Scale Power Bandwidth and the application of Markov Models. Forward Feature Selection was adopted as the feature selection algorithm in order to improve the clustering performance and limit the computational demands. This method selects the most suitable set of features for activity recognition in accelerometry from a 423-dimensional feature vector. Several machine learning algorithms were applied to the accelerometry databases used (the FCHA and PAMAP databases), and these showed promising results in activity recognition. The developed algorithm set constitutes a substantial contribution to the development of reliable evaluation methods of movement disorders for diagnosis and treatment applications.
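Forward Feature Selection, as named in this abstract, is a greedy procedure: start with no features, repeatedly add the single feature that most improves a score, and stop when no candidate helps. The sketch below shows that loop under stated assumptions. The scoring function (training accuracy of a nearest-centroid classifier) and the names `nearest_centroid_accuracy` and `forward_feature_selection` are hypothetical stand-ins, since the abstract does not say which criterion the thesis used.

```python
def nearest_centroid_accuracy(X, y, feature_idx):
    """Training accuracy of a nearest-centroid classifier on the chosen features.

    This scoring function is an assumption made for illustration only.
    """
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in feature_idx]

    correct = 0
    for x, lab in zip(X, y):
        def squared_distance(label):
            return sum((x[j] - c) ** 2 for j, c in zip(feature_idx, centroids[label]))

        predicted = min(sorted(centroids), key=squared_distance)
        correct += predicted == lab
    return correct / len(X)


def forward_feature_selection(X, y, score=nearest_centroid_accuracy, max_features=None):
    """Greedily add the feature that most improves `score`; stop when none does."""
    n_features = len(X[0])
    limit = max_features or n_features
    selected, best_score = [], 0.0
    while len(selected) < limit:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        scored = [(score(X, y, selected + [j]), j) for j in candidates]
        # Prefer the highest score; break ties by the lowest feature index.
        top_score, top_j = max(scored, key=lambda t: (t[0], -t[1]))
        if top_score <= best_score:
            break
        selected.append(top_j)
        best_score = top_score
    return selected, best_score


if __name__ == "__main__":
    X = [
        [0.3, 0.0, 5.0], [0.9, 0.1, 1.0], [0.1, 0.2, 3.0],
        [0.8, 1.0, 4.9], [0.2, 1.1, 1.1], [0.4, 0.9, 3.1],
    ]
    y = [0, 0, 0, 1, 1, 1]
    print(forward_feature_selection(X, y))
```

In the toy data only the middle feature separates the two classes, so the loop selects it and then stops because no further feature improves the score. In practice the score would be computed on held-out data rather than on the training set.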
Abstract:
Doctoral thesis in Electronics and Computer Engineering
Abstract:
Recognition systems play a key role in a range of biological processes, including mate choice, immune defence and altruistic behaviour. Social insects provide an excellent model for studying recognition systems because workers need to discriminate between nestmates and non-nestmates, enabling them to direct altruistic behaviour towards closer kin and to repel potential invaders. However, the level of aggression directed towards conspecific intruders can vary enormously, even among workers within the same colony. This is usually attributed to differences in the aggression thresholds of individuals or to workers having different roles within the colony. Recent evidence from the weaver ant Oecophylla smaragdina suggests that this does not tell the whole story. Here I propose a new model for nestmate recognition based on a vector template derived from both the individual's innate odour and the shared colony odour. This model accounts for the recent findings concerning weaver ants, and also provides an alternative explanation for why the level of aggression expressed by a colony decreases as the diversity within the colony increases, even when odour is well-mixed. The model makes additional predictions that are easily tested, and represents a significant advance in our conceptualisation of recognition systems.