904 results for Audio-Visual Automatic Speech Recognition
Abstract:
Computer vision-based food recognition could be used to estimate a meal's carbohydrate content for diabetic patients. This study proposes a methodology for automatic food recognition based on the Bag of Features (BoF) model. An extensive technical investigation was conducted to identify and optimize the best-performing components of the BoF architecture and to estimate the corresponding parameters. For the design and evaluation of the prototype system, a visual dataset of nearly 5,000 food images was created and organized into 11 classes. The optimized system computes dense local features using the scale-invariant feature transform (SIFT) on the HSV color space, builds a visual dictionary of 10,000 visual words using hierarchical k-means clustering, and finally classifies the food images with a linear support vector machine classifier. The system achieved a classification accuracy of about 78%, demonstrating the feasibility of the proposed approach on a very challenging image dataset.
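A minimal sketch of such a pipeline in Python, assuming hypothetical `train_imgs`/`train_labels` collections; scikit-learn's MiniBatchKMeans stands in for the hierarchical k-means used in the paper:

```python
# Sketch of the BoF pipeline described above. Assumptions: train_imgs
# is a list of BGR images and train_labels their 11 class labels;
# MiniBatchKMeans stands in for the paper's hierarchical k-means.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

sift = cv2.SIFT_create()

def dense_sift_hsv(bgr_img, step=8, size=16):
    """Dense SIFT descriptors computed on each HSV channel."""
    hsv = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)
    h, w = hsv.shape[:2]
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(0, h, step) for x in range(0, w, step)]
    per_channel = [sift.compute(hsv[:, :, c], kps)[1] for c in range(3)]
    return np.vstack([d for d in per_channel if d is not None])

def bof_histogram(descriptors, kmeans):
    """Quantize descriptors into a normalized visual-word histogram."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-9)

all_desc = np.vstack([dense_sift_hsv(img) for img in train_imgs])
dictionary = MiniBatchKMeans(n_clusters=10000, batch_size=10000).fit(all_desc)
X = np.array([bof_histogram(dense_sift_hsv(img), dictionary) for img in train_imgs])
clf = LinearSVC().fit(X, train_labels)
```

An unseen image would be classified by computing the same `bof_histogram` representation and passing it to `clf.predict`.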
Abstract:
The main question addressed in this thesis is the improvement of automatic speaker recognition systems through the introduction of a new front-end module that we have called Gender-Dependent Extended Biometric Parameterisation (GDEBP). This front-end does not constitute a complete break with the classical parameterisation techniques used in speaker recognition, but rather a new way to obtain these parameters while introducing some complementary ones.
Specifically, we propose a gender-dependent parameterisation since, as is well known, male and female voices have different characteristics, and the use of different parameters to model these distinguishing characteristics should therefore provide a better characterisation of speakers. Additionally, we propose a new set of biometric parameters extracted from the components that result from the deconstruction of the voice into its glottal source estimate (closely related to the phonation process and the organs involved, and therefore to the physical characteristics of the speaker) and its vocal tract estimate (closely related to acoustic articulation and therefore to the spoken message). These biometric parameters complement the classical MFCCs extracted from the power spectral density of the speech signal as a whole. To check the validity of this proposal we establish different practical scenarios, using different databases, in order to verify that GDEBP generates a more accurate description of speakers than classical approaches based on gender-independent MFCCs. Specifically, we propose text-constrained and text-independent identification scenarios using the HESPERIA and ALBAYZIN databases. This work is also completed by participation in two international speaker recognition evaluations, NIST SRE (2010 and 2012) and MOBIO 2013, with diverse results. In the first case, due to the nature of the NIST databases, we obtained results close to the state of the art while confirming our hypothesis, whereas in the MOBIO evaluation our system obtained the best recognition rate for female speakers. Although the study of classification systems is beyond the scope of this thesis, we found it necessary to analyse the performance of different classification systems in order to verify their effect on the proposed parameterisation. In particular, we have addressed speaker recognition systems based on the GMM-UBM paradigm, supervectors and i-vectors. The results presented confirm that selecting a set of parameters that describes the speakers more accurately is at least as important as the choice of the classification method used by the biometric system. In this sense, the proposed parameterisation constitutes a step forward in improving speaker recognition systems by voice, since really competitive recognition rates are achieved even with relatively simple classification systems.
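A minimal sketch of what such a front-end might look like in Python, assuming librosa for the classical MFCC part; the glottal/vocal-tract deconstruction is not reproduced, so `glottal_feats` is a hypothetical per-frame matrix, and the 19-coefficient setting and 512-component UBMs are illustrative choices, not the thesis's actual configuration:

```python
# Gender-dependent front-end sketch in the spirit of GDEBP. The
# glottal/vocal-tract deconstruction is NOT reproduced here:
# glottal_feats is a hypothetical per-frame feature matrix.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def front_end(wav_path, glottal_feats=None, n_mfcc=19):
    """Classical MFCCs, optionally extended with biometric features."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # frames x coeffs
    if glottal_feats is not None:            # complementary biometric part
        mfcc = np.hstack([mfcc, glottal_feats])
    return mfcc

# One UBM per gender, each to be trained on gender-matched background data
ubm = {g: GaussianMixture(n_components=512, covariance_type='diag')
       for g in ('male', 'female')}

def verification_score(wav_path, gender, speaker_gmm):
    """Average log-likelihood ratio against the gender-matched UBM."""
    feats = front_end(wav_path)
    return speaker_gmm.score(feats) - ubm[gender].score(feats)
```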
Abstract:
This paper describes the GTH-UPM system for the Albayzin 2014 Search on Speech Evaluation. The evaluation task consists of searching for a list of terms/queries in audio files. The GTH-UPM system we present is based on an LVCSR (Large Vocabulary Continuous Speech Recognition) system. We used the MAVIR corpus and the Spanish partition of the EPPS (European Parliament Plenary Sessions) database to train both the acoustic and language models. The main effort was focused on lexicon preparation and text selection for language model construction. The system uses different lexicons and language models depending on the task being performed. For the best configuration of the system on the development set, we obtained a FOM of 75.27 for the keyword spotting task.
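For illustration only (the abstract does not detail the search component), matching the query list against time-stamped one-best LVCSR output might look like this sketch, where the `(word, start, end, score)` tuple format is an assumption:

```python
# Illustrative only, not the GTH-UPM system: search a term list in
# time-stamped 1-best recognizer output.
def search_terms(terms, hypotheses):
    """Return the detections of each query term in the recognized words."""
    hits = {t.lower(): [] for t in terms}
    for word, start, end, score in hypotheses:
        if word.lower() in hits:
            hits[word.lower()].append((start, end, score))
    return hits

hyps = [('Parlamento', 12.4, 13.1, 0.93), ('europeo', 13.1, 13.8, 0.88)]
print(search_terms(['parlamento', 'Europa'], hyps))
```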
Abstract:
This paper predicts speech synthesis, speech recognition, and speaker recognition technology for the year 2001, and it describes the most important research problems to be solved in order to arrive at these ultimate synthesis and recognition systems. The problems for speech synthesis include natural and intelligible voice production, prosody control based on meaning, capability of controlling synthesized voice quality and choosing individual speaking style, multilingual and multidialectal synthesis, choice of application-oriented speaking styles, capability of adding emotion, and synthesis from concepts. The problems for speech recognition include robust recognition against speech variations, adaptation/normalization to variations due to environmental conditions and speakers, automatic knowledge acquisition for acoustic and linguistic modeling, spontaneous speech recognition, naturalness and ease of human-machine interaction, and recognition of emotion. The problems for speaker recognition are similar to those for speech recognition. The research topics related to all these techniques include the use of articulatory and perceptual constraints and evaluation methods for measuring the quality of technology and systems.
Abstract:
The production and perception of music is a multimodal activity involving auditory, visual and conceptual processing, integrating these with prior knowledge and environmental experience. Musicians utilise expressive physical nuances to highlight salient features of the score. The question arises within the literature as to whether performers' non-technical, non-sound-producing movements may be communicatively meaningful and convey important structural information to audience members and co-performers. In the light of previous performance research (Vines et al., 2006; Wanderley, 2002; Davidson, 1993), and considering findings within co-speech gestural research and auditory and audio-visual neuroscience, this thesis examines the nature of those movements not directly necessary for the production of sound, and their particular influence on audience perception. Within the current research, 3D performance analysis is conducted using the Vicon 12-camera system and Nexus data-processing software. Performance gestures are identified as repeated patterns of motion relating to musical structure, which not only express phrasing and structural hierarchy but are consistently and accurately interpreted as such by a perceiving audience. Gestural characteristics are analysed across performers and performance styles using two Chopin preludes selected for their diverse yet comparable structures (Opus 28 Nos. 7 and 6). Effects on perceptual judgements of presentation modes (visual-only, auditory-only, audio-visual, full- and point-light) and viewing conditions are explored. This thesis argues that while performance style is highly idiosyncratic, piano performers reliably generate structural gestures through repeated patterns of upper-body movement. The shapes and locations of phrasing motions particular to the sample of performers investigated are identified. Findings demonstrate that, despite the personalised nature of the gestures, performers use increased velocity of movement to emphasise musical structure, and that observers accurately and consistently locate phrasing junctures where these patterns and variations in motion magnitude, shape and velocity occur. By viewing performance motions in polar (spherical) rather than Cartesian coordinate space it is possible to get mathematically closer to the movement generated by each of the nine performers, revealing distinct patterns of motion relating to phrasing structures, regardless of intended performance style. These patterns are highly individualised to both performer and piece. Instantaneous velocity analysis indicates a right-directed bias of performance motion variation at salient structural features within individual performances. Perceptual analyses demonstrate that audience members are able to accurately and effectively detect phrasing structure from performance motion alone. This ability persists even for degraded point-light performances, where all extraneous environmental information has been removed. The relative contributions of audio, visual and audio-visual judgements demonstrate that the visual component of a performance has a positive impact on the overall accuracy of phrasing judgements, indicating that receivers are most effective in their recognition of structural segmentations when they can both see and hear a performance. Observers appear to make use of a rapid online judgement heuristic, adjusting response processes quickly to adapt and perform accurately across multiple modes of presentation and performance style.
In line with existing theories in the literature, it is proposed that this processing ability may be related to the cognitive and perceptual interpretation of syntax within gestural communication during social interaction and speech. The findings of this research may have future impact on performance pedagogy, computational analysis and performance research, as well as potentially informing future investigations of the cognitive aspects of musical and gestural understanding.
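The two measurements underlying the motion analyses are standard; a small sketch follows, in which the axis conventions (theta as inclination from the z-axis, phi as azimuth) are assumptions rather than the thesis's documented choices:

```python
# Re-express Vicon marker trajectories in spherical (polar)
# coordinates, plus the frame-wise speed used for velocity analyses.
import numpy as np

def to_spherical(xyz):
    """Convert an (n, 3) trajectory to (r, theta, phi) per frame."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    safe_r = np.where(r == 0, 1.0, r)            # avoid division by zero
    theta = np.arccos(np.clip(z / safe_r, -1.0, 1.0))  # inclination
    phi = np.arctan2(y, x)                             # azimuth
    return np.stack([r, theta, phi], axis=1)

def instantaneous_speed(xyz, fps):
    """Frame-to-frame speed of a trajectory sampled at fps Hz."""
    return np.linalg.norm(np.gradient(xyz, axis=0), axis=1) * fps
```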
Abstract:
Audiometry is the main way in which hearing is evaluated, because it is a universal and standardized test. Speech tests are difficult to standardize due to the variables involved; nevertheless, their performance in the presence of competing noise is of great importance. Aim: To characterize speech intelligibility in silence and in competing noise in individuals exposed to electronically amplified music. Material and Method: The study was performed with 20 university students presenting normal hearing thresholds. The speech recognition rate (SRR) was measured after fourteen hours of sound rest, after exposure to electronically amplified music, and once again after sound rest, in three stages: without competing noise, and in the presence of babble-type competing noise in monotic listening, at a signal/noise ratio of +5 dB and at a signal/noise ratio of 5 dB. Results: The SRR was more impaired after exposure to the music and with competing noise, and as the signal/noise ratio decreased, the individuals' performance on the test also decreased. Conclusion: The inclusion of competing noise in speech tests in the audiological routine is important, because it represents the real disadvantage experienced by individuals in everyday listening.
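For context, a speech-in-noise condition such as the +5 dB one can be constructed by scaling the babble noise against the speech power; a minimal sketch, assuming `speech` and `babble` are aligned mono float arrays:

```python
# Scale noise so the speech/noise mixture hits a target SNR in dB.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise scaled to the requested SNR."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# mixture = mix_at_snr(speech, babble, snr_db=5)
```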
Abstract:
Dissertation presented to obtain the degree of Master in Mathematics Education in Pre-School Education and in the 1st and 2nd Cycles of Basic Education, in the speciality of Didactics of Mathematics.
Abstract:
In opening up to the outside world, the "museum" has acquired new forms and new expressions. The complexity of museological activity thus leads to new representations that alter the initial image of the museum as a building with objects. Its 'boundaries' are now less sharp, not only in its spatial relationships but also in its temporal dimension, creating an additional challenge: the recognition of the museum itself. Design, as a transdisciplinary activity, thereby assumes a key role in the communication of museums, in their visual representation and in the recognition of their action. The present study results from a survey conducted in 2010 of 364 Portuguese museums (from a universe of 849 museums), presenting an analysis of the basic elements of their visual identity (name, logo, symbol, and color).
Abstract:
Dissertation presented at the Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, to obtain the degree of Master in Computer Science Engineering.
Abstract:
Master's degree in Computer Science Engineering, with specialisation in Knowledge and Decision Technologies.
Abstract:
In this work an adaptive modeling and spectral estimation scheme based on dual Discrete Kalman Filtering (DKF) is proposed for speech enhancement. Both the speech and noise signals are modeled by an autoregressive structure, which provides an underlying time-frame dependency and improves time-frequency resolution. The model parameters are arranged to obtain a combined state-space model and are also used to calculate instantaneous power spectral density estimates. The speech enhancement is performed by a dual discrete Kalman filter that simultaneously estimates the models and the signals. This approach is particularly useful as a pre-processing module for parametric speech recognition systems that rely on spectral, time-dependent models. The system's performance has been evaluated by a set of human listeners and by spectral distances. In both cases the use of this pre-processing module led to improved results.
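A minimal sketch of the signal-estimation half of such a scheme, assuming known AR coefficients `a` and noise variances `q` and `r`; the second, parameter-tracking filter that makes the scheme "dual" is omitted:

```python
# Discrete Kalman filter whose state holds the last p speech samples
# of an AR(p) model in companion (state-space) form.
import numpy as np

def kalman_ar_enhance(y, a, q, r):
    """Filter noisy speech y given AR coefficients a, process-noise
    variance q and additive observation-noise variance r."""
    y = np.asarray(y, dtype=float)
    p = len(a)
    F = np.zeros((p, p))                   # companion transition matrix
    F[0, :] = a
    F[1:, :-1] = np.eye(p - 1)
    H = np.zeros((1, p)); H[0, 0] = 1.0    # observe the newest sample
    x, P = np.zeros((p, 1)), np.eye(p)
    out = np.empty_like(y)
    for t, yt in enumerate(y):
        x = F @ x                          # time update (predict)
        P = F @ P @ F.T
        P[0, 0] += q                       # process noise drives sample 0
        S = (H @ P @ H.T + r).item()       # innovation variance
        K = P @ H.T / S                    # Kalman gain
        x = x + K * (yt - (H @ x).item())  # measurement update
        P = (np.eye(p) - K @ H) @ P
        out[t] = x[0, 0]
    return out
```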
Abstract:
Speech interfaces for Assistive Technologies are not common and are usually replaced by other interaction modalities. The market they target is not considered attractive, and speech technologies are still not widespread. Industry still considers that they carry some performance risks, especially speech recognition systems. As speech is the most elemental and natural means of communication, it has strong potential for enhancing inclusion and quality of life for broader groups of users with special needs, such as people with cerebral palsy and elderly people staying in their own homes. This work is a position paper in which the authors argue for the need to make speech the basic interface in assistive technologies. Among the main arguments, we can state: speech is the easiest way to interact with machines; there is a growing market for embedded speech in assistive technologies, since the number of disabled and elderly people is expanding; speech technology is already mature enough to be used but needs adaptation to people with special needs; and there is still a lot of R&D to be done in this area, especially when considering the Portuguese market. The main challenges are presented and future directions are proposed.
Abstract:
This project provides an introduction to speech recognisers, how they work and their mathematical foundations. Once these concepts are clear, we present the method we followed to build our own speech recogniser for Catalan using the HTK tools. Its strengths and weaknesses are evaluated through various tests of its components. In addition, the project rounds off this work by implementing an automatic dictation system that exploits the speech recogniser using Julius.
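A hedged sketch of the decoding steps only, driven from Python; every file name and the jconf contents below are assumptions, not the project's actual configuration:

```python
# Invoke the HTK recognizer and Julius dictation from Python.
import subprocess

# Recognize a pre-extracted HTK feature file with the trained models
subprocess.run(['HVite', '-H', 'hmm/hmmdefs', '-w', 'wdnet',
                '-i', 'recout.mlf', 'dict', 'monophones', 'test.mfc'],
               check=True)

# Live dictation with Julius; the jconf file points at the acoustic
# model, dictionary and language model built for Catalan
subprocess.run(['julius', '-C', 'dictation.jconf', '-input', 'mic'],
               check=True)
```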
Abstract:
Current research on sleep using experimental animals is limited by the expense and time-consuming nature of traditional EEG/EMG recordings. We present here an alternative, noninvasive approach utilizing piezoelectric films configured as highly sensitive motion detectors. These film strips, attached to the floor of the rodent cage, produce an electrical output in direct proportion to the distortion of the material. During sleep, movement associated with breathing is the predominant gross body movement; thus, the output from the piezoelectric transducer provides an accurate respiratory trace during sleep. During wake, respiratory movements are masked by other motor activity. An automatic pattern recognition system was developed to identify periods of sleep and wake from the piezoelectric signal. Because of the complex and highly variable waveforms that result from subtle postural adjustments in the animals, traditional signal analysis techniques were not sufficient for accurate classification of sleep versus wake. Therefore, a novel pattern recognition algorithm was developed that successfully distinguished sleep from wake in approximately 95% of all epochs. This algorithm may have general utility for a variety of signals in biomedical and engineering applications. This automated system for monitoring sleep is noninvasive and inexpensive, and may be useful for large-scale sleep studies, including genetic approaches to understanding sleep and sleep disorders and the rapid screening of the efficacy of sleep- or wake-promoting drugs.
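The abstract does not specify the algorithm, so the following is only a generic stand-in illustrating the idea of epoch-wise classification: score each epoch by how much piezo power falls in an assumed rodent breathing band and threshold it:

```python
# Generic stand-in, not the paper's algorithm: label an epoch as
# sleep when power in an assumed breathing band (~1-4 Hz) dominates
# the piezo signal's spectrum.
import numpy as np
from scipy.signal import welch

def classify_epoch(signal, fs, band=(1.0, 4.0), ratio_thresh=0.5):
    """Return 'sleep' if breathing-band power dominates the epoch."""
    f, psd = welch(signal, fs=fs, nperseg=min(len(signal), 1024))
    in_band = psd[(f >= band[0]) & (f <= band[1])].sum()
    return 'sleep' if in_band / psd.sum() > ratio_thresh else 'wake'
```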
Abstract:
It has been demonstrated in earlier studies that patients with a cochlear implant have increased abilities for audio-visual integration, because the crude information transmitted by the cochlear implant requires persistent use of the complementary speech information from the visual channel. The brain network underlying these abilities needs to be clarified. We used an independent component analysis (ICA) of activation (H2(15)O) positron emission tomography data to explore occipito-temporal brain activity in post-lingually deaf patients with unilaterally implanted cochlear implants, at several months post-implantation (T1) and shortly after implantation (T0), and in normal-hearing controls. In the between-group analysis, patients at T1 had greater blood flow in the left middle temporal cortex compared with T0 and with normal-hearing controls. In the within-group analysis, patients at T0 had a task-related ICA component in the visual cortex, and patients at T1 had one task-related ICA component in the left middle temporal cortex and another in the visual cortex. The time courses of the temporal and visual activities during the positron emission tomography examination at T1 were highly correlated, meaning that synchronized integrative activity occurred. The greater involvement of the visual cortex and its close coupling with the temporal cortex at T1 confirm the importance of audio-visual integration in more experienced cochlear implant subjects at the cortical level.
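A sketch of the analysis idea using scikit-learn's FastICA (a common ICA implementation, not necessarily the one used in the study); `data` is an assumed, already preprocessed scans-by-voxels activation matrix:

```python
# Separate task-related spatial components from PET activation data.
from sklearn.decomposition import FastICA

ica = FastICA(n_components=10, random_state=0)
time_courses = ica.fit_transform(data)    # (n_scans, n_components)
spatial_maps = ica.components_            # (n_components, n_voxels)
```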