904 results for audio-visual automatic speech recognition
Abstract:
This thesis is composed of two main parts. The first addressed the question of whether the auditory and somatosensory systems, like their visual counterpart, comprise parallel functional pathways for processing identity and spatial attributes (so-called 'what' and 'where' pathways, respectively). The second part examined the independence of control processes mediating task switching across 'what' and 'where' pathways in the auditory and visual modalities. Concerning the first part, electrical neuroimaging of event-related potentials identified the spatio-temporal mechanisms subserving auditory (see Appendix, Study n°1) and vibrotactile (see Appendix, Study n°2) processing during two types of blocks of trials. 'What' blocks varied stimuli in their frequency independently of their location. 'Where' blocks varied the same stimuli in their location independently of their frequency. Concerning the second part (see Appendix, Study n°3), a psychophysical task-switching paradigm was used to investigate the hypothesis that the efficacy of control processes depends on the extent of overlap between the neural circuitry mediating the different tasks at hand, such that more effective task preparation (and, by extension, smaller switch costs) is achieved when the anatomical/functional overlap of this circuitry is small. Performance costs associated with switching tasks and/or switching sensory modalities were measured. Tasks required the analysis of either the identity or the spatial location of environmental objects ('what' and 'where' tasks, respectively) that were presented either visually or acoustically on any given trial. Pretrial cues informed participants of the upcoming task, but not of the sensory modality. In the audio-visual domain, the results showed that switch costs between tasks were significantly smaller when the sensory modality of the task switched versus when it repeated. In addition, switch costs between the senses were correlated only when the sensory modality of the task repeated across trials and not when it switched. The collective evidence not only supports the independence of control processes mediating task switching and modality switching, but also the hypothesis that switch costs reflect competitive interference between neural circuits, which can in turn be diminished when these circuits are distinct. In the auditory and somatosensory domains, the findings show that a segregation of location vs. recognition information is observed across sensory systems and that this segregation occurs around 100 ms in both sensory modalities. Our results also show that functionally specialized pathways for audition and somatosensation involve largely overlapping brain regions, i.e. posterior superior and middle temporal cortices and inferior parietal areas. Both of these properties (synchrony of differential processing and overlapping brain regions) probably optimize the relationships across sensory modalities. These results may therefore be indicative of a computationally advantageous organization for processing spatial and identity information.
Abstract:
This document is an introduction to the Dragon Naturally Speaking and Audacity tools, which are specialized in optimizing the transcription of audio files.
Abstract:
Understanding the basis on which recruiters form hirability impressions for a job applicant is a key issue in organizational psychology and can be addressed as a social computing problem. We approach the problem from a face-to-face, nonverbal perspective where behavioral feature extraction and inference are automated. This paper presents a computational framework for the automatic prediction of hirability. To this end, we collected an audio-visual dataset of real job interviews where candidates were applying for a marketing job. We automatically extracted audio and visual behavioral cues related to both the applicant and the interviewer. We then evaluated several regression methods for the prediction of hirability scores and showed the feasibility of conducting such a task, with ridge regression explaining 36.2% of the variance. Feature groups were analyzed, and two main groups of behavioral cues were predictive of hirability: applicant audio features and interviewer visual cues, showing the predictive validity of cues related not only to the applicant, but also to the interviewer. As a last step, we analyzed the predictive validity of psychometric questionnaires often used in the personnel selection process, and found that these questionnaires were unable to predict hirability, suggesting that hirability impressions were formed based on the interaction during the interview rather than on questionnaire data.
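For readers who want the shape of the analysis, here is a minimal, hypothetical sketch of hirability-score prediction with ridge regression and cross-validation. The feature matrix and ratings below are synthetic stand-ins, not the paper's dataset or cue set.

```python
# Minimal sketch: ridge regression on audio-visual behavioral cues.
# The features and targets here are hypothetical stand-ins; the study
# used cues extracted from real job-interview recordings.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Hypothetical design matrix: one row per interview, columns standing in
# for cues such as applicant speaking time or interviewer nodding rate.
X = rng.normal(size=(60, 8))
# Hypothetical hirability scores (e.g., averaged recruiter ratings).
y = X @ rng.normal(size=8) + rng.normal(scale=1.0, size=60)

model = Ridge(alpha=1.0)  # L2-regularized linear regression
y_hat = cross_val_predict(model, X, y, cv=10)
print(f"Variance explained (R^2): {r2_score(y, y_hat):.3f}")
```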
Abstract:
As part of the Affective Computing research field, the development of automatic affective recognition systems can enhance human-computer interaction by allowing the creation of interfaces that react to the user's emotional state. To that end, this Master's Thesis brings affect recognition to today's most widely used human-computer interface, mobile devices, by developing a facial expression recognition system able to perform detection under the difficult viewing-angle and illumination conditions that interaction with a mobile device entails. Moreover, this Master's Thesis proposes to combine emotional features detected from expression with contextual information about the current situation in order to infer a complex and extensive emotional state of the user. To this end, a cognitive computational model of emotion is defined that provides a multicomponential affective state of the user through the integration of the detected emotional features into appraisal processes. In order to account for individual differences in emotional experience, these processes can be adapted to the culture and personality of the user.
Abstract:
The present report describes the development of a technique for automatic wheezing recognition in digitally recorded lung sounds. The method is based on extracting and processing spectral information from the respiratory cycle and using these data for user feedback and automatic recognition. The respiratory cycle is first pre-processed in order to normalize its spectral information, and its spectrogram is then computed. The spectrogram image is next processed by a two-dimensional convolution filter and a half-threshold in order to increase the contrast and isolate its highest-amplitude components, respectively. Then, in order to produce more compact data for automatic recognition, the spectral projection of the processed spectrogram is computed and stored as an array. The highest-magnitude values of the array and their respective spectral values are located and used as inputs to a multi-layer perceptron artificial neural network, which produces an automatic indication of the presence of wheezes. For validation of the methodology, lung sounds recorded from three different repositories were used. The results show that the proposed technique achieves 84.82% accuracy in the detection of wheezing for an isolated respiratory cycle and 92.86% accuracy when detection is carried out using groups of respiratory cycles obtained from the same person. The system also presents the original recorded sound and the post-processed spectrogram image so that users can draw their own conclusions from the data.
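A hedged sketch of this pipeline is given below: spectrogram computation, 2-D convolution filtering and thresholding, spectral projection, and an MLP on the highest-magnitude components. All parameter values (filter size, threshold, network shape) are illustrative guesses rather than the report's settings, and the training data are synthetic.

```python
# Hedged sketch of the described pipeline: spectrogram -> 2-D convolution
# and thresholding -> spectral projection -> features for an MLP.
import numpy as np
from scipy.signal import convolve2d, spectrogram
from sklearn.neural_network import MLPClassifier

def wheeze_features(cycle, fs, n_peaks=5):
    """Compress one respiratory cycle into a small spectral feature vector."""
    f, t, S = spectrogram(cycle, fs=fs, nperseg=256, noverlap=128)
    S = S / (S.max() + 1e-12)            # normalize spectral energy
    kernel = np.ones((3, 3)) / 9.0       # smoothing convolution filter (illustrative)
    S = convolve2d(S, kernel, mode="same")
    S[S < 0.5 * S.max()] = 0.0           # "half-threshold": keep strong components
    projection = S.sum(axis=1)           # spectral projection over time
    idx = np.argsort(projection)[-n_peaks:]  # highest-magnitude frequency bins
    return np.concatenate([projection[idx], f[idx]])

# Hypothetical training data: rows of features, 1 = wheeze present.
fs = 8000
rng = np.random.default_rng(0)
X = np.stack([wheeze_features(rng.normal(size=fs * 2), fs) for _ in range(40)])
y = rng.integers(0, 2, size=40)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X, y)
print(clf.predict(X[:3]))
```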
Abstract:
This study examines the singular gesture that emerges from the work of the Senegalese filmmaker Djibril Diop Mambety. A force of "bringing into presence" is identified in it, which the present research shows to be akin to the mediating action of the griot in the oral traditions of West Africa. Remarkably, this force rooted in orality does not rest on narrative or on speech as discourse; on the contrary, it arises from narrative ruptures and image-sound disjunctions that call the narrative into question, inviting spectators to frequently revise their interpretation of what they see and hear. It is the film itself that then becomes a griot, actualizing an ever-changing link between the universe it carries and its spectator. By establishing a critical relationship to the world in which the narrative is set, the multiple ruptures in Mambety's cinema are also the breaches through which a welcoming space is created for marginality, which inhabits all of his films. The oral tradition and the griot are presented first, so as to lay the foundations from which the reflection can develop. The description and analysis of the films Parlons Grand-mère and Le franc demonstrate how these are mediating films that behave like griots. This finding opens the way to a broader reflection on mediation in cinema, in which the ethical import of the mediating film is explored, as well as the nature of the possible relations between mediation and narrative. Finally, the analysis of the film Hyènes, in light of the difference it presents by deploying a more linear narrative, provides an opportunity to deepen an understanding both of what Mambety's films do and of what mediation in cinema can do more broadly.
Abstract:
Motivation for speaker recognition work is presented in the first part of the thesis, along with an exhaustive survey of past work in this field. A low-cost system not involving complex computation was chosen for implementation. Toward this end, a PC-based system was designed and developed: a front-end analog-to-digital converter (12-bit) was built and interfaced to a PC, and software was developed to control the ADC and to perform various analytical functions, including feature-vector evaluation. It is shown that a fixed set of phrases incorporating evenly balanced phonemes is aptly suited to the speaker recognition task at hand, and a set of phrases was chosen for recognition. Two new methods are adopted for feature evaluation. Some new measurements, involving a symmetry-check method for pitch period detection and ACE, are used as features. Arguments are provided to show the need for a new model of speech production. Starting from heuristics, a knowledge-based (KB) speech production model is presented. In this model, a KB provides impulses to a voice-producing mechanism, and constant correction is applied via a feedback path; it is this correction that differs from speaker to speaker. Methods of defining measurable parameters for use as features are described. Algorithms for speaker recognition are developed and implemented, and two methods are presented. The first is based on the postulated model: the entropy of the utterance of a phoneme is evaluated, and the transitions of voiced regions are used as speaker-dependent features. The second method uses features found in other works, but evaluated differently; a knock-out scheme is used to provide the weightage values for the selection of features. Implementation results are presented, showing an average recognition rate of 80%. It is also shown that performance deteriorates when there are long gaps between sessions, in a speaker-dependent manner. Cross-recognition percentages are also presented; these rise to 30% in the worst case, while the best case is 0%. Suggestions for further work are given in the concluding chapter.
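As context for the pitch-based features mentioned above, the sketch below estimates a pitch period from a voiced frame. It uses plain autocorrelation as a well-known stand-in, since the thesis's symmetry-check method is not specified here; the signal and sampling rate are toy values.

```python
# Minimal sketch of pitch-period detection on a voiced frame.
# Autocorrelation substitutes for the thesis's symmetry-check method.
import numpy as np

def pitch_period(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch period (in samples) of a voiced frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # plausible lag range
    return lo + int(np.argmax(ac[lo:hi]))    # lag of strongest periodicity

# Synthetic 120 Hz voiced frame as a quick check.
fs = 8000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sign(np.sin(2 * np.pi * 120 * t))  # crude glottal-like pulse train
print(fs / pitch_period(frame, fs))           # ~120 Hz
```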
Abstract:
Development of Malayalam speech recognition systems is still in its infancy, although much work has been done in other Indian languages. In this paper we present the first work on a speaker-independent Malayalam isolated speech recognizer based on PLP (Perceptual Linear Prediction) cepstral coefficients and Hidden Markov Models (HMMs). The performance of the developed system has been evaluated with different numbers of HMM states. The system was trained with 21 male and female speakers in the age group ranging from 19 to 41 years, and obtained an accuracy of 99.5% on unseen data.
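As a rough illustration of this kind of recognizer, the sketch below trains one Gaussian HMM per word on cepstral features and classifies a test utterance by log-likelihood. MFCCs (via librosa) stand in for the paper's PLP coefficients, and the state count and training lists are illustrative, not the paper's setup.

```python
# Hedged sketch: isolated-word recognition with one Gaussian HMM per word,
# scored by log-likelihood. MFCCs substitute for PLP cepstral coefficients.
import numpy as np
import librosa
from hmmlearn import hmm

def features(signal, sr):
    # (frames, coefficients) matrix; the paper uses PLP instead of MFCC.
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T

def train_word_model(train_signals, sr, n_states=5):
    """Fit one HMM on all training utterances of a single word."""
    feats = [features(s, sr) for s in train_signals]
    X = np.vstack(feats)
    lengths = [f.shape[0] for f in feats]  # per-utterance frame counts
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=100)
    return model.fit(X, lengths)

def recognize(signal, sr, models):
    # Pick the word whose HMM assigns the highest log-likelihood.
    obs = features(signal, sr)
    return max(models, key=lambda word: models[word].score(obs))
```

In use, `models` would map each vocabulary word to its trained HMM, e.g. `models = {w: train_word_model(utterances[w], sr) for w in vocabulary}`.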
Abstract:
Sketches are commonly used in the early stages of design. Our previous system allows users to sketch mechanical systems that the computer interprets. However, some parts of the mechanical system might be too hard or too complicated to express in the sketch. Adding speech recognition to create a multimodal system would move us toward our goal of creating a more natural user interface. This thesis examines the relationship between the verbal and sketch input, particularly how to segment and align the two inputs. Toward this end, subjects were recorded while they sketched and talked. These recordings were transcribed, and a set of rules to perform segmentation and alignment was created. These rules represent the knowledge that the computer needs to perform segmentation and alignment. The rules successfully interpreted the 24 data sets that they were given.
Abstract:
This thesis presents a perceptual system for a humanoid robot that integrates abilities such as object localization and recognition with the deeper developmental machinery required to forge those competences out of raw physical experiences. It shows that a robotic platform can build up and maintain a system for object localization, segmentation, and recognition, starting from very little. What the robot starts with is a direct solution to achieving figure/ground separation: it simply 'pokes around' in a region of visual ambiguity and watches what happens. If the arm passes through an area, that area is recognized as free space. If the arm collides with an object, causing it to move, the robot can use that motion to segment the object from the background. Once the robot can acquire reliable segmented views of objects, it learns from them, and from then on recognizes and segments those objects without further contact. Both low-level and high-level visual features can also be learned in this way, and examples are presented for both: orientation detection and affordance recognition, respectively. The motivation for this work is simple. Training on large corpora of annotated real-world data has proven crucial for creating robust solutions to perceptual problems such as speech recognition and face detection. But the powerful tools used during training of such systems are typically stripped away at deployment. Ideally they should remain, particularly for unstable tasks such as object detection, where the set of objects needed in a task tomorrow might be different from the set of objects needed today. The key limiting factor is access to training data, but as this thesis shows, that need not be a problem on a robotic platform that can actively probe its environment, and carry out experiments to resolve ambiguity. This work is an instance of a general approach to learning a new perceptual judgment: find special situations in which the perceptual judgment is easy and study these situations to find correlated features that can be observed more generally.
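In the spirit of the poking experiments described above, here is a minimal, assumption-laden sketch of motion-based figure/ground separation with OpenCV: difference a frame before and after a poke, threshold, and keep the largest moving region. The parameters are illustrative; this is the idea, not the thesis's implementation.

```python
# Minimal sketch: segment an object by the motion a poke induces.
import cv2
import numpy as np

def segment_by_motion(frame_before, frame_after, thresh=25):
    g0 = cv2.cvtColor(frame_before, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_after, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g0, g1)  # motion energy between the two frames
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    # Remove speckle noise, then keep the largest connected moving region.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n <= 1:
        return None  # nothing moved
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    return (labels == largest).astype(np.uint8) * 255  # object mask
```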
Abstract:
A fast simulated annealing algorithm is developed for automatic object recognition. The normalized correlation coefficient is used as a measure of the match between a hypothesized object and an image. Templates are generated on-line during the search by transforming model images. Simulated annealing reduces the search time by orders of magnitude with respect to an exhaustive search. The algorithm is applied to the problem of how landmarks, for example, traffic signs, can be recognized by an autonomous vehicle or a navigating robot. The algorithm works well in noisy, real-world images of complicated scenes for model images with high information content.
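The sketch below illustrates the core loop under simplifying assumptions: simulated annealing over template position only (the original algorithm also generates templates by transforming model images), scored by the normalized correlation coefficient, with illustrative schedule constants.

```python
# Hedged sketch: simulated annealing over template position, scored by
# the normalized correlation coefficient (NCC). Translation-only search.
import numpy as np

def ncc(patch, template):
    """Normalized correlation coefficient between two equal-size arrays."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum()) + 1e-12
    return float((p * t).sum() / denom)

def anneal_match(image, template, steps=5000, t0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    h, w = template.shape
    H, W = image.shape
    pos = np.array([rng.integers(0, H - h + 1), rng.integers(0, W - w + 1)])
    score = ncc(image[pos[0]:pos[0]+h, pos[1]:pos[1]+w], template)
    for k in range(steps):
        T = t0 * (1 - k / steps)                 # linear cooling schedule
        cand = pos + rng.integers(-5, 6, size=2)  # local random move
        cand[0] = np.clip(cand[0], 0, H - h)
        cand[1] = np.clip(cand[1], 0, W - w)
        s = ncc(image[cand[0]:cand[0]+h, cand[1]:cand[1]+w], template)
        # Accept improvements always, worse moves with Boltzmann probability.
        if s > score or rng.random() < np.exp((s - score) / max(T, 1e-9)):
            pos, score = cand, s
    return pos, score  # best (row, col) hypothesis and its NCC
```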
Abstract:
This paper sketches a hypothetical cortical architecture for visual 3D object recognition based on a recent computational model. The view-centered scheme relies on modules for learning from examples, such as HyperBF-like networks. Such models capture a class of explanations we call Memory-Based Models (MBMs) that includes sparse population coding, memory-based recognition, and codebooks of prototypes. Unlike the sigmoidal units of some artificial neural networks, the units of MBMs are consistent with the description of cortical neurons. We describe how an example of an MBM may be realized in terms of cortical circuitry and biophysical mechanisms, consistent with psychophysical and physiological data.
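For concreteness, a minimal HyperBF-style computation is sketched below: the response is a weighted sum of Gaussian radial basis functions centered on stored prototype views. The centers, weights, and width are toy values, not the paper's model.

```python
# Minimal sketch of a HyperBF-style unit: a weighted sum of radial basis
# functions centered on stored prototype "views".
import numpy as np

def hyperbf(x, centers, weights, sigma=1.0):
    # Gaussian basis activations, one per stored prototype.
    d2 = ((x - centers) ** 2).sum(axis=1)
    phi = np.exp(-d2 / (2 * sigma ** 2))
    return weights @ phi  # sparse population code -> scalar response

# Toy example: three prototype views of an object in a 2-D feature space.
centers = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
weights = np.array([0.5, 1.0, 0.8])
print(hyperbf(np.array([0.9, 0.4]), centers, weights))
```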
Abstract:
To create audio-visual material. To improve the quality of teaching. To study the application of audio-visual programs in the classroom. To seek a methodology suited to the didactic use of audio-visual media. To verify the differences that may exist between different audio-visual media (slides vs. video). The sample consists of the children of three second-year BUP classes at the Colegio Escoles Pies de Sarrià (Barcelona): 102 subjects in total, all of whom had studied first-year BUP at the same school. The theoretical framework is laid out, and the variables are described (audio-visual media, academic performance, previous academic performance, methodology, intelligence, social class, teacher, and age). The sample is described and divided into three classes (no audio-visual medium, with video, with slides). The audio-visual material is produced, the corresponding sessions are held in each class, and the objective test is administered. The data are then analyzed, and conclusions and alternatives are offered. Instruments: an objective performance test, the Test d'aptituds diferencials, and a scale of previous scores. Difference of means, descriptive statistics, analysis of variance, and the Scheffé test were used to establish whether there are differences between the groups that worked with an audio-visual medium, a visual medium, and no audio-visual medium. The experimental methodology applied did not produce the expected results; there are reasons to assert that uncontrolled factors, extraneous to the experiment, intervened. A strong interest among the students in the use of video as a motivational element is noted. The importance of working in this field by creating suitable active methodologies and series of valid programs is highlighted; intensive research into the possibilities and effects of such methodologies is needed.
Abstract:
This paper discusses a study on postlingual cochlear implantees and the effectiveness of the CST in evaluating enhancement of speech recognition abilities.
Abstract:
Difficulty understanding speech in the presence of background noise is a common report among cochlear implant recipients. The purpose of this research is to evaluate speech processing options currently available in the Cochlear Nucleus 5 sound processor to determine the best option for improving speech recognition in noise.