873 results for Audio-visual Speech Recognition, Visual Feature Extraction, Free-parts, Monolithic, ROI
Abstract:
This thesis studies models of high-dimensional sequences based on recurrent neural networks (RNNs) and their application to music and speech. Although in principle RNNs can represent the long-term dependencies and complex temporal dynamics of sequences of interest such as video, audio and natural language, they have not been used to their full potential since their introduction by Rumelhart et al. (1986a) because of the difficulty of training them efficiently by gradient descent. Recently, the successful application of Hessian-free optimization and other advanced training techniques has led to a resurgence of their use in several state-of-the-art systems. The work of this thesis is part of this development. The central idea is to exploit the flexibility of RNNs to learn a probabilistic description of sequences of symbols, that is, high-level information associated with the observed signals, which in turn can serve as a prior to improve the accuracy of information retrieval. For example, by modeling the evolution of groups of notes in polyphonic music, of chords in a harmonic progression, of phonemes in a spoken utterance or of individual sources in an audio mixture, we can significantly improve methods for polyphonic transcription, chord recognition, speech recognition and audio source separation, respectively. The practical application of our models to these tasks is detailed in the last four articles presented in this thesis. In the first article, we replace the output layer of an RNN with conditional restricted Boltzmann machines to describe much richer multimodal output distributions. In the second article, we evaluate and propose advanced methods for training RNNs.
In the last four articles, we examine different ways of combining our symbolic models with deep networks and non-negative matrix factorization, notably through products of experts, input/output architectures and generative frameworks that generalize hidden Markov models. We also propose and analyze efficient inference methods for these models, such as chronological greedy search, high-dimensional beam search, pruned beam search and gradient descent. Finally, we address the issues of label bias, teacher forcing, temporal smoothing, regularization and pre-training.
Abstract:
Content-Based Image Retrieval is one of the prominent areas in Computer Vision and Image Processing. Recognition of handwritten characters has been a popular area of research for many years and still remains an open problem. The proposed system uses visual image queries to retrieve similar images from a database of Malayalam handwritten characters. Local Binary Pattern (LBP) descriptors are extracted from the query images, and those features are compared with the features of the images in the database to retrieve the desired characters. With local binary patterns, the system gives excellent retrieval performance.
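The retrieval step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the basic 8-neighbour LBP operator on a grayscale image given as a list of rows, a 256-bin histogram as the descriptor, and chi-square distance for comparison. The abstract does not state which LBP variant or distance was used, and the function names are hypothetical.

```python
from collections import Counter

def lbp_code(img, r, c):
    """8-neighbour LBP code for pixel (r, c): a bit is 1 where neighbour >= centre."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Normalised 256-bin histogram of LBP codes over the interior pixels."""
    rows, cols = len(img), len(img[0])
    counts = Counter(
        lbp_code(img, r, c) for r in range(1, rows - 1) for c in range(1, cols - 1)
    )
    total = sum(counts.values())
    return [counts.get(k, 0) / total for k in range(256)]

def chi_square(h1, h2):
    """Chi-square distance between two histograms (0 means identical)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def retrieve(query, database):
    """Return database labels ordered from most to least similar to the query."""
    q = lbp_histogram(query)
    return sorted(database, key=lambda label: chi_square(q, lbp_histogram(database[label])))
```

Because the code compares each pixel only with its neighbours, the descriptor is unchanged by monotonic changes in brightness, which is one reason LBP is often chosen for scanned handwriting.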
Abstract:
Speech is the primary, most prominent and convenient means of communication in audible language. Through speech, people can express their thoughts, feelings or perceptions by the articulation of words. Human speech is a complex signal that is non-stationary in nature. It carries rich information about the words spoken and about the speaker's accent, attitude, expression, intention, sex, emotion and style. The main objective of Automatic Speech Recognition (ASR) is to identify what people say by means of computer algorithms. This enables people to communicate with a computer in a natural spoken language. Automatic recognition of speech by machines has been one of the most exciting, significant and challenging areas of research in the field of signal processing over the past five to six decades. Despite the developments and intensive research in this area, the performance of ASR is still lower than that of speech recognition by humans and has yet to reach a completely reliable level. The main objective of this thesis is to develop an efficient speech recognition system for recognising speaker-independent isolated words in Malayalam.
Abstract:
This thesis presents a perceptual system for a humanoid robot that integrates abilities such as object localization and recognition with the deeper developmental machinery required to forge those competences out of raw physical experiences. It shows that a robotic platform can build up and maintain a system for object localization, segmentation, and recognition, starting from very little. What the robot starts with is a direct solution to achieving figure/ground separation: it simply 'pokes around' in a region of visual ambiguity and watches what happens. If the arm passes through an area, that area is recognized as free space. If the arm collides with an object, causing it to move, the robot can use that motion to segment the object from the background. Once the robot can acquire reliable segmented views of objects, it learns from them, and from then on recognizes and segments those objects without further contact. Both low-level and high-level visual features can also be learned in this way, and examples are presented for both: orientation detection and affordance recognition, respectively. The motivation for this work is simple. Training on large corpora of annotated real-world data has proven crucial for creating robust solutions to perceptual problems such as speech recognition and face detection. But the powerful tools used during training of such systems are typically stripped away at deployment. Ideally they should remain, particularly for unstable tasks such as object detection, where the set of objects needed in a task tomorrow might be different from the set of objects needed today. The key limiting factor is access to training data, but as this thesis shows, that need not be a problem on a robotic platform that can actively probe its environment, and carry out experiments to resolve ambiguity. 
This work is an instance of a general approach to learning a new perceptual judgment: find special situations in which the perceptual judgment is easy and study these situations to find correlated features that can be observed more generally.
Abstract:
With many visual speech animation techniques now available, there is a clear need for systematic perceptual evaluation schemes. We describe here our scheme and its application to a new video-realistic (potentially indistinguishable from real recorded video) visual-speech animation system, called Mary 101. Two types of experiments were performed: a) distinguishing visually between real and synthetic image-sequences of the same utterances ("Turing tests") and b) gauging visual speech recognition by comparing lip-reading performance on the real and synthetic image-sequences of the same utterances ("Intelligibility tests"). Subjects who were presented randomly with either real or synthetic image-sequences could not tell the synthetic from the real sequences above chance level. The same subjects, when asked to lip-read the utterances from the same image-sequences, recognized speech from real image-sequences significantly better than from synthetic ones. However, performance for both real and synthetic sequences was at levels suggested in the literature on lip-reading. We conclude from the two experiments that the animation of Mary 101 is adequate for providing a percept of a talking head. However, additional effort is required to improve the animation for lip-reading purposes such as rehabilitation and language learning. In addition, these two tasks can be considered explicit and implicit perceptual discrimination tasks. In the explicit task (a), each stimulus is classified directly as a synthetic or real image-sequence by detecting a possible difference between the synthetic and the real image-sequences. The implicit perceptual discrimination task (b) consists of a comparison between visual recognition of speech from real and synthetic image-sequences. Our results suggest that implicit perceptual discrimination is a more sensitive method for discriminating between synthetic and real image-sequences than explicit perceptual discrimination.
Abstract:
This article continues the results presented in Camargo and Nardi (Revista Brasileira de Ensino de Física 29, 117 (2007)). It is part of a study that seeks to understand the main barriers to the inclusion of students with visual impairment in the context of physics teaching. Focusing on optics classes, it analyzes the communication difficulties between pre-service teachers and students with visual impairment. To this end, it emphasizes the empirical and semantic-sensory structures of the languages used, indicating factors that create accessibility difficulties in the information conveyed. It also recommends alternatives intended to enable the effective participation of students with visual impairment in the communicative process, notably: identifying the semantic-sensory structure of the meanings conveyed, knowing the student's visual history, dismantling the interdependent audiovisual empirical structure, and exploring the communicative potential of languages built on empirical structures that can be accessed independently of vision. It concludes that communication is the main barrier to the effective participation of students with visual impairment in optics classes and stresses the importance of creating adequate communication channels as a basic condition for the inclusion of these students.
Abstract:
BACKGROUND: computerized auditory-visual remediation program in students with developmental dyslexia. AIM: to verify the efficacy of a computerized auditory-visual remediation program in students with developmental dyslexia. Among the specific objectives, the study aimed to compare the cognitive-linguistic performance of students with developmental dyslexia with that of good readers; to compare the findings of the pre- and post-testing assessment procedures in students with dyslexia who were and were not submitted to the program; and, finally, to compare the findings of the remediation program in students with dyslexia and good readers submitted to the remediation program. METHOD: 20 students participated in this study. Group I (GI) was subdivided into GIe, composed of five students with developmental dyslexia submitted to the program, and GIc, composed of five students with developmental dyslexia not submitted to the program. Group II (GII) was subdivided into GIIe, composed of five good readers submitted to the remediation, and GIIc, composed of five good readers not submitted to the remediation. The computerized auditory-visual remediation program Play-on was used. RESULTS: the results of this study revealed that GI performed worse than GII in auditory processing and phonological awareness skills at pre-testing. However, GIe performed similarly to GII at post-testing, evidencing the efficacy of auditory-visual remediation in students with developmental dyslexia. CONCLUSION: the study evidenced the efficacy of the auditory-visual remediation program in students with developmental dyslexia.
Abstract:
Autonomous robots must be able to learn and maintain models of their environments. In this context, the present work considers techniques for classifying and extracting features from images, combined with artificial neural networks, for use in the mapping and localization system of the mobile robot of the Laboratory of Automation and Evolutive Computer (LACE). To do this, the robot uses a sensory system composed of ultrasound sensors and a catadioptric vision system formed by a camera and a conical mirror. The mapping system is composed of three modules. Two of them are presented in this paper: the classifier and the characterizer modules. The first module uses a hierarchical neural network to perform the classification; the second uses techniques for extracting image attributes and recognizing invariant patterns extracted from the set of place images. The neural network of the classifier module is structured in two layers, reason and intuition, and is trained to classify each place explored by the robot into one of four predefined classes. The final result of the exploration is a topological map of the explored environment. Simulation results for both modules of the mapping system are presented in this paper. © 2008 IEEE.
Abstract:
Speech recognition and synthesis systems consist of language-dependent modules and, while many public resources exist for some languages (e.g., English and Japanese), resources for Brazilian Portuguese (BP) are still scarce. Another aspect is that, for a large number of tasks, the error rate of current speech recognition systems is still high compared with that achieved by humans. Thus, despite the success of hidden Markov models (HMMs), research into new methods is needed. This work is motivated by these two facts and is divided into two parts. The first describes the development of free resources and tools for speech recognition and synthesis in BP, consisting of audio and text databases, a phonetic dictionary, a grapheme-to-phone converter, a syllabifier, and acoustic and language models. All the resources built are publicly available and, together with a proposed programming interface, have been used to develop several new real-time applications, including a speech recognition module for the OpenOffice.org office suite. Performance tests of the developed systems are presented. The resources produced and made available here facilitate the adoption of speech technology for BP by other research groups, developers and industry. The second part of the work presents a new method for rescoring the output of HMM-based recognition, which is organized in a lattice data structure. More specifically, the system uses discriminative classifiers that seek to reduce the confusion between pairs of phones. For each of these binary problems, automatic parameter selection techniques are used to choose the most suitable parametric representation for the problem at hand.
Abstract:
Given the widespread use of computers, the visual pattern recognition task has been automated in order to handle the huge number of digital images available. Many applications use image processing techniques as well as feature extraction and visual pattern recognition algorithms to identify people, ease the disease diagnosis process, classify objects, and so on, based on digital images. Among the features that can be extracted and analyzed from images is the shape of objects or regions. In some cases, shape is the only feature that can be extracted from the image with relatively high accuracy. In this work we present some of the most important shape analysis methods and compare their performance when applied to three well-known shape image databases. Finally, we propose the development of a new shape descriptor based on the Hough Transform.
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Abstract:
Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, standard text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, “video words” based on low-level color features (color moments, color correlogram and color wavelet), and “audio words” based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94%, corresponding to 50% - 84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio and video.
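The word-analogue representation described in this abstract might look something like the sketch below. It assumes that each low-level feature frame is mapped to the nearest entry of a pre-built codebook (plain vector quantization) and that the per-modality word frequencies are simply concatenated. The abstract does not say how the audio and video words were actually derived or combined, so both steps, and all function names, are illustrative assumptions. The SVM classifier itself is left out.

```python
from collections import Counter

def nearest_codeword(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def bag_of_words(frames, codebook):
    """Map each feature frame to a 'word' index and return relative word frequencies."""
    counts = Counter(nearest_codeword(f, codebook) for f in frames)
    return [counts.get(i, 0) / len(frames) for i in range(len(codebook))]

def combine(*histograms):
    """Concatenate per-modality histograms into one document vector (an assumption)."""
    return [value for hist in histograms for value in hist]
```

A document vector built this way has one dimension per word in each modality's vocabulary, so it can be passed to any classifier that accepts fixed-length feature vectors.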
Abstract:
OBJECTIVE To analyze speech reading through Internet video calls by profoundly hearing-impaired individuals and cochlear implant (CI) users. METHODS Speech reading skills of 14 deaf adults and 21 CI users were assessed using the Hochmair Schulz Moser (HSM) sentence test. We presented video simulations using different video resolutions (1280 × 720, 640 × 480, 320 × 240, 160 × 120 px), frame rates (30, 20, 10, 7, 5 frames per second (fps)), speech velocities (three different speakers), webcameras (Logitech Pro9000, C600 and C500) and image/sound delays (0-500 ms). All video simulations were presented with and without sound and in two screen sizes. Additionally, scores for live Skype™ video connection and live face-to-face communication were assessed. RESULTS Higher frame rate (>7 fps), higher camera resolution (>640 × 480 px) and shorter picture/sound delay (<100 ms) were associated with increased speech perception scores. Scores were strongly dependent on the speaker but were not influenced by physical properties of the camera optics or the full screen mode. There is a significant median gain of +8.5%pts (p = 0.009) in speech perception for all 21 CI-users if visual cues are additionally shown. CI users with poor open set speech perception scores (n = 11) showed the greatest benefit under combined audio-visual presentation (median speech perception +11.8%pts, p = 0.032). CONCLUSION Webcameras have the potential to improve telecommunication of hearing-impaired individuals.
Abstract:
Since the explosive growth of the Internet that began in the 1990s, a variety of tools have been created and made available to users for sharing information and services in different ways, from the birth of the first browser to the present day, when a wealth of languages is available for the web. In this growth phase, tools first appeared that allowed individual users to build their own personal websites and publish their content. Later the "community" phenomenon emerged with, for example, forums and websites where multiple users enjoyed the content or services that the site offered. This growth of the community web has advanced in many fields, including, of course, education, giving rise to platforms such as the one on which the project presented here is based, a basic and by now practically indispensable tool in university teaching: Moodle. Moodle is a tool designed for sharing resources and designing activities for the potential user, complementing classroom learning or even serving as an autonomous path to learning in itself. A study was carried out on the state of the content published in Moodle, and it found that the great majority of the courses that can be visited have a large number of shortcomings. On the one hand, few have original material produced exclusively for the course and, where they do have original material, no particular attention has been paid to layout. On the other hand, many others have no original material, and in both cases no course was found that offers audiovisual material made exclusively for the course; some instead present audiovisual material found on the web (YouTube, etc.).
In view of these facts, a project was carried out that attempts to provide solutions to these shortcomings. It presents a course whose textual part draws on various bibliographic references, together with original, previously unpublished audiovisual material that was also produced specifically for this course. This material consists, on the one hand, of video, which was viewed, edited and subtitled with free software, and, on the other, of audio, which complements a comprehensive glossary added as an extra to the course and whose approach was not found in any of the online courses reviewed. All of this is wrapped in a careful layout that resulted from studying the web languages HTML and CSS, so that, on the one hand, the course is a pleasant place to learn on the Internet and, on the other, certain operations could be carried out that would have been impossible without this knowledge, such as building the glossary or embedding images and videos. In addition, the entire project report has been given a didactic approach, so that it can be useful to a future user who wishes to explore the uses of Moodle in greater depth, get started with web languages, or get started in video editing.
Abstract:
MFCC coefficients extracted from the power spectral density of speech as a whole seem to have become the de facto standard in speaker recognition, as demonstrated by their use in almost all systems submitted to the 2013 Speaker Recognition Evaluation (SRE) in Mobile Environment [1], thus relegating this component of recognition systems to the background. However, in this article we show that selecting an adequate speaker characterization system is as important as selecting the classifier. To this end, we compare the recognition rates achieved by different recognition systems that rely on the same classifier (GMM-UBM) but are connected to different feature extraction systems (based on both classical and biometric parameters). As a result, we show that a gender-dependent biometric parameterization with a simple recognition system based on the GMM-UBM paradigm provides very competitive or even better recognition rates when compared with more complex classification systems based on classical features.
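For readers unfamiliar with the GMM-UBM classifier mentioned in this abstract, the scoring step can be sketched as an average log-likelihood ratio between a speaker model and a universal background model (UBM). This is a generic illustration under stated assumptions, not the system evaluated in the article: both models are supplied directly as lists of hypothetical (weight, mean, variance) components with diagonal covariance, the feature frames are assumed to be already extracted, and the adaptation of the speaker model from the UBM is omitted.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as (weight, mean, var) components."""
    logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
    peak = max(logs)
    # Log-sum-exp keeps the sum stable when the densities are very small.
    return peak + math.log(sum(math.exp(value - peak) for value in logs))

def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio between the speaker model and the UBM."""
    total = sum(gmm_loglik(f, speaker_gmm) - gmm_loglik(f, ubm) for f in frames)
    return total / len(frames)
```

A positive score favours the claimed speaker and a negative one favours the background population. Because only the feature frames change between front ends, a scorer of this shape stays fixed while different parameterizations are compared, which is the kind of comparison the abstract describes.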