945 results for Audio-visual speaker recognition


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, standard text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, “video words” based on low-level color features (color moments, color correlogram, and color wavelet), and “audio words” based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents, following the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94%, corresponding to 50% - 84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio, and video.
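
The bag-of-words representation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token names, the tiny vocabularies, and the helper functions `bag_of_words` and `document_vector` are hypothetical, and the support vector machine stage is left out.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return the relative frequency of each vocabulary entry in tokens."""
    counts = Counter(tokens)
    total = sum(counts[word] for word in vocabulary)
    if total == 0:
        # No known word analogue occurred in this modality.
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]


def document_vector(modalities, vocabularies):
    """Concatenate per-modality frequency vectors into one feature vector."""
    vector = []
    for name in sorted(vocabularies):  # fixed order: audio, speech, video
        vector.extend(bag_of_words(modalities.get(name, []), vocabularies[name]))
    return vector


# Hypothetical word analogues: syllables, "audio words", "video words".
vocabularies = {
    "speech": ["ta", "ges", "schau"],
    "audio": ["a1", "a2"],
    "video": ["v1", "v2"],
}
document = {
    "speech": ["ta", "ges", "schau", "ta"],
    "audio": ["a1", "a1"],
    "video": ["v2"],
}
features = document_vector(document, vocabularies)
```

Because every modality is reduced to the same kind of frequency vector, the concatenated `features` list could be passed to any standard classifier, which is the uniformity the abstract refers to.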

Relevance:

100.00% 100.00%

Publisher:

Abstract:

OBJECTIVE To analyze speech reading through Internet video calls by profoundly hearing-impaired individuals and cochlear implant (CI) users. METHODS Speech reading skills of 14 deaf adults and 21 CI users were assessed using the Hochmair Schulz Moser (HSM) sentence test. We presented video simulations using different video resolutions (1280 × 720, 640 × 480, 320 × 240, 160 × 120 px), frame rates (30, 20, 10, 7, 5 frames per second (fps)), speech velocities (three different speakers), webcameras (Logitech Pro9000, C600 and C500) and image/sound delays (0-500 ms). All video simulations were presented with and without sound and in two screen sizes. Additionally, scores for live Skype™ video connection and live face-to-face communication were assessed. RESULTS Higher frame rate (>7 fps), higher camera resolution (>640 × 480 px) and shorter picture/sound delay (<100 ms) were associated with increased speech perception scores. Scores were strongly dependent on the speaker but were not influenced by physical properties of the camera optics or the full screen mode. There was a significant median gain of +8.5%pts (p = 0.009) in speech perception for all 21 CI users when visual cues were additionally shown. CI users with poor open set speech perception scores (n = 11) showed the greatest benefit under combined audio-visual presentation (median speech perception +11.8%pts, p = 0.032). CONCLUSION Webcameras have the potential to improve telecommunication of hearing-impaired individuals.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper predicts speech synthesis, speech recognition, and speaker recognition technology for the year 2001, and it describes the most important research problems to be solved in order to arrive at these ultimate synthesis and recognition systems. The problems for speech synthesis include natural and intelligible voice production, prosody control based on meaning, capability of controlling synthesized voice quality and choosing individual speaking style, multilingual and multidialectal synthesis, choice of application-oriented speaking styles, capability of adding emotion, and synthesis from concepts. The problems for speech recognition include robust recognition against speech variations, adaptation/normalization to variations due to environmental conditions and speakers, automatic knowledge acquisition for acoustic and linguistic modeling, spontaneous speech recognition, naturalness and ease of human-machine interaction, and recognition of emotion. The problems for speaker recognition are similar to those for speech recognition. The research topics related to all these techniques include the use of articulatory and perceptual constraints and evaluation methods for measuring the quality of technology and systems.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Cognitive theories have shown that human thought is embodied; that is, we access reality through our senses and cannot escape them. To understand and handle abstract concepts, we use metaphorical projections based on bodily sensations. Hence the ubiquity of metaphor in everyday language. Although this claim has been widely tested through analysis of the verbal corpus in various languages, research on the audiovisual corpus is scarce. If primary metaphors are part of our cognitive unconscious, are inherent to human beings, and are a consequence of the nature of the brain, they must also generate visual metaphors. This article analyzes and discusses a series of examples to test this.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Bibliography: p. 41.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this paper, we consider the task of recognizing epigraphs in images such as photos taken using mobile devices. Given a set of 17,155 photos related to 14,560 epigraphs, we used a k-nearest-neighbor approach to perform the recognition. The contribution of this work is in evaluating state-of-the-art visual object recognition techniques in this specific context. The experimental results show that the Vector of Locally Aggregated Descriptors (VLAD), obtained by aggregating SIFT descriptors, is the best choice for this task.
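
The k-nearest-neighbor step mentioned in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's system: the two-dimensional vectors stand in for precomputed global image descriptors (such as VLAD vectors), and the labels, the toy database, and the function name `knn_classify` are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, database, k=3):
    """Label a query descriptor by majority vote among its k nearest neighbors.

    database is a list of (descriptor, label) pairs; Euclidean distance is
    assumed here as the similarity measure.
    """
    neighbors = sorted(database, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy stand-ins for one global descriptor per reference photo.
database = [
    ((0.0, 0.1), "epigraph_A"),
    ((0.1, 0.0), "epigraph_A"),
    ((0.9, 1.0), "epigraph_B"),
    ((1.0, 0.9), "epigraph_B"),
]

predicted = knn_classify((0.05, 0.05), database)
```

With a real collection, the descriptors would have far more dimensions and an approximate nearest-neighbor index would likely replace the exhaustive sort used in this sketch.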

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The production and perception of music is a multimodal activity involving auditory, visual and conceptual processing, integrating these with prior knowledge and environmental experience. Musicians utilise expressive physical nuances to highlight salient features of the score. The question arises within the literature as to whether performers’ non-technical, non-sound-producing movements may be communicatively meaningful and convey important structural information to audience members and co-performers. In the light of previous performance research (Vines et al., 2006, Wanderley, 2002, Davidson, 1993), and considering findings within co-speech gestural research and auditory and audio-visual neuroscience, this thesis examines the nature of those movements not directly necessary for the production of sound, and their particular influence on audience perception. Within the current research, 3D performance analysis is conducted using the Vicon 12-camera system and Nexus data-processing software. Performance gestures are identified as repeated patterns of motion relating to music structure, which not only express phrasing and structural hierarchy but are consistently and accurately interpreted as such by a perceiving audience. Gestural characteristics are analysed across performers and performance style using two Chopin preludes selected for their diverse yet comparable structures (Opus 28:7 and 6). Effects on perceptual judgements of presentation modes (visual-only, auditory-only, audiovisual, full- and point-light) and viewing conditions are explored. This thesis argues that while performance style is highly idiosyncratic, piano performers reliably generate structural gestures through repeated patterns of upper-body movement. The shapes and locations of phrasing motions are identified particular to the sample of performers investigated.
Findings demonstrate that despite the personalised nature of the gestures, performers use increased velocity of movements to emphasise musical structure and that observers accurately and consistently locate phrasing junctures where these patterns and variation in motion magnitude, shape and velocity occur. By viewing performance motions in polar (spherical) rather than Cartesian coordinate space it is possible to get mathematically closer to the movement generated by each of the nine performers, revealing distinct patterns of motion relating to phrasing structures, regardless of intended performance style. These patterns are highly individualised both to each performer and performed piece. Instantaneous velocity analysis indicates a right-directed bias of performance motion variation at salient structural features within individual performances. Perceptual analyses demonstrate that audience members are able to accurately and effectively detect phrasing structure from performance motion alone. This ability persists even for degraded point-light performances, where all extraneous environmental information has been removed. The relative contributions of audio, visual and audiovisual judgements demonstrate that the visual component of a performance does positively impact on the overall accuracy of phrasing judgements, indicating that receivers are most effective in their recognition of structural segmentations when they can both see and hear a performance. Observers appear to make use of rapid online judgement heuristics, adjusting response processes quickly to adapt and perform accurately across multiple modes of presentation and performance style. In line with existent theories within the literature, it is proposed that this processing ability may be related to cognitive and perceptual interpretation of syntax within gestural communication during social interaction and speech.
Findings of this research may have future impact on performance pedagogy, computational analysis and performance research, as well as potentially influencing future investigations of the cognitive aspects of musical and gestural understanding.
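
The change from Cartesian to polar (spherical) coordinates mentioned in this abstract can be sketched as follows. This is a generic conversion, not the thesis's analysis pipeline: the function name `cartesian_to_spherical`, the choice of the z axis as the polar axis, and the sample marker position are assumptions made for illustration.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Convert a 3D point to (radius, polar angle, azimuth), angles in radians.

    The polar angle is measured from the positive z axis and the azimuth from
    the positive x axis within the x-y plane (an assumed convention).
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # The angles are undefined at the origin; return zeros by convention.
        return (0.0, 0.0, 0.0)
    polar = math.acos(z / r)
    azimuth = math.atan2(y, x)
    return (r, polar, azimuth)


# Hypothetical marker position from a motion-capture frame (arbitrary units).
radius, polar, azimuth = cartesian_to_spherical(1.0, 1.0, 0.0)
```

In this form, a movement that sweeps around a fixed point shows up as a change in the two angles while the radius stays roughly constant, which is one reason rotational upper-body motion can be easier to describe in spherical terms.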

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This thesis sets out to re-conceptualize our new audio-visual environment and our experience of it. In the digital era, with the widespread dissemination of moving images, we delimit a category of images that we regard as the most likely to have an impact on human development. We call them synchrono-photo-temporalized image-sounds (images-sons synchrono-photo-temporalisées). More specifically, we seek to bring to light their power to affect and control by demonstrating that they have a definite influence on the process of individuation, an influence greatly facilitated by the structural isotopy between the flow of consciousness and their own flow. Drawing on the research of Bernard Stiegler, we also note the important role that attention and memory play in the process of individuation. Taken as a whole, our reflection shows the extent to which the current Quebec education system fails in its task of civic education by not providing adequate teaching about moving images.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The aim of this study was to compare the learning process of a highly complex ballet skill following demonstrations of point-light and video models. Sixteen participants, divided into point-light and video groups (ns = 8), performed 160 trials of a pirouette, equally distributed in blocks of 20 trials, alternating periods of demonstration and practice, with a retention test a day later. Measures of head and trunk oscillation, coordination disparity from the model, and movement time difference showed similarities between the video and point-light groups; ballet experts' evaluations indicated superiority of performance in the video group over the point-light group. Results are discussed in terms of the task requirements of dissociation between head and trunk rotations, focusing on the hypothesis of sufficiency and higher relevance of information contained in biological motion models applied to the learning of complex motor skills.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Previous studies have shown that multiple-birth children (MBC) are prone to early phonological difficulties and later literacy problems. However, to date, there has been no systematic long-term follow-up of MBC with phonological difficulties in the preschool years to determine whether these difficulties predict later literacy problems. In this study, 20 MBC whose early speech and language skills had been previously documented were compared to normative data and 20 singleton controls on tasks assessing phonological processing and literacy. The major findings indicated that MBC performed significantly more poorly on some tasks of phonological processing than singleton controls did. Further, the early phonological skills of MBC (i.e., the number of inappropriate phonological processes used) correlated with poor performance on visual rhyme recognition, word repetition, and phoneme detection tasks 5 years later. There was no significant relationship between early biological factors (birth weight and gestation period) and performance on the phonological processing and literacy-related subtests. These results support the hypothesis that MBC's early speech and language difficulties are not merely a transient phase of development, but a real disorder, with consequences for later academic achievement.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In Experiment 1, color-naming interference for target stimuli following associated primes was greater in a group making a lexical decision to the prime than in a group reading the prime silently. High-frequency targets were responded to more quickly than low-frequency targets. In Experiment 2, with subjects naming the prime, there was evidence of associative interference when the prime and the target were grouped temporally but not when the intertrial interval was comparable with the prime-target interval. Associative primes presented at a short (120-msec) prime-target stimulus onset asynchrony facilitated color naming in Experiment 3. Taken together, the results suggest that the effect of faster processing of the base word in a color-naming task is facilitatory and that color-naming priming interference arises when associative prime processing increases conflict between word and color responses by enhancing phonological or articulatory activation of the base word.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

University students spelled low-frequency words to dictation and subsequently made lexical decisions to them. In Experiment 1, lexical decisions were slower on words students had spelled incorrectly relative to words they had spelled correctly, and there was a larger repetition benefit for incorrectly spelled words. In Experiment 2, the latency advantage for items spelled correctly was replicated when words were presented for only 200 ms and also in a spelling recognition task. In Experiment 3, masked identity and form priming effects were similar for words that had been spelled correctly and incorrectly. Item spelling accuracy tracked word frequency effects in the way that it combined with repetition and priming effects. We infer that an individual's learning with a word's orthography underlies word frequency and item spelling accuracy effects and that a single orthographic lexicon serves visual word recognition and spelling. (C) 2000 Elsevier Science (USA).

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This research investigates the relationship between the repertoires of collective action adopted by social movement organizations and the effectiveness of the participatory institutions (IPs) that deal with communications policy in Brazil, namely the Conselho de Comunicação Social do Congresso Nacional (CCS) and the 1ª Conferência Nacional de Comunicação (ConfeCom). The discussion centers on the actions carried out by Coletivo Intervozes, a civil society organization active in social movements for the right to communication and its democratization. In this context, emphasis is given to actions for a new legal and regulatory framework for communications, regarded as a result of the effectiveness problems observed in the CCS and ConfeCom. The work is divided into four chapters. The first focuses on Coletivo Intervozes: its history, form of organization, and main lines of activity and actions. The second, essentially theoretical, emphasizes the conceptual definitions surrounding social movements and institutional change. Chapter 3 is devoted to analyzing the effectiveness problems in the IPs concerned with communications and their relationship with repertoires of collective action. The analysis variables are civil society access/representation and the functions assigned to the IPs. The last chapter analyzes the characteristics of the social movement that calls for a new legal and regulatory framework for communications and that emerged as an alternative course of action to the IPs in defense of institutional change for the sector. As this is qualitative research, the analyses were based on semi-structured interviews with members of Coletivo Intervozes and specialists in the field, on public documents produced by the organization, and on bibliographic, audiovisual, and audio data relating to the CCS and ConfeCom.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Dissertation submitted to the Escola Superior de Comunicação Social in partial fulfilment of the requirements for the degree of Master in Audiovisual and Multimedia.