904 results for audio-visual automatic speech recognition
Abstract:
The Colloquium on Human-Machine Communication by Voice highlighted the global technical community's focus on the problems and promise of voice-processing technology, particularly speech recognition and speech synthesis. Clearly, there are many areas in both the research and development of these technologies that can be advanced significantly. However, it is also true that many applications of these technologies are capable of commercialization now. Early successful commercialization of new technology is vital to ensure continuing interest in its development. This paper addresses efforts to commercialize speech technologies in two markets: telecommunications and aids for the handicapped.
Abstract:
This paper describes a range of opportunities for military and government applications of human-machine communication by voice, based on visits and contacts with numerous user organizations in the United States. The applications include some that appear to be feasible by careful integration of current state-of-the-art technology and others that will require a varying mix of advances in speech technology and in integration of the technology into applications environments. Applications that are described include (1) speech recognition and synthesis for mobile command and control; (2) speech processing for a portable multifunction soldier's computer; (3) speech- and language-based technology for naval combat team tactical training; (4) speech technology for command and control on a carrier flight deck; (5) control of auxiliary systems, and alert and warning generation, in fighter aircraft and helicopters; and (6) voice check-in, report entry, and communication for law enforcement agents or special forces. A phased approach for transfer of the technology into applications is advocated, where integration of applications systems is pursued in parallel with advanced research to meet future needs.
Abstract:
The deployment of systems for human-to-machine communication by voice requires overcoming a variety of obstacles that affect the speech-processing technologies. Problems encountered in the field might include variation in speaking style, acoustic noise, ambiguity of language, or confusion on the part of the speaker. The diversity of these practical problems encountered in the "real world" leads to the perceived gap between laboratory and "real-world" performance. To answer the question "What applications can speech technology support today?" the concept of the "degree of difficulty" of an application is introduced. The degree of difficulty depends not only on the demands placed on the speech recognition and speech synthesis technologies but also on the expectations of the user of the system. Experience has shown that deployment of effective speech communication systems requires an iterative process. This paper discusses general deployment principles, which are illustrated by several examples of human-machine communication systems.
Abstract:
This paper describes the state of the art in applications of voice-processing technologies. The first part describes technologies for implementing speech recognition and synthesis algorithms. Hardware technologies such as microprocessors and DSPs (digital signal processors) are discussed. The software development environment, a key technology for developing application software ranging from DSP software to support software, is also described. The second part discusses the state of the art of algorithms from the standpoint of applications. Several issues concerning the evaluation of speech recognition/synthesis algorithms are covered, as well as the robustness of algorithms under adverse conditions.
Abstract:
This talk, which was the keynote address of the NAS Colloquium on Human-Machine Communication by Voice, discusses the past, present, and future of human-machine communications, especially speech recognition and speech synthesis. Progress in these technologies is reviewed in the context of the general progress in computer and communications technologies.
Abstract:
Hearing loss in the elderly leads to difficulty in speech perception. The test commonly used in speech audiometry is the determination of the maximum speech recognition index (IR-Max) at a single speech presentation level. However, the more appropriate procedure would be to test at several levels, since the score depends on the speech level at the time of testing and is related to the degree and configuration of the hearing loss. An inaccurate IR-Max may lead to an erroneous diagnostic hypothesis and to failure of the hearing-loss intervention. Objective: To verify the effect of speech presentation level on the speech recognition test in elderly people with sensorineural hearing loss of different audiometric configurations. Methods: 64 elderly participants, 120 ears (61 female and 59 male), aged 60 to 88 years, were divided into groups: G1, 23 ears with a flat (horizontal) configuration; G2, 55 ears with a sloping configuration; G3, 42 ears with an abrupt configuration. Inclusion criteria were mild to severe sensorineural hearing loss, no use of a hearing aid (AASI) or use for less than two months, and absence of cognitive impairment. The following procedures were performed: speech recognition threshold (LRF), speech recognition index (IRF) at several levels, and most comfortable level (MCL) and uncomfortable level (UCL) for speech. Lists of 11 monosyllables were used to shorten the test. Statistical analysis used analysis of variance (ANOVA) and Tukey's test. Results: The sloping configuration was the most frequent. Individuals with a flat configuration had the highest mean speech recognition score.
Of all individuals assessed, the IR-Max was found at the MCL in 27.27% of those with a flat configuration, 38.18% with a sloping configuration, and 26.19% with an abrupt configuration. The IR-Max was found at the UCL in 40.90% of those with a flat configuration, 45.45% with a sloping configuration, and 28.20% with an abrupt configuration. The highest and lowest mean scores, respectively, were found at: G1, 30 and 40 dB SL; G2, 50 and 10 dB SL; G3, 45 and 10 dB SL. No single speech level suits all audiometric configurations; however, the sensation levels yielding the highest mean scores were: G1, 20 to 30 dB SL; G2, 20 to 50 dB SL; G3, 45 dB SL. The MCL and the UCL minus 5 dB for speech were not effective in determining the IR-Max. Conclusions: Presentation level influenced monosyllable recognition performance in elderly people with sensorineural hearing loss across all audiometric configurations. Moderate hearing loss and the sloping configuration were the most frequent in this population, followed by the abrupt and flat configurations.
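The argument above is that the IR-Max is the peak of the score-versus-level curve, so it can only be found reliably by testing at several levels rather than one. A minimal sketch of that bookkeeping, assuming hypothetical function names and made-up scores; it converts words correct on an 11-monosyllable list to a percentage and picks the level with the best score. Breaking ties toward the lowest level is our own assumption, not the study's rule.

```python
def percent_correct(words_correct: int, list_length: int = 11) -> float:
    """Score for one list; with 11 words, each word is worth about 9.09%."""
    return 100.0 * words_correct / list_length


def ir_max(scores_by_level: dict) -> tuple:
    """Return (level in dB SL, percent correct) at the maximum score.

    scores_by_level maps a presentation level (dB SL) to words correct.
    Ties are resolved toward the lowest level (an assumption for this sketch).
    """
    level = max(scores_by_level, key=lambda lvl: (scores_by_level[lvl], -lvl))
    return level, percent_correct(scores_by_level[level])


# Hypothetical scores for one ear tested at four sensation levels.
print(ir_max({10: 4, 20: 8, 30: 10, 40: 9}))
```

Testing at a single level (say, only 20 dB SL here) would report 8 of 11 words and miss the higher score reached at 30 dB SL.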
Abstract:
Television today has undergone numerous technological innovations in multimedia transmission, audio-visual quality, and range of features. Nevertheless, it essentially retains its defining characteristic of delivering information to the public almost instantaneously. The current digital-television environment is characterized by the coexistence of numerous devices capable of offering a television experience, including personal computers, smartphones, tablets, and other consumer electronics. This scenario also includes a variety of data transport networks, such as broadcast, satellite, cable, and broadband networks. This scenario, diverse in both devices and networks, is called hybrid digital television, in which the viewer's interaction with the various devices stands out. These scenarios, in turn, motivate the development of technologies that improve pervasiveness and the means by which applications can be supported on different platforms. This work proposes interoperable environments involving interactive digital television and other consumer electronics, in which studies and experiments were carried out to observe different synchronization and communication techniques among interactivity platforms for hybrid digital television. The results point to the feasibility of interoperable scenarios using markers as well as TCP/IP network resources and services, taking into account the efficiency and effectiveness of the different methods. We conclude that the results can motivate the development of new scenarios involving interactive digital television and second-screen devices, which increases interactivity and forms of entertainment.
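To make "markers over TCP/IP" concrete, here is a minimal sketch, not the authors' implementation: the TV (primary screen) sends newline-delimited JSON markers carrying an event id and a media time, and the second-screen app reads them from a stream socket. The message format, field names (`event`, `t_ms`), and event names are hypothetical. For a self-contained demo, `socket.socketpair()` stands in for the connection; a real deployment would open a TCP connection, for example with `socket.create_connection`.

```python
import json
import socket


def send_marker(sock: socket.socket, event_id: str, media_time_ms: int) -> None:
    """Primary screen (TV): send one newline-delimited JSON marker."""
    msg = json.dumps({"event": event_id, "t_ms": media_time_ms}) + "\n"
    sock.sendall(msg.encode("utf-8"))


def recv_markers(sock: socket.socket, count: int) -> list:
    """Second screen: read `count` markers, buffering partial lines.

    Bytes received after the last requested marker are discarded,
    which is acceptable for this sketch but not for a long-lived session.
    """
    buf = b""
    markers = []
    while len(markers) < count:
        # Consume complete lines already in the buffer before reading more.
        while b"\n" in buf and len(markers) < count:
            line, buf = buf.split(b"\n", 1)
            markers.append(json.loads(line))
        if len(markers) >= count:
            break
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        buf += chunk
    return markers


if __name__ == "__main__":
    tv, companion = socket.socketpair()  # stand-in for a TCP connection
    send_marker(tv, "quiz-start", 125000)  # hypothetical interactive event
    send_marker(tv, "quiz-end", 185000)
    print(recv_markers(companion, 2))
    tv.close()
    companion.close()
```

Newline framing is used because a stream socket does not preserve message boundaries; the receiver must reassemble markers that arrive split or merged.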
Abstract:
From a gender perspective, protective policies and public-awareness campaigns on work-family balance should promote the sharing of responsibilities between the sexes. Alongside political action and specific measures, the equal-opportunities project needs a long-term strategy based on education for equality. This article sets out the methodology of a study based on these premises. It presents and explains the protocol used to analyze the audio-visual advertising campaigns on work-family reconciliation broadcast by the Woman's Institute. The evaluation of these actions focuses on their effectiveness from the point of view of the mass media. The article provides some data illustrating the proposed study and, finally, discusses the limitations of the available information sources.
Abstract:
In this paper we introduce a probabilistic approach to support visual supervision and gesture recognition. Task knowledge is both geometric and visual in nature, and it is encoded in parametric eigenspaces. Learning processes that compute modal subspaces (eigenspaces) are at the core of tracking and recognizing gestures and tasks. We describe the overall architecture of the system and detail the learning processes and gesture design. Finally, we show experimental results of tracking and recognition in block-world-like assembly tasks and in general human gestures.
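As a rough illustration of what "learning an eigenspace" means, the sketch below computes a low-dimensional subspace from training views with a plain PCA (via SVD) and recognizes a new observation by the nearest class prototype in that subspace. This is a simplified stand-in, not the paper's probabilistic or parametric formulation; the function names, the synthetic data, and the gesture labels are hypothetical.

```python
import numpy as np


def learn_eigenspace(X: np.ndarray, k: int):
    """Return the mean view and the top-k principal directions of X (n x d)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def project(x: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Coordinates of x in the k-dimensional eigenspace."""
    return basis @ (x - mean)


def learn_prototypes(X, labels, mean, basis) -> dict:
    """One prototype per label: the projected mean of that label's views."""
    protos = {}
    for lbl in set(labels):
        rows = X[[i for i, l in enumerate(labels) if l == lbl]]
        protos[lbl] = project(rows.mean(axis=0), mean, basis)
    return protos


def recognize(x, mean, basis, prototypes) -> str:
    """Nearest-prototype classification in the eigenspace."""
    z = project(x, mean, basis)
    return min(prototypes, key=lambda lbl: float(np.linalg.norm(z - prototypes[lbl])))


# Synthetic 16-dimensional "views" of two hypothetical gestures.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.1, (20, 16)); a[:, 0] += 5.0
b = rng.normal(0.0, 0.1, (20, 16)); b[:, 0] -= 5.0
X = np.vstack([a, b])
labels = ["grasp"] * 20 + ["release"] * 20
mean, basis = learn_eigenspace(X, 3)
protos = learn_prototypes(X, labels, mean, basis)
probe = np.zeros(16); probe[0] = 4.8
print(recognize(probe, mean, basis, protos))
```

Matching in the k-dimensional eigenspace rather than in the raw view space is what keeps tracking and recognition cheap once the subspace has been learned.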
Abstract:
Purpose: Unilateral loss of the posterior visual cortex causes cortical blindness contralateral to the lesion, called homonymous hemianopia (HH). It is notably accompanied by visual exploration problems in the blind hemifield due to deficient oculomotor strategies, which have been the target of compensation therapies. However, this loss of vision can be accompanied by unconscious visual perception, called blindsight. We hypothesize that blindsight is mediated by the extrastriate retino-collicular pathway, which recruits the superior colliculus (SC), a multisensory structure. Our program aims to assess the impact of multisensory (audiovisual) training on the unconscious visual performance and oculomotor strategies of people with hemianopia. We thus seek to demonstrate the involvement of the SC in blindsight and the relevance of multisensory compensation as a rehabilitation therapy. Method: Our participant, ML, who has right HH, completed audiovisual integration training over a period of 10 days. We assessed visual performance in localization and detection, as well as oculomotor strategies, through three main comparisons: (1) normal versus blind hemifield; (2) visual versus audiovisual conditions; (3) pre-training, post-training, and 3-month post-training sessions. Results: We showed that (1) saccade and fixation characteristics are deficient in the blind hemifield; (2) saccadic strategies differ with eccentricity and stimulation condition; (3) long-term saccadic adaptation is possible in the blind hemifield if the appropriate frame of reference is considered; (4) the improvement in eye movements is linked to blindsight.
Conclusion(s): Multisensory training improves visual performance for unperceived targets, in both localization and detection, possibly as a result of improved oculomotor performance.
Abstract:
This thesis contributes to research toward artificial intelligence using connectionist methods. Recurrent neural networks are an increasingly popular family of sequential models that are, in principle, capable of learning arbitrary algorithms. These models perform deep learning, a type of machine learning. Their generality and empirical success make them an interesting research subject and a promising tool for building more general artificial intelligence. The first chapter of this thesis gives a brief overview of the background topics: artificial intelligence, machine learning, deep learning, and recurrent neural networks. The next three chapters cover these topics in increasingly specific detail. Finally, we present some contributions to recurrent neural networks. Chapter \ref{arxiv1} presents our work on regularizing recurrent neural networks. Regularization aims to improve a model's ability to generalize and plays a key role in the performance of several applications of recurrent neural networks, particularly speech recognition. Our approach achieves state-of-the-art results on TIMIT, a standard benchmark for this task. Chapter \ref{cpgp} presents a second, ongoing line of work that explores a new architecture for recurrent neural networks. Recurrent neural networks maintain a hidden state that represents their past observations. The idea of this work is to encode certain abstract dynamics in the hidden state, giving the network a natural way to encode coherent trends in the state of its environment. Our work builds on an existing model; we describe that model and our contributions, including a preliminary experiment.
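The phrase "a hidden state that represents past observations" refers to the recurrence at the heart of every RNN. A minimal sketch of a generic (Elman-style) update, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), written in plain Python with hypothetical names; it shows the basic mechanism only and is neither the thesis's regularizer nor its proposed architecture.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One step: the new state mixes the current input and the previous state."""
    pre = [u + w + c for u, w, c in zip(matvec(W_xh, x), matvec(W_hh, h_prev), b)]
    return [math.tanh(p) for p in pre]


def run_rnn(xs, h0, W_xh, W_hh, b):
    """Process a sequence and return the hidden state after every step."""
    h = h0
    states = []
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states


# 1-D input, 2-D hidden state; weights chosen by hand for illustration.
W_xh = [[1.0], [0.5]]
W_hh = [[0.5, 0.0], [0.0, 0.9]]
b = [0.0, 0.0]
print(run_rnn([[1.0], [0.0], [0.0]], [0.0, 0.0], W_xh, W_hh, b)[-1])
```

Because W_hh feeds the previous state back in, the first input still influences the final state two steps later even though the later inputs are zero; this persistence is what regularization and architectural changes to the hidden state aim to shape.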
Abstract:
BACKGROUND Resuscitation guidelines encourage the use of cardiopulmonary resuscitation (CPR) feedback devices, implying that they improve outcomes after sudden cardiac arrest. Whether effective continuous feedback could also be given verbally by a second rescuer ("human feedback") has not yet been investigated. We therefore compared the effect of human feedback with that of a CPR feedback device. METHODS In an open, prospective, randomised, controlled trial, we compared the CPR performance of three groups of medical students in a two-rescuer scenario. Group "sCPR" was taught standard BLS without continuous feedback, serving as control. Group "mfCPR" was taught BLS with mechanical audio-visual feedback (HeartStart MRx with Q-CPR-Technology™). Group "hfCPR" was taught standard BLS with human feedback. Afterwards, 326 medical students performed two-rescuer BLS on a manikin for 8 min. CPR quality parameters, such as the "effective compression ratio" (ECR: compressions with correct hand position, depth and complete decompression, multiplied by the flow-time fraction), and other compression, ventilation and time-related parameters were assessed for all groups. RESULTS ECR was comparable between the hfCPR and the mfCPR group (0.33 vs. 0.35, p = 0.435). The hfCPR group needed less time until starting chest compressions (2 vs. 8 s, p < 0.001) and showed fewer incorrect decompressions (26% vs. 33%, p = 0.044). On the other hand, absolute hands-off time was higher in the hfCPR group (67 vs. 60 s, p = 0.021). CONCLUSIONS The quality of CPR with human feedback or with a mechanical audio-visual feedback device was similar. Further studies should investigate whether extended human feedback training could further increase CPR quality at comparable training costs.
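The ECR definition above combines two fractions. A minimal sketch, assuming that "compressions with correct hand position, depth and complete decompression" means the share of all compressions that meet all three criteria, and that the flow-time fraction is the time with compressions divided by the total scenario time; the exact counting rules of the study and device are not given in the abstract, and the names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Compression:
    hand_position_ok: bool
    depth_ok: bool
    full_decompression: bool


def effective_compression_ratio(compressions, flow_time_s, total_time_s):
    """ECR = (fully correct compressions / all compressions) * flow-time fraction."""
    if not compressions or total_time_s <= 0:
        return 0.0
    correct = sum(
        1 for c in compressions
        if c.hand_position_ok and c.depth_ok and c.full_decompression
    )
    fraction_correct = correct / len(compressions)
    flow_time_fraction = flow_time_s / total_time_s
    return fraction_correct * flow_time_fraction


# Hypothetical run: 5 of 10 compressions fully correct, compressions
# delivered during 240 s of an 8-min (480 s) scenario.
run = [Compression(True, True, True)] * 5 + [Compression(True, False, True)] * 5
print(effective_compression_ratio(run, flow_time_s=240, total_time_s=480))
```

Because the two fractions multiply, long hands-off times lower the ECR even when every individual compression is performed correctly.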
Abstract:
BACKGROUND Screening for aphasia in acute stroke is crucial for directing patients to early language therapy. The Language Screening Test (LAST), originally developed in French, is a validated language screening test that allows detection of a language deficit within a few minutes. The aim of the present study was to develop and validate two parallel German versions of the LAST. METHODS The LAST includes subtests for naming, repetition, automatic speech, and comprehension. For the translation into German, task constructs and psycholinguistic criteria for item selection were identical to those of the French LAST. A cohort of 101 stroke patients, all native German speakers, was tested. Validation of the LAST was based on (1) analysis of the equivalence of the German versions, established by administering both versions successively in a subset of patients, (2) internal validity by means of internal consistency analysis, and (3) external validity by comparison with the short version of the Token Test in another subset of patients. RESULTS The two German versions were equivalent, as demonstrated by a high intraclass correlation coefficient of 0.91. Furthermore, an acceptable internal structure of the LAST was found (Cronbach's α = 0.74). A highly significant correlation (r = 0.74, p < 0.0001) between the LAST and the short version of the Token Test indicated good external validity of the scale. CONCLUSION The German version of the LAST, available in two parallel versions, is a new and valid language screening test in stroke.
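Internal consistency is reported above as Cronbach's α. For reference, α = k/(k − 1) · (1 − Σσ²ᵢ / σ²ₜ), where k is the number of items, σ²ᵢ the variance of item i, and σ²ₜ the variance of the total score. A minimal sketch with hypothetical data (not the study's); population variances are used throughout, which gives the same ratio as sample variances as long as they are used consistently.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """scores: one row per patient, each row holding k item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    items = list(zip(*scores))  # transpose to per-item columns
    item_var_sum = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Items that rise and fall together give high alpha; here alpha is 1.0.
print(cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
# Partially agreeing items give a lower value (2/3 here).
print(cronbach_alpha([[1, 2], [2, 1], [3, 3]]))
```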
Abstract:
Mode of access: Internet.