984 results for Decoding Speech Prosody
Abstract:
Progress in understanding brain/behavior relationships in adult-acquired dysprosody has led to models of cortical hemispheric representation of prosodic processing based on functional (linguistic vs affective) or physical (timing vs pitch) parameters. These explanatory perspectives have not been reconciled, nor have a number of neurobehavioral syndromes that include dysprosody among their neurological signs yet been integrated into them. In addition to expanding the functional perspective on prosody, some of these syndromes have implicated a significant role for subcortical nuclei in prosodic competence. In this article, two patients with acquired dysprosodic speech following damage to basal ganglia nuclei were evaluated using behavioral, acoustic, cognitive, and radiographic approaches. Selective quantitative measures were performed on each individual's performance to provide detailed verification and clarification of clinical observations, and to test hypotheses regarding prosodic function. These studies, combined with a review of related clinical research findings, exemplify the value of a broader perspective on the neurobehavioral dysfunction underlying acquired adult dysprosodic speech, and lead to a new proposed conceptual framework for the cerebral representation of prosody.
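The acoustic approach mentioned above typically means quantifying pitch (F0) and timing parameters from recorded utterances. A minimal sketch of such measures, assuming an F0 contour and syllable boundaries have already been obtained from a pitch tracker and a segmentation step; the variable names and the particular statistics are illustrative, not the authors' protocol:

```python
import numpy as np

def prosody_measures(f0_hz, syllable_bounds_s):
    """Summarize pitch and timing properties of one utterance.

    f0_hz: voiced-frame F0 estimates (Hz), unvoiced frames removed.
    syllable_bounds_s: list of (start, end) times in seconds per syllable.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    durations = np.array([end - start for start, end in syllable_bounds_s])

    return {
        # Pitch parameters: central tendency and variability of F0
        "f0_mean_hz": f0.mean(),
        "f0_range_semitones": 12 * np.log2(f0.max() / f0.min()),
        "f0_sd_semitones": (12 * np.log2(f0 / f0.mean())).std(),
        # Timing parameters: speech rate and rhythmic variability
        "mean_syllable_dur_s": durations.mean(),
        "syllable_dur_cv": durations.std() / durations.mean(),
    }

# Example: a flat, slow utterance shows a small F0 range and long syllables
print(prosody_measures(
    f0_hz=[118, 120, 121, 119, 122, 120],
    syllable_bounds_s=[(0.0, 0.35), (0.35, 0.80), (0.80, 1.30)],
))
```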
Abstract:
One of the overarching questions in the field of infant perceptual and cognitive development concerns how selective attention is organized during early development to facilitate learning. The following study examined how infants' selective attention to properties of social events (i.e., prosody of speech and facial identity) changes in real time as a function of intersensory redundancy (redundant audiovisual, nonredundant unimodal visual) and exploratory time. Intersensory redundancy refers to the spatially coordinated and temporally synchronous occurrence of information across multiple senses. Real-time macro- and micro-structural change in infants' scanning patterns of dynamic faces was also examined. According to the Intersensory Redundancy Hypothesis, information presented redundantly and in temporal synchrony across two or more senses recruits infants' selective attention and facilitates perceptual learning of highly salient amodal properties (properties that can be perceived across several sensory modalities, such as the prosody of speech) at the expense of less salient modality-specific properties. Conversely, information presented to only one sense facilitates infants' learning of modality-specific properties (properties that are specific to a particular sensory modality, such as facial features) at the expense of amodal properties (Bahrick & Lickliter, 2000, 2002). Infants' selective attention to and discrimination of prosody of speech and facial configuration were assessed in a modified visual paired comparison paradigm. In redundant audiovisual stimulation, it was predicted that infants would show discrimination of prosody of speech in the early phases of exploration and of facial configuration in the later phases of exploration. Conversely, in nonredundant unimodal visual stimulation, it was predicted that infants would show discrimination of facial identity in the early phases of exploration and of prosody of speech in the later phases of exploration. Results provided support for the first prediction and indicated that, following redundant audiovisual exposure, infants showed discrimination of prosody of speech earlier in processing time than discrimination of facial identity. Data from the nonredundant unimodal visual condition provided partial support for the second prediction and indicated that infants showed discrimination of facial identity, but not prosody of speech. The dissertation study contributes to the understanding of the nature of infants' selective attention and processing of social events across exploratory time.
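In a visual paired-comparison paradigm, discrimination is typically inferred from a novelty preference score (looking time to the novel stimulus divided by total looking time) exceeding chance within a given test phase. A minimal sketch of that computation with hypothetical looking-time data; the phase split and the one-sample test against 0.5 are illustrative assumptions, not the dissertation's actual analysis:

```python
import numpy as np
from scipy import stats

def novelty_preference(novel_look_s, familiar_look_s):
    """Proportion of total looking time spent on the novel stimulus."""
    novel = np.asarray(novel_look_s, dtype=float)
    familiar = np.asarray(familiar_look_s, dtype=float)
    return novel / (novel + familiar)

# Hypothetical per-infant looking times (seconds) in an early test phase
early_novel = [6.2, 5.8, 7.1, 4.9, 6.5]
early_familiar = [3.1, 4.0, 2.8, 4.2, 3.3]

scores = novelty_preference(early_novel, early_familiar)

# Discrimination is inferred when scores reliably exceed the 0.5 chance level
t, p = stats.ttest_1samp(scores, 0.5)
print(f"mean novelty preference = {scores.mean():.2f}, t = {t:.2f}, p = {p:.3f}")
```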
Abstract:
Spoken term detection (STD) commonly involves performing word- or sub-word-level speech recognition and indexing the result. This work challenges the assumption that improved speech recognition accuracy implies better indexing for STD. Using an index derived from phone lattices, this paper examines the effect of language model selection on the relationship between phone recognition accuracy and STD accuracy. Results suggest that language models usually improve phone recognition accuracy, but their inclusion does not always translate to improved STD accuracy. The findings suggest that using phone recognition accuracy to measure the quality of an STD index can be problematic, and highlight the need for an alternative that is more closely aligned with the goals of the specific detection task.
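The gap between recognition accuracy and STD accuracy arises because search operates over an index built from phone lattices, not over the one-best transcript. A minimal sketch of the indexing-and-search idea, using simple phone n-grams drawn from multiple lattice hypotheses; the data structure and names are illustrative and not the paper's actual lattice index:

```python
from collections import defaultdict

def build_phone_index(utterance_hypotheses, n=2):
    """Map each phone n-gram to the utterances whose lattice contains it."""
    index = defaultdict(set)
    for utt_id, hyps in utterance_hypotheses.items():
        for phones in hyps:                      # each lattice path / hypothesis
            for i in range(len(phones) - n + 1):
                index[tuple(phones[i:i + n])].add(utt_id)
    return index

def search(index, query_phones, n=2):
    """Return utterances whose index contains every n-gram of the query."""
    grams = [tuple(query_phones[i:i + n]) for i in range(len(query_phones) - n + 1)]
    hits = [index.get(g, set()) for g in grams]
    return set.intersection(*hits) if hits else set()

# Toy lattices: several phone hypotheses per utterance, retaining recognition errors
lattices = {
    "utt1": [["s", "p", "iy", "ch"], ["s", "b", "iy", "ch"]],
    "utt2": [["t", "er", "m"]],
}
index = build_phone_index(lattices)
print(search(index, ["s", "p", "iy", "ch"]))   # {'utt1'}
```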
Abstract:
While spoken term detection (STD) systems based on word indices provide good accuracy, there are several practical applications where it is infeasible or too costly to employ an LVCSR engine. An STD system is presented that is designed to incorporate a fast phonetic decoding front-end and to be robust to decoding errors whilst still allowing rapid search speeds. This goal is achieved through mono-phone open-loop decoding coupled with fast hierarchical phone lattice search. Results demonstrate that an STD system designed under the constraint of a fast and simple phonetic decoding front-end requires a compromise to be made between search speed and search accuracy.
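The speed/accuracy compromise described here comes from running a cheap, error-tolerant first pass and a more careful second pass only on surviving candidates. A minimal two-stage, hierarchy-style sketch (coarse phone-bigram filter, then approximate matching to tolerate decoding errors); the structure and thresholds are illustrative assumptions, not the system's actual lattice search:

```python
from difflib import SequenceMatcher

def bigrams(phones):
    return {tuple(phones[i:i + 2]) for i in range(len(phones) - 1)}

def two_stage_search(utterance_phones, query, threshold=0.6):
    """Cheap filter first, costlier error-tolerant match only on survivors.

    utterance_phones: dict mapping utterance id -> one-best phone sequence.
    query: phone sequence of the search term.
    """
    q_bigrams = bigrams(query)

    # Stage 1: keep only utterances sharing at least one phone bigram with the query
    candidates = [u for u, ph in utterance_phones.items() if q_bigrams & bigrams(ph)]

    # Stage 2: approximate matching tolerates phone decoding errors (substitutions, etc.)
    hits = []
    for u in candidates:
        sim = SequenceMatcher(None, utterance_phones[u], query).ratio()
        if sim >= threshold:
            hits.append((u, round(sim, 2)))
    return sorted(hits, key=lambda x: -x[1])

# Toy one-best phone decodings, including an error ("b" decoded instead of "p")
decodings = {
    "utt1": ["s", "b", "iy", "ch"],   # decoding error, still matches approximately
    "utt2": ["t", "er", "m"],
}
print(two_stage_search(decodings, ["s", "p", "iy", "ch"]))   # [('utt1', 0.75)]
```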
Abstract:
This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling is presented. These include the use of conversation-side-based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion-network-based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class-based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. Further details on the derivation of faster systems with moderate performance degradation are discussed in the context of the 2002 CU-HTK 10 × RT conversational speech transcription system. © 2005 IEEE.
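Among the front-end techniques listed, conversation-side-based cepstral normalization is the easiest to illustrate: mean (and often variance) statistics are accumulated over all frames from one side of the conversation and used to normalize each cepstral dimension. A minimal sketch of cepstral mean and variance normalization (CMVN) applied per conversation side; the array shapes and names are illustrative assumptions, not CU-HTK's implementation:

```python
import numpy as np

def cmvn_per_side(features_by_side):
    """Cepstral mean and variance normalization over each conversation side.

    features_by_side: dict mapping side id (e.g. "A", "B") to a list of
    per-utterance feature matrices of shape (num_frames, num_cepstra).
    """
    normalized = {}
    for side, utterances in features_by_side.items():
        # Accumulate statistics over *all* frames from this conversation side
        all_frames = np.vstack(utterances)
        mean = all_frames.mean(axis=0)
        std = all_frames.std(axis=0) + 1e-8      # guard against division by zero

        # Apply the same side-level statistics to every utterance of that side
        normalized[side] = [(u - mean) / std for u in utterances]
    return normalized

# Toy example: two utterances from side "A", 13 cepstral coefficients per frame
rng = np.random.default_rng(0)
side_feats = {"A": [rng.normal(2.0, 1.5, size=(50, 13)),
                    rng.normal(2.0, 1.5, size=(30, 13))]}
out = cmvn_per_side(side_feats)
print(np.vstack(out["A"]).mean(axis=0).round(3))   # approximately zero per dimension
```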
Abstract:
Accurate and fast decoding of speech imagery from electroencephalographic (EEG) data could serve as a basis for a new generation of brain-computer interfaces (BCIs) that are more portable and easier to use. However, decoding of speech imagery from EEG is a hard problem due to many factors. In this paper we focus on the analysis of the classification step of speech imagery decoding for a three-class vowel speech imagery recognition problem. We empirically show that different classification subtasks may require different classifiers for accurate decoding, and we obtain a classification accuracy that improves on the best previously published results. We further investigate the relationship between the classifiers and different sets of features selected by the common spatial patterns method. Our results indicate that further improvement of BCIs based on speech imagery could be achieved by carefully selecting an appropriate combination of classifiers for the subtasks involved.
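One way to realize "different classifiers for different subtasks" in a three-class problem is a one-vs-one decomposition, with common spatial patterns (CSP) features and a separately chosen classifier per class pair. A minimal sketch along those lines; the tiny two-class CSP, the particular classifier choices, and the synthetic data are illustrative assumptions, not the paper's pipeline:

```python
import numpy as np
from itertools import combinations
from scipy.linalg import eigh
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def csp_features(epochs, labels, n_comp=2):
    """Two-class CSP: log-variance of spatially filtered EEG epochs."""
    covs = [np.mean([e @ e.T / np.trace(e @ e.T) for e in epochs[labels == c]], axis=0)
            for c in np.unique(labels)]
    # Generalized eigenvalue problem; extreme eigenvectors maximize the variance ratio
    _, w = eigh(covs[0], covs[0] + covs[1])
    filters = np.concatenate([w[:, :n_comp], w[:, -n_comp:]], axis=1).T
    feats = np.array([np.log(np.var(filters @ e, axis=1)) for e in epochs])
    return feats, filters

# Synthetic "EEG": 90 epochs, 8 channels, 128 samples, three vowel classes
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 8, 128))
y = np.repeat([0, 1, 2], 30)

# One-vs-one subtasks, each with its own classifier (an assumption for illustration)
subtask_clf = {(0, 1): LinearDiscriminantAnalysis(), (0, 2): SVC(), (1, 2): SVC()}
models = {}
for a, b in combinations([0, 1, 2], 2):
    mask = np.isin(y, [a, b])
    feats, filters = csp_features(X[mask], y[mask])
    models[(a, b)] = (filters, subtask_clf[(a, b)].fit(feats, y[mask]))

def predict(epoch):
    """Majority vote over the pairwise subtask classifiers."""
    votes = [clf.predict(np.log(np.var(f @ epoch, axis=1))[None, :])[0]
             for f, clf in models.values()]
    return np.bincount(votes).argmax()

print(predict(X[0]))
```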