803 results for acoustic and linguistic cues
Abstract:
For many applications of emotion recognition, such as virtual agents, the system must select responses while the user is speaking. This requires reliable on-line recognition of the user’s affect. However, most emotion recognition systems are based on turn-wise processing. We present a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a two-dimensional valence-activation continuum. In contrast to current state-of-the-art approaches, recognition is performed on low-level signal frames, similar to those used for speech recognition. No statistical functionals are applied to low-level feature contours. Framing at a higher level is therefore unnecessary, and regression outputs can be produced in real time for every low-level input frame. We also investigate the benefits of including linguistic features, obtained by a keyword spotter, at the signal frame level.
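The frame-wise idea can be illustrated with a minimal sketch: a generic LSTM cell that emits one (valence, activation) pair for every input frame, with no statistics pooled over a turn. This is not the authors' network. The class name `FramewiseLSTM`, the layer sizes, the toy three-dimensional frames, and the random untrained weights are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Return weights @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


class FramewiseLSTM:
    """Generic LSTM cell with a 2-unit readout (hypothetical, untrained)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        # One weight matrix per gate, acting on [frame ; previous hidden state].
        self.gates = {name: (matrix(n_hidden, n_in + n_hidden), [0.0] * n_hidden)
                      for name in ("input", "forget", "output", "cell")}
        # Linear readout from the hidden state to (valence, activation).
        self.readout = (matrix(2, n_hidden), [0.0, 0.0])
        self.n_hidden = n_hidden

    def run(self, frames):
        """Return one (valence, activation) estimate per input frame."""
        h = [0.0] * self.n_hidden  # hidden state
        c = [0.0] * self.n_hidden  # cell state (the long-term memory)
        outputs = []
        for frame in frames:
            x = list(frame) + h
            i = [sigmoid(a) for a in affine(*self.gates["input"], x)]
            f = [sigmoid(a) for a in affine(*self.gates["forget"], x)]
            o = [sigmoid(a) for a in affine(*self.gates["output"], x)]
            g = [math.tanh(a) for a in affine(*self.gates["cell"], x)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            # tanh keeps both regression outputs in [-1, 1].
            valence, activation = (math.tanh(a) for a in affine(*self.readout, h))
            outputs.append((valence, activation))
        return outputs


# Toy low-level frames (assumed 3 features per frame).
toy_frames = [[0.1, 0.2, 0.3], [0.0, -0.4, 0.5], [0.9, 0.1, -0.2]]
estimates = FramewiseLSTM(n_in=3, n_hidden=4).run(toy_frames)
for t, (valence, activation) in enumerate(estimates):
    print(f"frame {t}: valence={valence:+.3f} activation={activation:+.3f}")
```

Because the loop only looks at the current frame and the carried state, each estimate is available as soon as its frame arrives, which is the property on-line processing needs.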
Abstract:
Speech melody or prosody subserves linguistic, emotional, and pragmatic functions in speech communication. Prosodic perception is based on the decoding of acoustic cues, with a predominant role for frequency-related information perceived as the speaker's pitch. Evaluation of prosodic meaning is a cognitive function implemented in cortical and subcortical networks that generate continuously updated affective or linguistic speaker impressions. Various brain-imaging methods allow delineation of the neural structures involved in prosody processing. In contrast to functional magnetic resonance imaging techniques, DC (direct current, slow) components of the EEG directly measure cortical activation without temporal delay. Activation patterns obtained with this method are highly task specific and intraindividually reproducible. The studies presented here investigated the topography of prosodic stimulus processing as a function of acoustic stimulus structure and of linguistic or affective task demands. Data obtained from measuring DC potentials demonstrated that the right hemisphere has a predominant role in processing emotions from the tone of voice, irrespective of emotional valence. However, right hemisphere involvement is modulated by diverse speech- and language-related conditions that are associated with left hemisphere participation in prosody processing. The degree of left hemisphere involvement depends on several factors, such as (i) articulatory demands on the perceiver of prosody (possibly also the poser), (ii) a relative left hemisphere specialization in processing temporal cues mediating prosodic meaning, and (iii) the propensity of prosody to act on the segment level in order to modulate word or sentence meaning. The specific role of top-down effects, in terms of either linguistically or affectively oriented attention, on the lateralization of stimulus processing is not clear and requires further investigation.
Abstract:
Emotional processes modulate the size of the eyeblink startle reflex in a picture-viewing paradigm, but it is unclear whether emotional processes are responsible for blink modulation in human conditioning. Experiment 1 involved an aversive differential conditioning phase followed by an extinction phase in which acoustic startle probes were presented during CS+, CS-, and intertrial intervals. Valence ratings and affective priming showed the CS+ was unpleasant postacquisition. Blink startle magnitude was larger during CS+ than during CS-. Experiment 2 used the same design in two groups trained with pleasant or unpleasant pictorial USs. Ratings and affective priming indicated that the CS+ had become pleasant or unpleasant in the respective group. Regardless of CS valence, blink startle was larger during CS+ than CS- in both groups. Thus, startle was not modulated by CS valence.
Abstract:
In this paper, we introduce a novel high-level visual content descriptor devised for semantic-based image classification and retrieval. The work can be treated as an attempt to bridge the so-called “semantic gap”. The proposed image feature vector model is underpinned by an image labelling framework called Collaterally Confirmed Labelling (CCL), which combines the collateral knowledge extracted from the collateral texts of the images with state-of-the-art low-level image processing and visual feature extraction techniques to automatically assign linguistic keywords to image regions. Two different high-level image feature vector models are developed from the CCL labelling results, for image data clustering and retrieval respectively. A subset of the Corel image collection has been used to evaluate our proposed method. The experimental results to date already indicate that our proposed semantic-based visual content descriptors outperform both traditional visual and textual image feature models.
Abstract:
In an increasingly multilingual world, the English language has kept a marked predominance as a global language. In many countries, English is the primary choice for foreign language learning. There is a long history of research in English language learning. The same applies to research in reading. A main interest since the 1970s has been the reading strategy defined as inferencing, or guessing the meaning of unknown words from context. Inferencing has been widely researched; however, the results and conclusions seem to be mixed. While some agree that inferencing is a useful strategy, others doubt its usefulness. Nevertheless, most of the research seems to agree that cultural background affects comprehension and inferencing. While most of these studies have been done with texts and contexts created by the researchers, little has been done using natural prose. The present study will attempt to further clarify the process of inferencing and the effects of the text’s cultural context and the linguistic background of the reader, using a text that has not been created by the researcher. The participants of the study are 40 international students from Turku, Finland. Their linguistic background was obtained through a questionnaire and proved to be diverse. Think-aloud protocols were performed to investigate their inferencing process and find connections between their inferences, comments, the text, and their linguistic background. The results show that some inferences were made based on the participants’ world knowledge, experience, other languages, and English language knowledge; other inferences and comments were made based on the text, its use of language and vocabulary, and the few cues provided by the author.
The results from the present study and previous research seem to show that: 1) linguistic background is a source of information for inferencing but is not a major source; 2) the cultural context of the text affected the inferences made by the participants according to their closeness or distance from it.
Abstract:
Automatic spoken Language Identification (LID) is the process of identifying the language spoken within an utterance. The challenge that this task presents is that no prior information is available indicating the content of the utterance or the identity of the speaker. The trend of globalization and the pervasive popularity of the Internet will amplify the need for the capabilities spoken language identification systems provide. A prominent application arises in call centers dealing with speakers speaking different languages. Another important application is to index or search huge speech data archives and corpora that contain multiple languages. The aim of this research is to develop techniques targeted at producing a fast and more accurate automatic spoken LID system compared to the previous National Institute of Standards and Technology (NIST) Language Recognition Evaluation. Acoustic and phonetic speech information are targeted as the most suitable features for representing the characteristics of a language. To model the acoustic speech features a Gaussian Mixture Model based approach is employed. Phonetic speech information is extracted using existing speech recognition technology. Various techniques to improve LID accuracy are also studied. One approach examined is the employment of Vocal Tract Length Normalization to reduce the speech variation caused by different speakers. A linear data fusion technique is adopted to combine the various aspects of information extracted from speech. As a result of this research, a LID system was implemented and presented for evaluation in the 2003 Language Recognition Evaluation conducted by the NIST.
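As a rough sketch of how the acoustic and phonetic streams might be combined, the example below scores an utterance against one diagonal-covariance Gaussian Mixture Model per language and then fuses that acoustic score linearly with a phonetic score. The abstract does not give the model sizes, features, or fusion weights, so the toy two-dimensional frames, the two-component models, the phonetic scores, and the weight `alpha` are all hypothetical.

```python
import math


def diag_gmm_loglik(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def acoustic_score(frames, gmm):
    """Average per-frame log-likelihood of an utterance under one language's GMM."""
    return sum(diag_gmm_loglik(f, *gmm) for f in frames) / len(frames)


def fuse(acoustic, phonetic, alpha=0.5):
    """Linear fusion: alpha * acoustic + (1 - alpha) * phonetic, per language."""
    return {lang: alpha * acoustic[lang] + (1 - alpha) * phonetic[lang]
            for lang in acoustic}


# Hypothetical per-language models: (weights, means, variances).
gmms = {
    "english": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "finnish": ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
utterance = [[0.2, -0.1], [0.9, 1.2], [0.4, 0.6]]  # toy feature frames

acoustic = {lang: acoustic_score(utterance, gmm) for lang, gmm in gmms.items()}
phonetic = {"english": -2.0, "finnish": -3.5}  # assumed phonetic-stream scores
fused = fuse(acoustic, phonetic, alpha=0.6)
best = max(fused, key=fused.get)
print(best, fused)
```

In practice the fusion weight would be tuned on held-out data, and the two score streams would usually be normalised to comparable ranges before being combined.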
Abstract:
Segmentation of novel or dynamic objects in a scene, often referred to as background subtraction or foreground segmentation, is critical for robust high-level computer vision applications such as object tracking, object classification and recognition. However, automatic real-time segmentation for robotics still poses challenges including global illumination changes, shadows, inter-reflections, colour similarity of foreground to background, and cluttered backgrounds. This paper introduces depth cues provided by structure from motion (SFM) for interactive segmentation to alleviate some of these challenges. In this paper, two prevailing interactive segmentation algorithms are compared: Lazysnapping [Li et al., 2004] and Grabcut [Rother et al., 2004], both based on graph-cut optimisation [Boykov and Jolly, 2001]. The algorithms are extended to include depth cues rather than colour only as in the original papers. Results show interactive segmentation based on colour and depth cues enhances the performance of segmentation with a lower error with respect to ground truth.
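One plausible way to bring depth into a graph-cut formulation is through the boundary (smoothness) weight between neighbouring pixels. The sketch below is an assumption about how such an extension could look, not the formulation used in the paper: the function name `boundary_weight`, the parameters `beta`, `gamma`, and `depth_scale`, and the toy pixel values are all hypothetical.

```python
import math


def boundary_weight(pixel_a, pixel_b, beta=0.05, depth_scale=1.0, gamma=50.0):
    """Smoothness weight between two neighbouring pixels given as (r, g, b, depth).

    A high weight makes the cut between the pixels expensive (they tend to get
    the same label); a low weight marks a likely object boundary.
    """
    colour_sq = sum((a - b) ** 2 for a, b in zip(pixel_a[:3], pixel_b[:3]))
    depth_sq = (pixel_a[3] - pixel_b[3]) ** 2
    return gamma * math.exp(-beta * (colour_sq + depth_scale * depth_sq))


# Two neighbours with nearly identical colour, at the same depth.
flat = boundary_weight((120, 80, 60, 2.0), (121, 80, 61, 2.0), depth_scale=10.0)
# The same colours, but with a 4-unit depth discontinuity between them.
jump = boundary_weight((120, 80, 60, 2.0), (121, 80, 61, 6.0), depth_scale=10.0)
print(f"same depth: {flat:.3f}, depth jump: {jump:.3f}")
```

With colour alone the two pairs would receive the same weight, so a foreground object that matches the background in colour could not be separated at that edge. Adding the depth term lowers the weight across the depth discontinuity, which is the situation the abstract lists as a challenge for colour-only segmentation.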
Abstract:
This thesis investigates the fusion of 3D visual information with 2D image cues to provide 3D semantic maps of large-scale environments traversed by a robot, for use in robotic applications. A major theme of this thesis was to exploit the availability of 3D information acquired from robot sensors to improve upon 2D object classification alone. The proposed methods have been evaluated on several indoor and outdoor datasets collected from mobile robotic platforms, including a quadcopter and a ground vehicle, covering several kilometres of urban roads.
Abstract:
Isolating processes within the brain that are specific to human behavior is a key goal for social neuroscience. The current research was an attempt to test whether recent findings of enhanced negative ERPs in response to unexpected human gaze are unique to eye gaze stimuli by comparing the effects of gaze cues with the effects of an arrow cue. ERPs were recorded while participants (N = 30) observed a virtual actor or an arrow that gazed (or pointed) either toward (object congruent) or away from (object incongruent) a flashing checkerboard. An enhanced negative ERP (N300) in response to object incongruent compared to object congruent trials was recorded for both eye gaze and arrow stimuli. The findings are interpreted as reflecting a domain general mechanism for detecting unexpected events.
Abstract:
This thesis examined the extent to which individual differences, as conceptualised by the revised Reinforcement Sensitivity Theory, influenced young drivers' information processing and subsequent acceptance of anti-speeding messages. Using a multi-method approach, the findings highlighted the utility of combining objective measures (a cognitive response time task and electroencephalography) with self-report measures to assess message processing and message acceptance, respectively. This body of research indicated that responses to anti-speeding messages may differ depending on an individual's personality disposition. Overall, the research provided further insight into the development of message strategies to target high risk drivers.
Abstract:
Localization of technology is now widely applied to the preservation and revival of the culture of indigenous peoples around the world, most commonly through translation into indigenous languages, which has been proven to increase the adoption of technology. However, this current form of localization excludes two demographic groups that are key to the effectiveness of localization efforts in the African context: the younger generation (under the age of thirty) with an Anglo-American cultural view, who have no need for or interest in their indigenous culture; and the older generation (over the age of fifty), who are very knowledgeable about their indigenous culture but have little or no knowledge of how to use a computer. This paper presents the design of a computer game engine that can be used to provide an interface for both technology and indigenous culture learning for both generations. Four indigenous Ugandan games are analyzed and identified for their attractiveness to both generations, to both rural and urban populations, and for their propensity to develop IT skills in older generations.
Abstract:
Teachers in the Pacific region have often signalled the need for more locally produced information texts in both the vernacular and English, to engage their readers with local content and to support literacy development across the curriculum. The Information Text Awareness Project (ITAP), initially informed by the work of Nea Stewart-Dore, has provided a means to address this need through supporting local teachers to write their own information texts. The article reports on the impact of an ITAP workshop carried out in Nadi, Fiji in 2012. Nine teacher volunteers from the project trialled the use of the texts in their classrooms with positive results in relation to student learning and belief in themselves as writers.