904 results for Audio-Visual Automatic Speech Recognition
Abstract:
Research into visual hallucinations has accelerated over the last decade from around 350 publications per year in 2000 to over 500 in 2010. Increased recognition of the frequent occurrence of visual hallucinations in a number of common disorders, coupled with improvements in the measurement of phenomenology, and more sophisticated imaging techniques have allowed the development and initial testing of sophisticated models. However, key questions remain unanswered. Amongst these are: whether there is a satisfactory definition of hallucinations in a constructive visual system; whether there are one, two or several core varieties of hallucinations; what are the underlying brain mechanisms for hallucinations; and what, if anything, can be done to treat them when they lead to distress? Looking across research in several clinical areas suggests a tentative integrative model that allows the possibility of answering these questions, but much work remains to be done.
Abstract:
Speech is often a multimodal process, presented audiovisually through a talking face. One area of speech perception influenced by visual speech is speech segmentation, or the process of breaking a stream of speech into individual words. Mitchel and Weiss (2013) demonstrated that a talking face contains specific cues to word boundaries and that subjects can correctly segment a speech stream when given a silent video of a speaker. The current study expanded upon these results, using an eye tracker to identify highly attended facial features of the audiovisual display used in Mitchel and Weiss (2013). In Experiment 1, subjects were found to spend the most time watching the eyes and mouth, with a trend suggesting that the mouth was viewed more than the eyes. Although subjects displayed significant learning of word boundaries, performance was not correlated with gaze duration on any individual feature, nor was performance correlated with a behavioral measure of autistic-like traits. However, trends suggested that as autistic-like traits increased, gaze duration on the mouth increased and gaze duration on the eyes decreased, similar to significant trends seen in autistic populations (Boraston & Blakemore, 2007). In Experiment 2, the same video was modified so that a black bar covered the eyes or mouth. Both videos elicited learning of word boundaries that was equivalent to that seen in the first experiment. Again, no correlations were found between segmentation performance and SRS scores in either condition. These results, taken with those in Experiment 1, suggest that neither the eyes nor the mouth is critical to speech segmentation and that perhaps more global head movements indicate word boundaries (see Graf, Cosatto, Strom, & Huang, 2002). Future work will elucidate the contribution of individual features relative to global head movements, as well as extend these results to additional types of speech tasks.
Abstract:
We investigated how well structural features such as note density or the relative number of changes in the melodic contour could predict success in implicit and explicit memory for unfamiliar melodies. We also analyzed which features are more likely to elicit increasingly confident judgments of "old" in a recognition memory task. An automated analysis program computed structural aspects of melodies, both independent of any context, and also with reference to the other melodies in the test set and the parent corpus of pop music. A few features predicted success in both memory tasks, which points to a shared memory component. However, motivic complexity compared to a large corpus of pop music had different effects on explicit and implicit memory. We also found that just a few features are associated with different rates of "old" judgments, whether the items were old or new. Rarer motives relative to the test set predicted hits and rarer motives relative to the corpus predicted false alarms. This data-driven analysis provides further support for both shared and separable mechanisms in implicit and explicit memory retrieval, as well as the role of distinctiveness in true and false judgments of familiarity.
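The abstract names two structural features, note density and the relative number of changes in the melodic contour, without defining them. The sketch below shows one plausible way such features could be computed. The function names and both definitions (notes per second; fraction of consecutive pitch steps that reverse direction) are assumptions made for illustration, not the authors' analysis program.

```python
def note_density(onsets_sec):
    """Notes per second over the span from the first to the last onset.

    Assumed definition; the abstract does not specify one.
    """
    if len(onsets_sec) < 2:
        return 0.0
    span = onsets_sec[-1] - onsets_sec[0]
    return len(onsets_sec) / span if span > 0 else 0.0


def contour_change_ratio(pitches):
    """Fraction of consecutive pitch steps at which the direction reverses.

    Repeated pitches are ignored so that only up/down motion counts.
    Assumed definition; the abstract does not specify one.
    """
    steps = [b - a for a, b in zip(pitches, pitches[1:]) if b != a]
    if len(steps) < 2:
        return 0.0
    changes = sum(1 for s, t in zip(steps, steps[1:]) if (s > 0) != (t > 0))
    return changes / (len(steps) - 1)


# Hypothetical melody: MIDI pitches and onset times in seconds.
melody_pitches = [60, 62, 64, 62, 60, 62]
melody_onsets = [0.0, 0.5, 1.0, 1.5, 2.0]
density = note_density(melody_onsets)
contour = contour_change_ratio(melody_pitches)
```

Features of this kind would then serve as predictors of memory performance in a statistical model, which is outside the scope of this sketch.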
Abstract:
Recent advances in the field of statistical learning have established that learners are able to track regularities of multimodal stimuli, yet it is unknown whether the statistical computations are performed on integrated representations or on separate, unimodal representations. In the present study, we investigated the ability of adults to integrate audio and visual input during statistical learning. We presented learners with a speech stream synchronized with a video of a speaker's face. In the critical condition, the visual (e.g., /gi/) and auditory (e.g., /mi/) signals were occasionally incongruent, which we predicted would produce the McGurk illusion, resulting in the perception of an audiovisual syllable (e.g., /ni/). In this way, we used the McGurk illusion to manipulate the underlying statistical structure of the speech streams, such that perception of these illusory syllables facilitated participants' ability to segment the speech stream. Our results therefore demonstrate that participants can integrate audio and visual input to perceive the McGurk illusion during statistical learning. We interpret our findings as support for modality-interactive accounts of statistical learning.
Comparative Analysis of Russian and French Prosodies: Theoretical, Experimental and Applied Aspects
Abstract:
Experience shows that in teaching the pronunciation of a foreign language, it is the native syllable stereotype that resists correction most strongly. This is because the syllable is the basic unit of the perception and production of speech, and syllabic production is highly automatic and to some degree determines the prosody of speech at all levels: accent, rhythm, phrase, etc. The results of psycho-physiological studies show that the human acoustic analyser is a typical contemplator organ and new acoustic qualities are perceived through their inclusion into the already existing system of values characteristic of the mother tongue. This results in the adaptation of the perception, and hence the production, of foreign speech to native patterns. The less conscious the perception of the unit and the more 'primitive' its status, the greater the degree of its auditory assimilation, and the syllable is certainly among the less controllable linguistic units. The group carried out a complex investigation of the French and Russian languages at the level of syllable realisation, focusing on the stressed syllable of both open and closed types. The useful acoustic characteristics of the French/Russian syllable pattern were determined through identifying a typical syllable pattern within the system of each of the two languages, comparing these patterns to establish their contrasting features, and observing and systematising deviations from the pattern typical of the French/Russian language teaching situation. The components of the syllable pattern shown to need particular attention in teaching French pronunciation to Russian native speakers were intensity, fundamental frequency, and duration. The group then developed a method of correction which combines the auditory and visual channels of sound signal perception and tested this method with groups of Russian students of different levels.
Abstract:
A new implantable hearing system, the direct acoustic cochlear stimulator (DACS), is presented. This system is based on the principle of a power-driven stapes prosthesis and intended for the treatment of severe mixed hearing loss due to advanced otosclerosis. It consists of an implantable electromagnetic transducer, which transfers acoustic energy directly to the inner ear, and an audio processor worn externally behind the implanted ear. The device is implanted using a specially developed retromeatal microsurgical approach. After removal of the stapes, a conventional stapes prosthesis is attached to the transducer and placed in the oval window to allow direct acoustical coupling to the perilymph of the inner ear. In order to restore the natural sound transmission of the ossicular chain, a second stapes prosthesis is placed in parallel to the first one into the oval window and attached to the patient's own incus, as in a conventional stapedectomy. Four patients were implanted with an investigational DACS device. The hearing threshold of the implanted ears before implantation ranged from 78 to 101 dB (air conduction, pure tone average, 0.5-4 kHz) with air-bone gaps of 33-44 dB in the same frequency range. Postoperatively, substantial improvements in sound field thresholds, speech intelligibility as well as in the subjective assessment of everyday situations were found in all patients. Two years after the implantations, monosyllabic word recognition scores in quiet at 75 dB improved by 45-100 percentage points when using the DACS. Furthermore, hearing thresholds were already improved by the second stapes prosthesis alone by 14-28 dB (pure tone average 0.5-4 kHz, DACS switched off). No device-related serious medical complications occurred and all patients have continued to use their device on a daily basis for over 2 years. Copyright (c) 2008 S. Karger AG, Basel.
Abstract:
From Bush’s September 20, 2001 “War on Terror” speech to Congress to President-Elect Barack Obama’s acceptance speech on November 4, 2008, the U.S. Army produced visual recruitment material that addressed the concerns of falling enlistment numbers (due to the prolonged and difficult war in Iraq) with quickly evolving and compelling rhetorical appeals: from the introduction of an “Army of One” (2001) to “Army Strong” (2006); from messages focused on education and individual identity to high-energy adventure and simulated combat scenarios, distributed through everything from printed posters and music videos to first-person tactical-shooter video games. These highly polished, professional visual appeals, introduced to the American public during an unpopular war fought by volunteers, provide rich subject matter for research and analysis. This dissertation takes a multidisciplinary approach to the visual media utilized as part of the Army’s recruitment efforts during the War on Terror, focusing on American myths, as defined by Barthes, and how these myths are both revealed and reinforced through design across media platforms. Placing each selection in its historical context, this dissertation analyzes how printed materials changed as the War on Terror continued. It examines the television ad that introduced “Army Strong” to the American public, considering how the combination of moving image, text, and music structures the message and the way we receive it. This dissertation also analyzes the video game America’s Army, focusing on how the interaction of the human player and the computer-generated player combine to enhance the persuasive qualities of the recruitment message. Each chapter discusses how the design of the particular medium facilitates the engagement and interactivity of the viewer.
The conclusion considers what recruitment material produced during this time period suggests about the persuasive strategies of different media and how they create distinct relationships with their spectators. It also addresses how theoretical frameworks and critical concepts used by a variety of disciplines can be combined to analyze recruitment media using a Selber-inspired three-literacy framework (functional, critical, rhetorical), and how this framework can contribute to the multimodal classroom by allowing instructors and students to do a comparative analysis of multiple forms of visual media with similar content.
Abstract:
OBJECTIVE: To develop a novel application of a tool for semi-automatic volume segmentation and adapt it for analysis of fetal cardiac cavities and vessels from heart volume datasets. METHODS: We studied retrospectively virtual cardiac volume cycles obtained with spatiotemporal image correlation (STIC) from six fetuses with postnatally confirmed diagnoses: four with normal hearts between 19 and 29 completed gestational weeks, one with d-transposition of the great arteries and one with hypoplastic left heart syndrome. The volumes were analyzed offline using a commercially available segmentation algorithm designed for ovarian folliculometry. Using this software, individual 'cavities' in a static volume are selected and assigned individual colors in cross-sections and in 3D-rendered views, and their dimensions (diameters and volumes) can be calculated. RESULTS: Individual segments of fetal cardiac cavities could be separated, adjacent segments merged and the resulting electronic casts studied in their spatial context. Volume measurements could also be performed. Exemplary images and interactive videoclips showing the segmented digital casts were generated. CONCLUSION: The approach presented here is an important step towards an automated fetal volume echocardiogram. It has the potential both to help in obtaining a correct structural diagnosis, and to generate exemplary visual displays of cardiac anatomy in normal and structurally abnormal cases for consultation and teaching.
Abstract:
Obesity is becoming an epidemic phenomenon in most developed countries. The fundamental cause of obesity and overweight is an energy imbalance between calories consumed and calories expended. It is essential to monitor everyday food intake for obesity prevention and management. Existing dietary assessment methods usually require manual recording and recall of food types and portions. The accuracy of the results relies largely on uncertain factors such as the user's memory, food knowledge, and portion estimates, and is therefore often compromised. Accurate and convenient dietary assessment methods are still lacking and are needed both by the general population and by the research community. In this thesis, an automatic food intake assessment method using the cameras and inertial measurement units (IMUs) of smart phones was developed to help people foster a healthy lifestyle. With this method, users use their smart phones before and after a meal to capture images or videos of the meal. The smart phone recognizes the food items, calculates the volume of the food consumed, and provides the results to the user. The technical objective is to explore the feasibility of image-based food recognition and image-based volume estimation. This thesis comprises five publications that address four specific goals of this work: (1) to develop a prototype system with existing methods to review the literature methods, find their drawbacks and explore the feasibility of developing novel methods; (2) based on the prototype system, to investigate new food classification methods to improve the recognition accuracy to a field application level; (3) to design indexing methods for large-scale image databases to facilitate the development of new food image recognition and retrieval algorithms; (4) to develop novel, convenient and accurate food volume estimation methods using only smart phones with cameras and IMUs. A prototype system was implemented to review existing methods.
An image feature detector and descriptor were developed and a nearest neighbor classifier was implemented to classify food items. A credit card marker method was introduced for metric scale 3D reconstruction and volume calculation. To increase recognition accuracy, novel multi-view food recognition algorithms were developed to recognize regular-shape food items. To further increase the accuracy and make the algorithm applicable to arbitrary food items, new food features and new classifiers were designed. The efficiency of the algorithm was increased by developing a novel image indexing method for large-scale image databases. Finally, the volume calculation was enhanced by reducing the marker and introducing IMUs. Sensor fusion techniques that combine measurements from cameras and IMUs were explored to infer the metric scale of the 3D model as well as to reduce noise from these sensors.
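The role of a card marker in metric scale 3D reconstruction can be illustrated with a short sketch. A reconstruction from images alone is defined only up to an unknown scale, so an object of known physical size fixes the conversion from model units to millimetres, and volumes then scale with the cube of that factor. The code assumes the marker is a standard ID-1 payment card (85.60 mm wide per ISO/IEC 7810); the function names and example numbers are hypothetical and do not come from the thesis.

```python
# Width of an ID-1 card (the size of a typical payment card), in millimetres.
# Assumption: the marker in the scene is such a card.
CARD_WIDTH_MM = 85.60


def metric_scale(card_width_model_units):
    """Millimetres per model unit, from the card's width measured in the 3D model."""
    if card_width_model_units <= 0:
        raise ValueError("measured card width must be positive")
    return CARD_WIDTH_MM / card_width_model_units


def volume_ml(volume_model_units, scale_mm_per_unit):
    """Convert a volume in cubic model units to millilitres.

    Lengths scale linearly, so volumes scale with the cube of the factor.
    1 mL equals 1000 cubic millimetres.
    """
    volume_mm3 = volume_model_units * scale_mm_per_unit ** 3
    return volume_mm3 / 1000.0


# Hypothetical measurements taken from an up-to-scale reconstruction.
scale = metric_scale(0.428)          # card spans 0.428 model units
food_volume = volume_ml(0.05, scale)  # food occupies 0.05 cubic model units
```

Because the scale factor enters cubed, a small error in the measured marker width produces roughly three times that relative error in the volume, which is one reason fusing additional sensors such as IMUs could help.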
Abstract:
This article describes a series of experiments which were carried out to measure the sense of presence in auditory virtual environments. Within the study a comparison of self-created signals to signals created by the surrounding environment is drawn. Furthermore, it is investigated if the room characteristics of the simulated environment have consequences on the perception of presence during vocalization or when listening to speech. Finally the experiments give information about the influence of background signals on the sense of presence. In the experiments subjects rated the degree of perceived presence in an auditory virtual environment on a perceptual scale. It is described which parameters have the most influence on the perception of presence and which ones are of minor influence. The results show that on the one hand an external speaker has more influence on the sense of presence than an adequate presentation of one’s own voice. On the other hand both room reflections and adequately presented background signals significantly increase the perceived presence in the virtual environment.
Abstract:
Visual fixation is employed by humans and some animals to keep a specific 3D location at the center of the visual gaze. Inspired by this phenomenon in nature, this paper explores the idea to transfer this mechanism to the context of video stabilization for a handheld video camera. A novel approach is presented that stabilizes a video by fixating on automatically extracted 3D target points. This approach is different from existing automatic solutions that stabilize the video by smoothing. To determine the 3D target points, the recorded scene is analyzed with a state-of-the-art structure-from-motion algorithm, which estimates camera motion and reconstructs a 3D point cloud of the static scene objects. Special algorithms are presented that search either virtual or real 3D target points, which back-project close to the center of the image for as long a period of time as possible. The stabilization algorithm then transforms the original images of the sequence so that these 3D target points are kept exactly in the center of the image, which, in case of real 3D target points, produces a perfectly stable result at the image center. Furthermore, different methods of additional user interaction are investigated. It is shown that the stabilization process can easily be controlled and that it can be combined with state-of-the-art tracking techniques in order to obtain a powerful image stabilization tool. The approach is evaluated on a variety of videos taken with a hand-held camera in natural scenes.
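The core idea of keeping a 3D target point at the image center can be sketched with a pinhole camera model. The code below projects a target point, given in the coordinates of the current camera, into the image and computes the pixel shift that would move that projection to the center. This is a deliberately simplified illustration: it assumes a pure 2D translation of the frame and a principal point at the image center, whereas the abstract does not state which image transformation the authors use. All names and numbers are hypothetical.

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates.

    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def centering_shift(point_cam, fx, fy, width, height):
    """Pixel translation that moves the target's projection to the image center.

    Assumption: the principal point coincides with the image center.
    """
    cx, cy = width / 2.0, height / 2.0
    u, v = project(point_cam, fx, fy, cx, cy)
    return (cx - u, cy - v)


# Hypothetical frame: target slightly right of and above the optical axis.
shift = centering_shift((0.125, -0.0625, 2.0), fx=800.0, fy=800.0,
                        width=1280, height=720)
```

Applying such a shift to every frame, with the target's camera coordinates taken from the estimated camera motion, would hold the target at the center; a rotation-compensating warp would be a natural refinement of this translation-only sketch.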
Abstract:
Natural soil profiles may be interpreted as an arrangement of parts which are characterized by properties like hydraulic conductivity and water retention function. These parts form a complicated structure. Characterizing the soil structure is fundamental in subsurface hydrology because it has a crucial influence on flow and transport and defines the patterns of many ecological processes. We applied an image analysis method for recognition and classification of visual soil attributes in order to model flow and transport through a man-made soil profile. Modeled and measured saturation-dependent effective parameters were compared. We found that characterizing and describing conductivity patterns in soils with sharp conductivity contrasts is feasible. In contrast, solving flow and transport on the basis of these conductivity maps is difficult and, in general, requires special care for representation of small-scale processes.
Abstract:
Effective visual exploration is required for many activities of daily living, and instruments to assess visual exploration are important for the evaluation of the visual and the oculomotor system. In this article, the development of a new instrument to measure central and peripheral target recognition is described. The measurement setup consists of a hemispherical projection which allows presenting images over a large area of ±90° horizontal and vertical angle. In a feasibility study with 14 younger (21–49 years) and 12 older (50–78 years) test persons, 132 targets and 24 distractors were presented within naturalistic color photographs of everyday scenes at 10°, 30°, and 50° eccentricity. After the experiment, both younger and older participants reported in a questionnaire that the task is easy to understand and fun, and that it measures a competence that is relevant for activities of daily living. A main result of the pilot study was that younger participants recognized more targets with shorter reaction times than older participants. The group differences were most pronounced for peripheral target detection. This test is feasible and appropriate to assess the functional field of view in younger and older adults.
Abstract:
Comprehending speech is one of the most important human behaviors, but we are only beginning to understand how the brain accomplishes this difficult task. One key to speech perception seems to be that the brain integrates the independent sources of information available in the auditory and visual modalities in a process known as multisensory integration. This allows speech perception to be accurate, even in environments in which one modality or the other is ambiguous in the context of noise. Previous electrophysiological and functional magnetic resonance imaging (fMRI) experiments have implicated the posterior superior temporal sulcus (STS) in auditory-visual integration of both speech and non-speech stimuli. While prior imaging studies have found increases in STS activity for audiovisual speech compared with unisensory auditory or visual speech, these studies do not provide a clear mechanism as to how the STS communicates with early sensory areas to integrate the two streams of information into a coherent audiovisual percept. Furthermore, it is currently unknown whether the activity within the STS is directly correlated with the strength of audiovisual perception. In order to better understand the cortical mechanisms that underlie audiovisual speech perception, we first studied the STS activity and connectivity during the perception of speech with auditory and visual components of varying intelligibility. By studying fMRI activity during these noisy audiovisual speech stimuli, we found that STS connectivity with auditory and visual cortical areas mirrored perception; when the information from one modality is unreliable and noisy, the STS interacts less with the cortex processing that modality and more with the cortex processing the reliable information.
We next characterized the role of STS activity during a striking audiovisual speech illusion, the McGurk effect, to determine if activity within the STS predicts how strongly a person integrates auditory and visual speech information. Subjects with greater susceptibility to the McGurk effect exhibited stronger fMRI activation of the STS during perception of McGurk syllables, implying a direct correlation between strength of audiovisual integration of speech and activity within the multisensory STS.