44 results for Audio-Visual Automatic Speech Recognition
Abstract:
Speech recognition and language analysis of spontaneous speech arising in naturally spoken conversations are becoming the subject of much research. However, there is a shortage of spontaneous speech corpora that are freely available for academics. We therefore undertook the building of a natural conversation speech database, recording over 200 hours of conversations in English by over 600 local university students. With few exceptions, the students used their own cell phones from their own rooms or homes to speak to one another, and they were permitted to speak on any topic they chose. Although they knew that they were being recorded and that they would receive a small payment, their conversations in the corpus are probably very close to being natural and spontaneous. This paper describes a detailed case study of the problems we faced and the methods we used to make the recordings and control the collection of these social science data on a limited budget.
Abstract:
This paper considers the separation and recognition of overlapped speech sentences assuming single-channel observation. A system based on a combination of several different techniques is proposed. The system uses a missing-feature approach for improving crosstalk/noise robustness, a Wiener filter for speech enhancement, hidden Markov models for speech reconstruction, and speaker-dependent/-independent modeling for speaker and speech recognition. We develop the system on the Speech Separation Challenge database, involving a task of separating and recognizing two mixed sentences without assuming advance knowledge of the speakers' identities or the signal-to-noise ratio. The paper is an extended version of a previous conference paper submitted for the challenge.
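The Wiener-filter stage mentioned above is a standard enhancement building block. As a minimal illustration in Python (a generic textbook version, not the authors' implementation; the frame length and the assumption that the first ten frames are speech-free are illustrative choices):

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(noisy, fs, nperseg=512, noise_frames=10):
    # Estimate the noise power spectrum from the first `noise_frames`
    # STFT frames, assumed (hypothetically) to contain no speech.
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    noise_psd = np.mean(np.abs(Z[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    # Spectral-subtraction estimate of the clean-speech power, floored at a
    # small positive value to avoid negative power estimates.
    speech_psd = np.maximum(np.abs(Z) ** 2 - noise_psd, 1e-10)
    # Wiener gain per time-frequency bin: S / (S + N).
    gain = speech_psd / (speech_psd + noise_psd)
    _, enhanced = istft(gain * Z, fs=fs, nperseg=nperseg)
    return enhanced
```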
Abstract:
Despite the importance of laughter in social interactions, it remains little studied in affective computing. Respiratory, auditory, and facial laughter signals have been investigated, but laughter-related body movements have received almost no attention. The aim of this study is twofold: first, an investigation of observers' perception of laughter states (hilarious, social, awkward, fake, and non-laughter) from body movements alone, through their categorization of avatars animated with natural and acted motion-capture data. Significant differences in torso and limb movements were found between animations perceived as containing laughter and those perceived as non-laughter. Hilarious laughter also differed from social laughter in the amount of bending of the spine, the amount of shoulder rotation, and the amount of hand movement. The body-movement features indicative of laughter differed between sitting and standing avatar postures. Based on the positive findings of this perceptual study, the second aim is to investigate the possibility of automatically predicting the distributions of observers' ratings for the laughter states. The findings show that the automated laughter recognition rates approach human rating levels, with the Random Forest method yielding the best performance.
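As a rough sketch of the prediction step (the features and data below are synthetic placeholders, not the study's motion-capture features), a multi-output Random Forest can regress the distribution of observer ratings over the five laughter states from body-movement descriptors:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Placeholder inputs: one row per animated clip, columns standing in for
# body-movement descriptors (e.g. spine bending, shoulder rotation).
X = rng.normal(size=(200, 6))
# Placeholder targets: observers' rating distribution over the five states
# (hilarious, social, awkward, fake, non-laughter); each row sums to 1.
Y = rng.dirichlet(np.ones(5), size=200)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, Y)                 # scikit-learn handles multi-output targets
pred = model.predict(X[:1])     # predicted rating distribution for one clip
print(pred)                     # sums only approximately to 1 (not enforced)
```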
Abstract:
This paper presents a new approach to speech enhancement from single-channel measurements involving both noise and channel distortion (i.e., convolutional noise), and demonstrates its applications for robust speech recognition and for improving noisy speech quality. The approach is based on finding longest matching segments (LMS) from a corpus of clean, wideband speech. It adds three novel developments to our previous LMS research. First, we address channel distortion as well as additive noise. Second, we present an improved method for modeling noise for speech estimation. Third, we present an iterative algorithm which updates the noise and channel estimates of the corpus data model. In speech recognition experiments on the Aurora 4 database, using our enhancement approach as a preprocessor for feature extraction significantly improved the performance of a baseline recognition system. In another comparison against conventional enhancement algorithms, both the PESQ and the segmental SNR ratings of the LMS algorithm were superior to those of the other methods for noisy speech enhancement.
Abstract:
This paper presents a new approach to single-channel speech enhancement involving both noise and channel distortion (i.e., convolutional noise). The approach is based on finding longest matching segments (LMS) from a corpus of clean, wideband speech. It adds three novel developments to our previous LMS research. First, we address channel distortion as well as additive noise. Second, we present an improved method for modeling noise. Third, we present an iterative algorithm for improved speech estimates. In speech recognition experiments on the Aurora 4 database, using our enhancement approach as a preprocessor for feature extraction significantly improved the performance of a baseline recognition system. In another comparison against conventional enhancement algorithms, both the PESQ and the segmental SNR ratings of the LMS algorithm were superior to those of the other methods for noisy speech enhancement.
Index Terms: corpus-based speech model, longest matching segment, speech enhancement, speech recognition
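Of the two quality measures reported above, segmental SNR is simple enough to sketch (PESQ requires the ITU-T P.862 reference implementation). A minimal Python version, using the conventional per-frame clamping range rather than anything taken from the paper:

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, clamp=(-10.0, 35.0)):
    # Trim both signals to a whole number of frames and split into frames.
    n = min(len(clean), len(enhanced)) // frame_len * frame_len
    c = clean[:n].reshape(-1, frame_len)
    e = enhanced[:n].reshape(-1, frame_len)
    # Per-frame SNR in dB, treating the clean signal as the reference and
    # the difference as the residual noise.
    noise = c - e
    snr = 10 * np.log10(np.sum(c ** 2, axis=1)
                        / (np.sum(noise ** 2, axis=1) + 1e-10) + 1e-10)
    # Clamp each frame's SNR before averaging, as is conventional.
    return float(np.mean(np.clip(snr, *clamp)))
```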
Abstract:
A software system, recently developed by the authors for the efficient capture, editing, and delivery of audio-visual web lectures, was used to create a series of lectures for a first-year undergraduate course in Dynamics. These web lectures were developed to serve as an extra study resource for students attending lectures, not as a replacement. A questionnaire was produced to obtain feedback from students. The overall response was very favorable, and numerous requests were made for other lecturers to adopt this technology. Despite the students' approval of this added resource, there was no significant improvement in overall examination performance.
Abstract:
Magnetoencephalography (MEG) was recorded while 5-7 year-old children performed a visual-spatial memory recognition task. Full-term children showed greater gamma-band (30-50 Hz) amplitude in the right temporal region during the task than children who were born extremely preterm. These results may represent altered brain processing in extremely preterm children who escape major impairment.
Abstract:
The Routledge Guide to Interviewing sets out a well-tested and practical approach and methodology: what works, difficulties and dangers to avoid, and key questions that must be answered before you set out. Background methodological issues and arguments are considered and drawn upon, but the focus is on what is ethical, legally acceptable and productive:
- Rationale (why, what for, where, how)
- Ethics and Legalities (informed consent, data protection, risks, embargoes)
- Resources (organisational, technical, intellectual)
- Preparation (selecting and approaching interviewees, background and biographical research, establishing credentials, identifying topics)
- Technique (developing expertise and confidence)
- Audio-visual interviews
- Analysis (modes, methods, difficulties)
- Storage (archiving and long-term preservation)
- Sharing Resources (dissemination and development)
From death row to the mansion of a head of state, small kitchens and front parlours, to legislatures and presbyteries, Anna Bryson and Seán McConville’s wide interviewing experience has been condensed into this book. The material set out here has been acquired by trial, error and reflection over a period of more than four decades. The interviewees have ranged from the delightfully straightforward to the painfully difficult to the near impossible – with a sprinkling of those that were impossible.
Successful interviewing draws on the survival skills of everyday life. This guide will help you to adapt, develop and apply these innate skills. Including a range of useful information such as sample waivers, internet resources, useful hints and checklists, it provides sound and plain-speaking support for the oral historian, social scientist and investigator.
Abstract:
Existing referencing systems frequently prove inadequate for the citation of moving image and sound media such as vidcasts, streaming television, sound files, un-catalogued archive footage, amateur content hosted online, or non-broadcast radio recordings. In 2009 and 2010, a British working group funded by the Higher Education Funding Council for England (HEFCE) and co-ordinated by the British Universities Film and Video Council investigated this problem. This report documents the early stages of the project.
Abstract:
Through the concept of sonic resonance, the project Cidade Museu – Museum City explores five derelict or transitional spaces in the city of Viseu. The activation and capture of these spaces develops an audio-visual memory that reflects architectures, stories and experiences, while creating a sense of place through sounds and images.
The project brings together musicians with a background in contemporary music, electroacoustic music and improvisation and a visual artist focusing on photography and video.
Each member of the collective explores the selected spaces in order to activate them with their respective instruments and through sound projection, in an iterative process in which the source of activation gradually gives way to each space's resonances and acoustic character. The museum city (a nickname for Viseu), in this performance, exposes the contrast between the grandeur and multi-faceted architecture of Viseu’s Cathedral and the spaces spread throughout the city waiting for a new future.
The performance in the Cathedral (Sé) features a trio ensemble, an eight-channel sound system and video projection, presenting the audio recordings and images made in each of the five spaces. The audience is invited to explore the relations between the various buildings and their stories while being immersed in their resonances and visual projections.
The performance explores the following spaces in Viseu: the old Orfeão (music hall), an old wine cellar, a mansion home to the national road services, a house with its grounds in Rua Silva Gaio and an old slaughterhouse.
Abstract:
Situational awareness is achieved naturally by the human senses of sight and hearing in combination. Automatic scene understanding aims at replicating this human ability using microphones and cameras in cooperation. In this paper, audio and video signals are fused and integrated at different levels of semantic abstraction. We detect and track a speaker who is relatively unconstrained, i.e., free to move indoors within an area larger than in comparable reported work, which is usually limited to round-table meetings. The system is relatively simple, consisting of just four microphone pairs and a single camera. Results show that the overall multimodal tracker is more reliable than single-modality systems, tolerating large occlusions and cross-talk. System evaluation is performed on both single- and multi-modality tracking. The performance improvement given by the audio-video integration and fusion is quantified in terms of tracking precision and accuracy as well as speaker diarisation error rate and precision-recall (recognition). Improvements over the closest works are evaluated: 56% in sound-source localisation computational cost compared with an audio-only system, 8% in speaker diarisation error rate compared with an audio-only speaker recognition unit, and 36% on the precision-recall metric compared with an audio-video dominant-speaker recognition method.
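As a hedged illustration of the late-fusion idea (a generic scheme, not this paper's architecture), each modality's position estimate can be treated as a Gaussian likelihood over candidate locations and the two fused by taking their product, so that a less certain modality, e.g. an occluded camera or a cross-talk-corrupted microphone pair, contributes a flatter likelihood and influences the result less:

```python
import numpy as np

def fuse_position(audio_est, audio_var, video_est, video_var, grid):
    # Isotropic Gaussian likelihood of a 2-D estimate over candidate points.
    def gauss(mu, var):
        return np.exp(-0.5 * np.sum((grid - mu) ** 2, axis=1) / var)
    # Product of the two likelihoods; return the fused MAP location.
    post = gauss(audio_est, audio_var) * gauss(video_est, video_var)
    return grid[np.argmax(post)]

# Candidate locations on a 5 m x 5 m floor at 10 cm resolution.
xs, ys = np.meshgrid(np.arange(0, 5, 0.1), np.arange(0, 5, 0.1))
grid = np.column_stack([xs.ravel(), ys.ravel()])
fused = fuse_position(np.array([2.0, 3.1]), 0.5,    # audio: noisy estimate
                      np.array([2.3, 2.9]), 0.05,   # video: precise estimate
                      grid)
print(fused)  # lands close to the video estimate
```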