161 resultados para Audio signal

em Queensland University of Technology - ePrints Archive


Relevância:

70.00% 70.00%

Publicador:

Resumo:

Interacting with technology within a vehicle environment using a voice interface can greatly reduce the effects of driver distraction. Most current approaches to this problem only utilise the audio signal, making them susceptible to acoustic noise. An obvious approach to circumvent this is to use the visual modality in addition. However, capturing, storing and distributing audio-visual data in a vehicle environment is very costly and difficult. One current dataset available for such research is the AVICAR [1] database. Unfortunately this database is largely unusable due to timing mismatch between the two streams and in addition, no protocol is available. We have overcome this problem by re-synchronising the streams on the phone-number portion of the dataset and established a protocol for further research. This paper presents the first audio-visual results on this dataset for speaker-independent speech recognition. We hope this will serve as a catalyst for future research in this area.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Visual noise insensitivity is important to audio visual speech recognition (AVSR). Visual noise can take on a number of forms such as varying frame rate, occlusion, lighting or speaker variabilities. The use of a high dimensional secondary classifier on the word likelihood scores from both the audio and video modalities is investigated for the purposes of adaptive fusion. Preliminary results are presented demonstrating performance above the catastrophic fusion boundary for our confidence measure irrespective of the type of visual noise presented to it. Our experiments were restricted to small vocabulary applications.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Amphibian is an 10’00’’ musical work which explores new musical interfaces and approaches to hybridising performance practices from the popular music, electronic dance music and computer music traditions. The work is designed to be presented in a range of contexts associated with the electro-acoustic, popular and classical music traditions. The work is for two performers using two synchronised laptops, an electric guitar and a custom designed gestural interface for vocal performers - the e-Mic (Extended Mic-stand Interface Controller). This interface was developed by one of the co-authors, Donna Hewitt. The e-Mic allows a vocal performer to manipulate the voice in real time through the capture of physical gestures via an array of sensors - pressure, distance, tilt - along with ribbon controllers and an X-Y joystick microphone mount. Performance data are then sent to a computer, running audio-processing software, which is used to transform the audio signal from the microphone. In this work, data is also exchanged between performers via a local wireless network, allowing performers to work with shared data streams. The duo employs the gestural conventions of guitarist and singer (i.e. 'a band' in a popular music context), but transform these sounds and gestures into new digital music. The gestural language of popular music is deliberately subverted and taken into a new context. The piece thus explores the nexus between the sonic and performative practices of electro acoustic music and intelligent electronic dance music (‘idm’). This work was situated in the research fields of new musical interfacing, interaction design, experimental music composition and performance. The contexts in which the research was conducted were live musical performance and studio music production. The work investigated new methods for musical interfacing, performance data mapping, hybrid performance and compositional practices in electronic music. The research methodology was practice-led. New insights were gained from the iterative experimental workshopping of gestural inputs, musical data mapping, inter-performer data exchange, software patch design, data and audio processing chains. In respect of interfacing, there were innovations in the design and implementation of a novel sensor-based gestural interface for singers, the e-Mic, one of the only existing gestural controllers for singers. This work explored the compositional potential of sharing real time performance data between performers and deployed novel methods for inter-performer data exchange and mapping. As regards stylistic and performance innovation, the work explored and demonstrated an approach to the hybridisation of the gestural and sonic language of popular music with recent ‘post-digital’ approaches to laptop based experimental music The development of the work was supported by an Australia Council Grant. Research findings have been disseminated via a range of international conference publications, recordings, radio interviews (ABC Classic FM), broadcasts, and performances at international events and festivals. The work was curated into the major Australian international festival, Liquid Architecture, and was selected by an international music jury (through blind peer review) for presentation at the International Computer Music Conference in Belfast, N. Ireland.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Nodule is 19'54" musical work for two electronic music performers, two laptop computers and a custom built, sensor-based microphone controller - the e-Mic (Extended Mic-stand Interface Controller). This interface was developed by one of the co-authors, Donna Hewitt. The e-Mic allows a vocal performer to manipulate their voice in real time by capturing physical gestures via an array of sensors - pressure, distance, tilt – in addition to ribbon controllers and an X-Y joystick microphone mount. Performance data are then sent to a computer, running audio-processing software, which is used to transform the audio signal from the microphone in real time. The work seeks to explore the liminal space between the electro-acoustic music tradition and more recent developments in the electronic dance music tradition. It does so on both a performative (gestural) and compositional (sonic) level. Visually, the performance consists of a singer and a laptop performer, hybridising the gestural context of these traditions. On a sonic level, the work explores hybridity at deeper levels of the musical structure than simple bricolage or collage approaches. Hybridity is explored at the level of the sonic gesture (source material), in production (audio processing gestures), in performance gesture, and in approaches to the use of the frequency spectrum, pulse and meter. The work was designed to be performed in a range of contexts from concert halls, to clubs, to rock festivals, across a range of staging and production platforms. As a consequence, the work has been tested in a range of audience contexts, and has allowed the transportation of compositional and performance practices across traditional audience demographic boundaries.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The detection of voice activity is a challenging problem, especially when the level of acoustic noise is high. Most current approaches only utilise the audio signal, making them susceptible to acoustic noise. An obvious approach to overcome this is to use the visual modality. The current state-of-the-art visual feature extraction technique is one that uses a cascade of visual features (i.e. 2D-DCT, feature mean normalisation, interstep LDA). In this paper, we investigate the effectiveness of this technique for the task of visual voice activity detection (VAD), and analyse each stage of the cascade and quantify the relative improvement in performance gained by each successive stage. The experiments were conducted on the CUAVE database and our results highlight that the dynamics of the visual modality can be used to good effect to improve visual voice activity detection performance.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Speaker diarization is the process of annotating an input audio with information that attributes temporal regions of the audio signal to their respective sources, which may include both speech and non-speech events. For speech regions, the diarization system also specifies the locations of speaker boundaries and assign relative speaker labels to each homogeneous segment of speech. In short, speaker diarization systems effectively answer the question of ‘who spoke when’. There are several important applications for speaker diarization technology, such as facilitating speaker indexing systems to allow users to directly access the relevant segments of interest within a given audio, and assisting with other downstream processes such as summarizing and parsing. When combined with automatic speech recognition (ASR) systems, the metadata extracted from a speaker diarization system can provide complementary information for ASR transcripts including the location of speaker turns and relative speaker segment labels, making the transcripts more readable. Speaker diarization output can also be used to localize the instances of specific speakers to pool data for model adaptation, which in turn boosts transcription accuracies. Speaker diarization therefore plays an important role as a preliminary step in automatic transcription of audio data. The aim of this work is to improve the usefulness and practicality of speaker diarization technology, through the reduction of diarization error rates. In particular, this research is focused on the segmentation and clustering stages within a diarization system. Although particular emphasis is placed on the broadcast news audio domain and systems developed throughout this work are also trained and tested on broadcast news data, the techniques proposed in this dissertation are also applicable to other domains including telephone conversations and meetings audio. Three main research themes were pursued: heuristic rules for speaker segmentation, modelling uncertainty in speaker model estimates, and modelling uncertainty in eigenvoice speaker modelling. The use of heuristic approaches for the speaker segmentation task was first investigated, with emphasis placed on minimizing missed boundary detections. A set of heuristic rules was proposed, to govern the detection and heuristic selection of candidate speaker segment boundaries. A second pass, using the same heuristic algorithm with a smaller window, was also proposed with the aim of improving detection of boundaries around short speaker segments. Compared to single threshold based methods, the proposed heuristic approach was shown to provide improved segmentation performance, leading to a reduction in the overall diarization error rate. Methods to model the uncertainty in speaker model estimates were developed, to address the difficulties associated with making segmentation and clustering decisions with limited data in the speaker segments. The Bayes factor, derived specifically for multivariate Gaussian speaker modelling, was introduced to account for the uncertainty of the speaker model estimates. The use of the Bayes factor also enabled the incorporation of prior information regarding the audio to aid segmentation and clustering decisions. The idea of modelling uncertainty in speaker model estimates was also extended to the eigenvoice speaker modelling framework for the speaker clustering task. Building on the application of Bayesian approaches to the speaker diarization problem, the proposed approach takes into account the uncertainty associated with the explicit estimation of the speaker factors. The proposed decision criteria, based on Bayesian theory, was shown to generally outperform their non- Bayesian counterparts.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Acoustically, car cabins are extremely noisy and as a consequence audio-only, in-car voice recognition systems perform poorly. As the visual modality is immune to acoustic noise, using the visual lip information from the driver is seen as a viable strategy in circumventing this problem by using audio visual automatic speech recognition (AVASR). However, implementing AVASR requires a system being able to accurately locate and track the drivers face and lip area in real-time. In this paper we present such an approach using the Viola-Jones algorithm. Using the AVICAR [1] in-car database, we show that the Viola- Jones approach is a suitable method of locating and tracking the driver’s lips despite the visual variability of illumination and head pose for audio-visual speech recognition system.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Acoustically, car cabins are extremely noisy and as a consequence, existing audio-only speech recognition systems, for voice-based control of vehicle functions such as the GPS based navigator, perform poorly. Audio-only speech recognition systems fail to make use of the visual modality of speech (eg: lip movements). As the visual modality is immune to acoustic noise, utilising this visual information in conjunction with an audio only speech recognition system has the potential to improve the accuracy of the system. The field of recognising speech using both auditory and visual inputs is known as Audio Visual Speech Recognition (AVSR). Continuous research in AVASR field has been ongoing for the past twenty-five years with notable progress being made. However, the practical deployment of AVASR systems for use in a variety of real-world applications has not yet emerged. The main reason is due to most research to date neglecting to address variabilities in the visual domain such as illumination and viewpoint in the design of the visual front-end of the AVSR system. In this paper we present an AVASR system in a real-world car environment using the AVICAR database [1], which is publicly available in-car database and we show that the use of visual speech conjunction with the audio modality is a better approach to improve the robustness and effectiveness of voice-only recognition systems in car cabin environments.

Relevância:

30.00% 30.00%

Publicador: