925 results for Compressed speech
Abstract:
Keyword spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability. This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting. The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high-level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, in particular for the problematic short-duration target word class. The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains significant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds. The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.
The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.
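The dynamic sequence matching idea behind Dynamic Match Lattice Spotting can be illustrated with a minimal sketch. The abstract gives neither the cost function nor the lattice traversal, so the code below assumes a plain Levenshtein distance with unit costs between a target phone sequence and a single phone sequence read from a lattice. The names `edit_distance` and `matches`, the phone labels, and the `max_cost` threshold are hypothetical.

```python
def edit_distance(target, observed):
    """Levenshtein distance between two phone sequences (unit costs assumed)."""
    # prev[j] holds the distance between the target prefix processed so far
    # and the first j phones of the observed sequence.
    prev = list(range(len(observed) + 1))
    for i, t in enumerate(target, start=1):
        curr = [i]
        for j, o in enumerate(observed, start=1):
            cost = 0 if t == o else 1
            curr.append(min(
                prev[j] + 1,         # target phone missing from observed
                curr[j - 1] + 1,     # extra phone in observed
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def matches(target, observed, max_cost=1):
    """Accept a putative hit if it is within max_cost edits of the target."""
    return edit_distance(target, observed) <= max_cost
```

Allowing a small non-zero `max_cost` is what would give tolerance to recognition errors in the lattice; an exact-match search corresponds to `max_cost=0`.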
Abstract:
Automatic spoken Language Identification (LID) is the process of identifying the language spoken within an utterance. The challenge that this task presents is that no prior information is available indicating the content of the utterance or the identity of the speaker. The trend of globalization and the pervasive popularity of the Internet will amplify the need for the capabilities spoken language identification systems provide. A prominent application arises in call centers dealing with speakers speaking different languages. Another important application is to index or search huge speech data archives and corpora that contain multiple languages. The aim of this research is to develop techniques targeted at producing a fast and more accurate automatic spoken LID system compared to the previous National Institute of Standards and Technology (NIST) Language Recognition Evaluation. Acoustic and phonetic speech information are targeted as the most suitable features for representing the characteristics of a language. To model the acoustic speech features a Gaussian Mixture Model based approach is employed. Phonetic speech information is extracted using existing speech recognition technology. Various techniques to improve LID accuracy are also studied. One approach examined is the employment of Vocal Tract Length Normalization to reduce the speech variation caused by different speakers. A linear data fusion technique is adopted to combine the various aspects of information extracted from speech. As a result of this research, a LID system was implemented and presented for evaluation in the 2003 Language Recognition Evaluation conducted by the NIST.
Abstract:
In this paper we propose a new method for utilising phase information by complementing it with traditional magnitude-only spectral subtraction speech enhancement through Complex Spectrum Subtraction (CSS). The proposed approach has the following advantages over traditional magnitude-only spectral subtraction: (a) it introduces complementary information to the enhancement algorithm; (b) it reduces the total number of algorithmic parameters; and (c) it is designed for improving clean speech magnitude spectra and is therefore suitable for both automatic speech recognition (ASR) and speech perception applications. Oracle-based ASR experiments verify this approach, showing an average 20% relative improvement in word accuracy when accurate estimates of the phase spectrum are available. Based on sinusoidal analysis and assuming stationarity between observations (which is shown to be better approximated as the frame rate is increased), this paper also proposes a novel method for acquiring the phase information called Phase Estimation via Delay Projection (PEDEP). Further oracle ASR experiments validate the potential of the proposed PEDEP technique in ideal conditions. A realistic implementation of CSS with PEDEP shows performance comparable to state-of-the-art spectral subtraction techniques in a range of 15-20 dB signal-to-noise ratio environments. These results clearly demonstrate the potential for using phase spectra in spectral subtractive enhancement applications, and at the same time highlight the need for deriving more accurate phase estimates in a wider range of noise conditions.
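To make the contrast between magnitude-only and complex subtraction concrete, here is a minimal per-frame sketch. It is one plausible reading of the abstract and not the paper's formulation. The function names, the spectral `floor` parameter, and the toy spectra are assumptions. The noise phase is supplied as an oracle, which mirrors the oracle experiments mentioned above but does not implement PEDEP.

```python
import cmath
import math


def magnitude_subtract(noisy, noise_mag, floor=0.01):
    """Magnitude-only spectral subtraction: subtract the noise magnitude
    estimate in each bin and reuse the phase of the noisy spectrum."""
    out = []
    for y, n in zip(noisy, noise_mag):
        mag = max(abs(y) - n, floor * abs(y))  # floor avoids negative magnitudes
        out.append(cmath.rect(mag, cmath.phase(y)))
    return out


def complex_subtract(noisy, noise_mag, noise_phase):
    """Subtraction in the complex domain, assuming a noise phase estimate
    is available for every bin (an oracle in this toy example)."""
    return [y - cmath.rect(n, p)
            for y, n, p in zip(noisy, noise_mag, noise_phase)]


# Toy two-bin frame: noisy = clean + noise (values are illustrative only).
clean = [1 + 0j, 0 + 2j]
noise = [0 + 0.5j, -1 + 0j]
noisy = [c + n for c, n in zip(clean, noise)]
noise_mag = [abs(n) for n in noise]
noise_phase = [cmath.phase(n) for n in noise]

mag_only = magnitude_subtract(noisy, noise_mag)
with_phase = complex_subtract(noisy, noise_mag, noise_phase)
```

With an exact noise phase, the complex subtraction recovers the clean bins. The magnitude-only version only shrinks each bin along the noisy phase, which is why the quality of the phase estimate matters.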
Abstract:
In this paper, we present a microphone array beamforming approach to blind speech separation. Unlike previous beamforming approaches, our system does not require a priori knowledge of the microphone placement and speaker location, making the system directly comparable to other blind source separation methods which require no prior knowledge of recording conditions. Microphone location is automatically estimated using an assumed noise field model, and speaker locations are estimated using cross-correlation-based methods. The system is evaluated on the data provided for the PASCAL Speech Separation Challenge 2 (SSC2), achieving a word error rate of 58% on the evaluation set.
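Cross-correlation-based localisation typically starts from a time delay estimate between microphone pairs. The abstract does not say which variant the authors use, so the following is only a sketch of the basic idea: choose the integer lag that maximises the raw cross-correlation of two channels. The function name, the `max_lag` parameter, and the sample values are hypothetical.

```python
def estimate_delay(ref, sig, max_lag):
    """Return the lag (in samples) at which `sig` best aligns with `ref`.

    A positive result means `sig` is a delayed copy of `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t, r in enumerate(ref):
            k = t + lag
            if 0 <= k < len(sig):  # ignore samples that fall outside sig
                score += r * sig[k]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example: the second channel is the first one delayed by two samples.
mic_a = [0, 1, 2, 3, 0, 0]
mic_b = [0, 0, 0, 1, 2, 3]
lag = estimate_delay(mic_a, mic_b, max_lag=3)
```

Given delays for several microphone pairs and the array geometry, a source direction or position could then be triangulated. That step is omitted here.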
Abstract:
Voice recognition is one of the key enablers to reduce driver distraction as in-vehicle systems become more and more complex. With the integration of voice recognition in vehicles, safety and usability are improved as the driver’s eyes and hands are not required to operate system controls. Whilst speaker-independent voice recognition is well developed, performance in high-noise environments (e.g. vehicles) is still limited. La Trobe University and Queensland University of Technology have developed a low-cost hardware-based speech enhancement system for automotive environments based on spectral subtraction and delay–sum beamforming techniques. The enhancement algorithms have been optimised using authentic Australian English collected under typical driving conditions. Performance tests conducted using speech data collected under a variety of vehicle noise conditions demonstrate a word recognition rate improvement in the order of 10% or more under the noisiest conditions. The system is currently at a proof-of-concept stage, and there is potential for even greater performance improvement.
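For readers unfamiliar with delay-sum beamforming, a minimal time-domain sketch follows. It assumes integer sample delays that are already known for each microphone, and it simply aligns and averages the channels. It says nothing about the hardware system described in the abstract. The function name and the sample values are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its known delay (in samples) and average.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   non-negative integer lag of each channel relative to the
              earliest-arriving one.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    out = []
    for t in range(length):
        # Advance each channel by its own delay so the target adds coherently.
        total = sum(ch[t + d] for ch, d in zip(channels, delays))
        out.append(total / len(channels))
    return out


# Toy example: the same pulse reaches the second microphone two samples later.
mic_a = [0, 1, 2, 3, 0, 0]
mic_b = [0, 0, 0, 1, 2, 3]
enhanced = delay_and_sum([mic_a, mic_b], delays=[0, 2])
```

The aligned speech adds coherently while uncorrelated noise partially averages out, which is the source of the signal-to-noise gain.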
Abstract:
Compressed natural gas (CNG) engines are thought to be less harmful to the environment than conventional diesel engines, especially in terms of particle emissions. Although this is true with respect to particulate matter (PM) emissions, results of particle number (PN) emission comparisons have been inconclusive. In this study, results of on-road and dynamometer studies of buses were used to derive several important conclusions. We show that, although PN emissions from CNG buses are significantly lower than from diesel buses at low engine power, they become comparable at high power. For diesel buses, PN emissions are not significantly different between acceleration and operation at steady maximum power. However, the corresponding PN emissions from CNG buses when accelerating are an order of magnitude greater than when operating at steady maximum power. During acceleration under heavy load, PN emissions from CNG buses are an order of magnitude higher than from diesel buses. The particles emitted from CNG buses are too small to contribute to PM10 emissions or to a reduction of visibility, and may consist of semivolatile nanoparticles.
Abstract:
Motor vehicle emission factors are generally derived from driving tests mimicking steady-state conditions or transient drive cycles. However, neither of these test conditions completely represents real-world driving conditions. In particular, they fail to determine emissions generated during the accelerating phase, a condition in which urban buses spend much of their time. In this study we analyse and compare the results of time-dependent emission measurements conducted on diesel and compressed natural gas (CNG) buses during an urban driving cycle on a chassis dynamometer, and we derive power-law expressions relating carbon dioxide (CO2) emission factors to the instantaneous speed while accelerating from rest. Emissions during acceleration are compared with those during steady-speed operation. These results have important implications for emission modelling, particularly under congested traffic conditions.
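A power-law relation of the kind mentioned above can be written as EF = a * v^b and fitted by ordinary least squares on the logarithms of both variables. The sketch below assumes that functional form and fitting method. The abstract reports neither the coefficients nor the procedure, and the data here are synthetic, not measurements from the study.

```python
import math


def fit_power_law(speeds, emission_factors):
    """Fit EF = a * v**b by linear regression in log-log space.

    Both inputs must be strictly positive. Returns the pair (a, b).
    """
    lx = [math.log(v) for v in speeds]
    ly = [math.log(e) for e in emission_factors]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                        # slope of the log-log line = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept gives the prefactor
    return a, b


# Synthetic check: data generated from a known law should be recovered.
speeds = [5.0, 10.0, 20.0, 40.0]
synthetic_ef = [2.0 * v ** 1.5 for v in speeds]
a, b = fit_power_law(speeds, synthetic_ef)
```

Fitting in log space weights relative rather than absolute errors. That is a design choice, and a direct nonlinear fit would weight high-speed points more heavily.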
Abstract:
Interacting with technology within a vehicle environment using a voice interface can greatly reduce the effects of driver distraction. Most current approaches to this problem only utilise the audio signal, making them susceptible to acoustic noise. An obvious approach to circumvent this is to use the visual modality in addition. However, capturing, storing and distributing audio-visual data in a vehicle environment is very costly and difficult. One current dataset available for such research is the AVICAR [1] database. Unfortunately, this database is largely unusable due to a timing mismatch between the two streams; in addition, no protocol is available. We have overcome this problem by re-synchronising the streams on the phone-number portion of the dataset and establishing a protocol for further research. This paper presents the first audio-visual results on this dataset for speaker-independent speech recognition. We hope this will serve as a catalyst for future research in this area.
Abstract:
This paper presents a framework for performing real-time recursive estimation of landmarks’ visual appearance. Imaging data in its original high-dimensional space is probabilistically mapped to a compressed low-dimensional space through the definition of likelihood functions. The likelihoods are subsequently fused with prior information using a Bayesian update. This process produces a probabilistic estimate of the low-dimensional representation of the landmark visual appearance. The overall filtering provides information complementary to the conventional position estimates, which is used to enhance data association. In addition to robotic observations, the filter integrates human observations in the appearance estimates. The appearance tracks as computed by the filter allow landmark classification. The set of labels involved in the classification task is thought of as an observation space where human observations are made by selecting a label. The low-dimensional appearance estimates returned by the filter allow for low-cost communication in low-bandwidth sensor networks. Deployment of the filter in such a network is demonstrated in an outdoor mapping application involving a human operator, a ground vehicle and an air vehicle.
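The recursive fusion of likelihoods with a prior can be sketched for the simplest case, a discrete set of landmark labels. This is only an illustration of a Bayesian update over labels. The paper's low-dimensional appearance representation and its likelihood functions are not given in the abstract, and the labels and probabilities below are invented for the example.

```python
def bayes_update(prior, likelihood):
    """Fuse a prior over labels with an observation likelihood.

    Both arguments map label -> non-negative number. The result is the
    normalised posterior, proportional to prior * likelihood.
    """
    unnormalised = {label: prior[label] * likelihood[label] for label in prior}
    evidence = sum(unnormalised.values())
    if evidence == 0:
        raise ValueError("observation has zero probability under the prior")
    return {label: p / evidence for label, p in unnormalised.items()}


# Hypothetical labels and numbers: a robot observation followed by a human
# operator selecting the label "tree", modelled here as a second likelihood.
belief = {"tree": 0.5, "pole": 0.5}
robot_likelihood = {"tree": 0.8, "pole": 0.2}
human_likelihood = {"tree": 0.8, "pole": 0.2}

belief = bayes_update(belief, robot_likelihood)
belief_after_human = bayes_update(belief, human_likelihood)
```

Because only the small posterior table needs to be exchanged between nodes, an update of this kind is cheap to communicate, in line with the low-bandwidth motivation in the abstract.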