682 resultados para voice activity detection

em Queensland University of Technology - ePrints Archive


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The detection of voice activity is a challenging problem, especially when the level of acoustic noise is high. Most current approaches only utilise the audio signal, making them susceptible to acoustic noise. An obvious approach to overcome this is to use the visual modality. The current state-of-the-art visual feature extraction technique is one that uses a cascade of visual features (i.e. 2D-DCT, feature mean normalisation, interstep LDA). In this paper, we investigate the effectiveness of this technique for the task of visual voice activity detection (VAD), and analyse each stage of the cascade and quantify the relative improvement in performance gained by each successive stage. The experiments were conducted on the CUAVE database and our results highlight that the dynamics of the visual modality can be used to good effect to improve visual voice activity detection performance.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The QUT-NOISE-TIMIT corpus consists of 600 hours of noisy speech sequences designed to enable a thorough evaluation of voice activity detection (VAD) algorithms across a wide variety of common background noise scenarios. In order to construct the final mixed-speech database, a collection of over 10 hours of background noise was conducted across 10 unique locations covering 5 common noise scenarios, to create the QUT-NOISE corpus. This background noise corpus was then mixed with speech events chosen from the TIMIT clean speech corpus over a wide variety of noise lengths, signal-to-noise ratios (SNRs) and active speech proportions to form the mixed-speech QUT-NOISE-TIMIT corpus. The evaluation of five baseline VAD systems on the QUT-NOISE-TIMIT corpus is conducted to validate the data and show that the variety of noise available will allow for better evaluation of VAD systems than existing approaches in the literature.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

For several reasons, the Fourier phase domain is less favored than the magnitude domain in signal processing and modeling of speech. To correctly analyze the phase, several factors must be considered and compensated, including the effect of the step size, windowing function and other processing parameters. Building on a review of these factors, this paper investigates a spectral representation based on the Instantaneous Frequency Deviation, but in which the step size between processing frames is used in calculating phase changes, rather than the traditional single sample interval. Reflecting these longer intervals, the term delta-phase spectrum is used to distinguish this from instantaneous derivatives. Experiments show that mel-frequency cepstral coefficients features derived from the delta-phase spectrum (termed Mel-Frequency delta-phase features) can produce broadly similar performance to equivalent magnitude domain features for both voice activity detection and speaker recognition tasks. Further, it is shown that the fusion of the magnitude and phase representations yields performance benefits over either in isolation.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper presents a method of voice activity detection (VAD) suitable for high noise scenarios, based on the fusion of two complementary systems. The first system uses a proposed non-Gaussianity score (NGS) feature based on normal probability testing. The second system employs a histogram distance score (HDS) feature that detects changes in the signal through conducting a template-based similarity measure between adjacent frames. The decision outputs by the two systems are then merged using an open-by-reconstruction fusion stage. Accuracy of the proposed method was compared to several baseline VAD methods on a database created using real recordings of a variety of high-noise environments.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper presents a method of voice activity detection (VAD) for high noise scenarios, using a noise robust voiced speech detection feature. The developed method is based on the fusion of two systems. The first system utilises the maximum peak of the normalised time-domain autocorrelation function (MaxPeak). The second zone system uses a novel combination of cross-correlation and zero-crossing rate of the normalised autocorrelation to approximate a measure of signal pitch and periodicity (CrossCorr) that is hypothesised to be noise robust. The score outputs by the two systems are then merged using weighted sum fusion to create the proposed autocorrelation zero-crossing rate (AZR) VAD. Accuracy of AZR was compared to state of the art and standardised VAD methods and was shown to outperform the best performing system with an average relative improvement of 24.8% in half-total error rate (HTER) on the QUT-NOISE-TIMIT database created using real recordings from high-noise environments.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Visual activity detection of lip movements can be used to overcome the poor performance of voice activity detection based solely in the audio domain, particularly in noisy acoustic conditions. However, most of the research conducted in visual voice activity detection (VVAD) has neglected addressing variabilities in the visual domain such as viewpoint variation. In this paper we investigate the effectiveness of the visual information from the speaker’s frontal and profile views (i.e left and right side views) for the task of VVAD. As far as we are aware, our work constitutes the first real attempt to study this problem. We describe our visual front end approach and the Gaussian mixture model (GMM) based VVAD framework, and report the experimental results using the freely available CUAVE database. The experimental results show that VVAD is indeed possible from profile views and we give a quantitative comparison of VVAD based on frontal and profile views The results presented are useful in the development of multi-modal Human Machine Interaction (HMI) using a single camera, where the speaker’s face may not always be frontal.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We propose a novel technique for conducting robust voice activity detection (VAD) in high-noise recordings. We use Gaussian mixture modeling (GMM) to train two generic models; speech and non-speech. We then score smaller segments of a given (unseen) recording against each of these GMMs to obtain two respective likelihood scores for each segment. These scores are used to compute a dissimilarity measure between pairs of segments and to carry out complete-linkage clustering of the segments into speech and non-speech clusters. We compare the accuracy of our method against state-of-the-art and standardised VAD techniques to demonstrate an absolute improvement of 15% in half-total error rate (HTER) over the best performing baseline system and across the QUT-NOISE-TIMIT database. We then apply our approach to the Audio-Visual Database of American English (AVDBAE) to demonstrate the performance of our algorithm in using visual, audio-visual or a proposed fusion of these features.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Telomerase is an extremely important enzyme required for the immortalisation of tumour cells. Because the gene is activated in the vast majority of tumour tissues and remains unused in most somatic cells, it represents a marker with huge diagnostic, prognostic and treatment implications in cancer. This article summarises the basic structure and functions of telomerase and considers its clinical implications in colorectal and other cancers.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The QUT-NOISE-SRE protocol is designed to mix the large QUT-NOISE database, consisting of over 10 hours of back- ground noise, collected across 10 unique locations covering 5 common noise scenarios, with commonly used speaker recognition datasets such as Switchboard, Mixer and the speaker recognition evaluation (SRE) datasets provided by NIST. By allowing common, clean, speech corpora to be mixed with a wide variety of noise conditions, environmental reverberant responses, and signal-to-noise ratios, this protocol provides a solid basis for the development, evaluation and benchmarking of robust speaker recognition algorithms, and is freely available to download alongside the QUT-NOISE database. In this work, we use the QUT-NOISE-SRE protocol to evaluate a state-of-the-art PLDA i-vector speaker recognition system, demonstrating the importance of designing voice-activity-detection front-ends specifically for speaker recognition, rather than aiming for perfect coherence with the true speech/non-speech boundaries.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper presents a system to analyze long field recordings with low signal-to-noise ratio (SNR) for bio-acoustic monitoring. A method based on spectral peak track, Shannon entropy, harmonic structure and oscillation structure is proposed to automatically detect anuran (frog) calling activity. Gaussian mixture model (GMM) is introduced for modelling those features. Four anuran species widespread in Queensland, Australia, are selected to evaluate the proposed system. A visualization method based on extracted indices is employed for detection of anuran calling activity which achieves high accuracy.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Deep convolutional network models have dominated recent work in human action recognition as well as image classification. However, these methods are often unduly influenced by the image background, learning and exploiting the presence of cues in typical computer vision datasets. For unbiased robotics applications, the degree of variation and novelty in action backgrounds is far greater than in computer vision datasets. To address this challenge, we propose an “action region proposal” method that, informed by optical flow, extracts image regions likely to contain actions for input into the network both during training and testing. In a range of experiments, we demonstrate that manually segmenting the background is not enough; but through active action region proposals during training and testing, state-of-the-art or better performance can be achieved on individual spatial and temporal video components. Finally, we show by focusing attention through action region proposals, we can further improve upon the existing state-of-the-art in spatio-temporally fused action recognition performance.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The construction of timelines of computer activity is a part of many digital investigations. These timelines of events are composed of traces of historical activity drawn from system logs and potentially from evidence of events found in the computer file system. A potential problem with the use of such information is that some of it may be inconsistent and contradictory thus compromising its value. This work introduces a software tool (CAT Detect) for the detection of inconsistency within timelines of computer activity. We examine the impact of deliberate tampering through experiments conducted with our prototype software tool. Based on the results of these experiments, we discuss techniques which can be employed to deal with such temporal inconsistencies.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The objective of the research was to determine the optimal location and method of attachment for accelerometer-based motion sensors, and to validate their ability to differentiate rest and increases in speed in healthy dogs moving on a treadmill. Two accelerometers were placed on a harness between the scapulae of dogs with one in a pouch and one directly attached to the harness. Two additional accelerometers were placed (pouched and not pouched) ventrally on the dog's collar. Data were recorded in 1. s epochs with dogs moving in stages lasting 3. min each on a treadmill: (1) at rest, lateral recumbency, (2) treadmill at 0% slope, 3. km/h, (3) treadmill at 0% slope, 5. km/h, (4) treadmill at 0% slope, 7. km/h, (5) treadmill at 5% slope, 5. km/h, and; (6) treadmill at 5% slope, 7. km/h. Only the harness with the accelerometer in a pouch along the dorsal midline yielded statistically significant increases (P< 0.05) in vector magnitude as walking speed of the dogs increased (5-7. km/h) while on the treadmill. Statistically significant increases in vector magnitude were detected in the dogs as the walking speed increased from 5 to 7. km/h, however, changes in vector magnitude were not detected when activity intensity was increased as a result of walking up a 5% grade. Accelerometers are a valid and objective tool able to discriminate between and monitor different levels of activity in dogs in terms of speed of movement but not in energy expenditure that occurs with movement up hill.