932 results for Speech and pioneering sports Colima
Abstract:
We present a novel approach to represent transients using spectral-domain amplitude-modulated/frequency-modulated (AM-FM) functions. The model is applied to the real and imaginary parts of the Fourier transform (FT) of the transient. The model is suitable because transients are well localized in time, so the real and imaginary parts of their Fourier spectrum have a modulation structure. The spectral AM is the envelope and the spectral FM is the group delay function. The group delay is estimated using spectral zero-crossings and the spectral envelope is estimated using a coherent demodulator. We show that the proposed technique is robust to additive noise. We present applications of the proposed technique to castanets and stop-consonants in speech.
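The link between time localization and spectral modulation can be illustrated with an idealized case. The sketch below assumes the transient is a single delayed impulse, so the real part of its spectrum is a cosine in frequency whose zero-crossing spacing is pi divided by the delay. The function names and the uniform frequency grid are illustrative assumptions; this is not the estimator from the abstract, which handles general transients.

```python
import math


def real_spectrum(x, num_freqs):
    """Real part of the Fourier transform of x on a uniform grid over [0, pi)."""
    step = math.pi / num_freqs
    return [
        sum(v * math.cos(k * step * n) for n, v in enumerate(x))
        for k in range(num_freqs)
    ]


def zero_crossings(values, step):
    """Frequencies where `values` changes sign, refined by linear interpolation."""
    crossings = []
    for k in range(len(values) - 1):
        a, b = values[k], values[k + 1]
        if a == 0.0:
            crossings.append(k * step)
        elif a * b < 0.0:
            crossings.append((k + a / (a - b)) * step)
    return crossings


def delay_from_zero_crossings(x, num_freqs=2048):
    """Estimate the delay (in samples) of an impulse-like transient."""
    step = math.pi / num_freqs
    zc = zero_crossings(real_spectrum(x, num_freqs), step)
    if len(zc) < 2:
        raise ValueError("need at least two spectral zero-crossings")
    # For cos(omega * delay), consecutive zeros are pi / delay apart.
    mean_spacing = (zc[-1] - zc[0]) / (len(zc) - 1)
    return math.pi / mean_spacing


# Hypothetical test transient: a unit impulse delayed by 20 samples.
transient = [0.0] * 64
transient[20] = 1.0
estimated_delay = delay_from_zero_crossings(transient)
```

For this impulse the estimate should land very close to 20 samples. A real transient has a frequency-dependent group delay, so the local spacing of the zero-crossings would vary across the spectrum.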
Abstract:
Narrowband spectrograms of voiced speech can be modeled as the outcome of a two-dimensional (2-D) modulation process. In this paper, we develop a demodulation algorithm to estimate the 2-D amplitude modulation (AM) and carrier of a given spectrogram patch. The demodulation algorithm is based on the Riesz transform, which is a unitary, shift-invariant operator and is obtained as a 2-D extension of the well-known 1-D Hilbert transform operator. Existing methods for spectrogram demodulation rely on an extension of the sinusoidal demodulation method from the communications literature and require a precise estimate of the 2-D carrier. In contrast, the proposed method based on the Riesz transform does not require a carrier estimate. The proposed method and the sinusoidal demodulation scheme are tested on real speech data. Experimental results show that the demodulated AM and carrier from Riesz demodulation represent the spectrogram patch more accurately than those obtained using sinusoidal demodulation. The signal-to-reconstruction error ratio was found to be about 2 to 6 dB higher for the proposed demodulation approach.
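A minimal sketch of Riesz-transform demodulation follows, assuming a small patch, periodic boundaries, and mean removal before the transform. The Riesz components are computed in the frequency domain with the multipliers -j*u/|w| and -j*v/|w|, and the AM is taken as the magnitude of the three-component (patch, Riesz-1, Riesz-2) vector. The function names and the slow pure-Python DFT are illustrative choices; the abstract does not give implementation details.

```python
import cmath
import math


def dft2(grid, inverse=False):
    """Plain 2-D DFT (rows, then columns); adequate for small patches."""
    sign = 1 if inverse else -1

    def dft1(seq):
        n = len(seq)
        out = [
            sum(seq[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)
        ]
        return [v / n for v in out] if inverse else out

    rows, cols = len(grid), len(grid[0])
    by_rows = [dft1(row) for row in grid]
    by_cols = [dft1([by_rows[i][j] for i in range(rows)]) for j in range(cols)]
    return [[by_cols[j][i] for j in range(cols)] for i in range(rows)]


def signed_frequency(k, n):
    """Map DFT bin k to a signed frequency in cycles per sample."""
    return (k if k < n / 2 else k - n) / n


def riesz_demodulate(patch):
    """Return (am, carrier) estimates for a real-valued 2-D patch."""
    rows, cols = len(patch), len(patch[0])
    mean = sum(map(sum, patch)) / (rows * cols)
    centered = [[v - mean for v in row] for row in patch]
    spectrum = dft2(centered)
    r1_spec = [[0j] * cols for _ in range(rows)]
    r2_spec = [[0j] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            u, v = signed_frequency(i, rows), signed_frequency(j, cols)
            mag = math.hypot(u, v)
            if mag > 0.0:  # the DC bin is left at zero
                r1_spec[i][j] = -1j * (u / mag) * spectrum[i][j]
                r2_spec[i][j] = -1j * (v / mag) * spectrum[i][j]
    r1 = dft2(r1_spec, inverse=True)
    r2 = dft2(r2_spec, inverse=True)
    am = [
        [
            math.sqrt(centered[i][j] ** 2 + r1[i][j].real ** 2 + r2[i][j].real ** 2)
            for j in range(cols)
        ]
        for i in range(rows)
    ]
    carrier = [
        [centered[i][j] / am[i][j] if am[i][j] > 0.0 else 0.0 for j in range(cols)]
        for i in range(rows)
    ]
    return am, carrier


# Hypothetical patch: a 2-D cosine carrier with constant amplitude 3.
SIZE = 16
patch = [
    [3.0 * math.cos(2 * math.pi * (2 * i + 3 * j) / SIZE) for j in range(SIZE)]
    for i in range(SIZE)
]
am, carrier = riesz_demodulate(patch)
```

Note that no carrier frequency is passed in: the amplitude is recovered from the patch and its two Riesz components alone, which is the property the abstract contrasts with sinusoidal demodulation.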
Binaural Signal Processing Motivated Generalized Analytic Signal Construction and AM-FM Demodulation
Abstract:
Binaural hearing studies show that the auditory system uses the phase-difference information in the auditory stimuli for localization of a sound source. Motivated by this finding, we present a method for demodulation of amplitude-modulated-frequency-modulated (AM-FM) signals using a signal and its arbitrary phase-shifted version. The demodulation is achieved using two allpass filters, whose impulse responses are related through the fractional Hilbert transform (FrHT). The allpass filters are obtained by cosine-modulation of a zero-phase flat-top prototype halfband lowpass filter. The outputs of the filters are combined to construct an analytic signal (AS) from which the AM and FM are estimated. We show that, under certain assumptions on the signal and the filter structures, the AM and FM can be obtained exactly. The AM-FM calculations are based on the quasi-eigenfunction approximation. We then extend the concept to the demodulation of multicomponent signals using uniform and non-uniform cosine-modulated filterbank (FB) structures consisting of flat bandpass filters, including the uniform cosine-modulated, equivalent rectangular bandwidth (ERB), and constant-Q filterbanks. We validate the theoretical calculations on synthesized AM-FM signals and compare the performance in the presence of noise with three other multiband demodulation techniques, namely, the Teager-energy-based approach, Gabor's AS approach, and the linear transduction filter approach. We also show demodulation results for real signals.
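For orientation, the sketch below shows demodulation with Gabor's analytic signal, one of the baselines named in the abstract, rather than the proposed FrHT allpass-filter pair. It assumes an even-length, single-component signal and uses a direct DFT for self-containment; all function names are illustrative.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(n^2) DFT; adequate for short illustrative signals."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def analytic_signal(x):
    """Gabor's analytic signal: suppress negative frequencies, double positive ones."""
    n = len(x)
    if n % 2:
        raise ValueError("this sketch assumes an even-length signal")
    gain = [0.0] * n
    gain[0] = gain[n // 2] = 1.0  # DC and Nyquist bins are kept once
    for k in range(1, n // 2):
        gain[k] = 2.0
    return dft([g * s for g, s in zip(gain, dft(x))], inverse=True)


def demodulate(x):
    """Return (am, fm): envelope and instantaneous frequency in rad/sample."""
    z = analytic_signal(x)
    am = [abs(v) for v in z]
    # The phase of z[n+1] * conj(z[n]) is the phase increment per sample.
    fm = [cmath.phase(b * a.conjugate()) for a, b in zip(z, z[1:])]
    return am, fm


# Hypothetical test signal: slow AM on a carrier at pi/4 rad/sample.
N = 256
envelope = [1.0 + 0.5 * math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
signal = [envelope[n] * math.cos(2 * math.pi * 32 * n / N) for n in range(N)]
am, fm = demodulate(signal)
```

On this periodic test signal the envelope and the carrier frequency are recovered essentially exactly, because the positive-frequency content of the signal equals the envelope times a complex exponential.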
Abstract:
The Grating Compression Transform (GCT) is a two-dimensional analysis of the speech signal that has been shown to be effective for multi-pitch tracking in speech mixtures. Multi-pitch tracking methods using GCT apply a Kalman filter framework to obtain pitch tracks, which requires training the filter parameters on true pitch tracks. We propose an unsupervised method for obtaining multiple pitch tracks. In the proposed method, multiple pitch tracks are modeled using time-varying means of a Gaussian mixture model (GMM), referred to as TVGMM. The TVGMM parameters are estimated using multiple pitch values at each frame in a given utterance obtained from different patches of the spectrogram using GCT. We evaluate the performance of the proposed method on all-voiced speech mixtures as well as random speech mixtures having well-separated and close pitch tracks. TVGMM achieves multi-pitch tracking with 51% and 53% of the multi-pitch estimates having error <= 20% for random mixtures and all-voiced mixtures, respectively. TVGMM also yields a lower root-mean-squared error in pitch track estimation than Kalman filtering.
Abstract:
We address the problem of designing an optimal pointwise shrinkage estimator in the transform domain, based on the minimum probability of error (MPE) criterion. We assume an additive model for the noise corrupting the clean signal. The proposed formulation is general in the sense that it can handle various noise distributions. We consider several noise distributions (Gaussian, Student's-t, and Laplacian) and compare the denoising performance of the resulting estimator with that of mean-squared error (MSE)-based estimators. The MSE optimization is carried out using an unbiased estimator of the MSE, namely Stein's Unbiased Risk Estimate (SURE). Experimental results show that the MPE estimator outperforms the SURE estimator in terms of SNR of the denoised output, for low (0 to 10 dB) and medium (10 to 20 dB) values of the input SNR.
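The SURE baseline mentioned above can be sketched for one common pointwise shrinkage rule. The example assumes soft-thresholding and Gaussian noise of known standard deviation, and picks the threshold that minimizes the SURE expression N*sigma^2 - 2*sigma^2*#{|y_i| <= t} + sum(min(|y_i|, t)^2). The choice of soft-thresholding, the function names, and the synthetic data are assumptions for illustration; the MPE estimator itself is not reproduced.

```python
import random


def soft_threshold(y, t):
    """Pointwise soft-thresholding: shrink each value toward zero by t."""
    return [(abs(v) - t) * (1.0 if v > 0 else -1.0) if abs(v) > t else 0.0 for v in y]


def sure_soft(y, t, sigma):
    """Stein's unbiased risk estimate for soft-thresholding at threshold t."""
    below = sum(1 for v in y if abs(v) <= t)
    clipped = sum(min(abs(v), t) ** 2 for v in y)
    return len(y) * sigma ** 2 - 2.0 * sigma ** 2 * below + clipped


def sure_threshold(y, sigma):
    """Search the candidate thresholds {0} and {|y_i|} for the SURE minimizer."""
    candidates = [0.0] + sorted(abs(v) for v in y)
    return min(candidates, key=lambda t: sure_soft(y, t, sigma))


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Hypothetical sparse coefficients in unit-variance Gaussian noise.
rng = random.Random(0)
clean = [5.0 if i % 20 == 0 else 0.0 for i in range(1000)]
noisy = [c + rng.gauss(0.0, 1.0) for c in clean]
threshold = sure_threshold(noisy, sigma=1.0)
denoised = soft_threshold(noisy, threshold)
```

The threshold is chosen from the noisy data alone, without access to `clean`, which is what makes an unbiased risk estimate useful in practice.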
Abstract:
We propose a two-dimensional (2-D) multicomponent amplitude-modulation, frequency-modulation (AM-FM) model for a spectrogram patch corresponding to voiced speech, and develop a new demodulation algorithm to effectively separate the AM, which is related to the vocal tract response, and the carrier, which is related to the excitation. The demodulation algorithm is based on the Riesz transform and is developed along the lines of Hilbert-transform-based demodulation for 1-D AM-FM signals. We compare the performance of the Riesz transform technique with that of the sinusoidal demodulation technique on real speech data. Experimental results show that the Riesz-transform-based demodulation technique represents spectrogram patches accurately. The spectrograms reconstructed from the demodulated AM and carrier are inverted and the corresponding speech signal is synthesized. The signal-to-noise ratio (SNR) of the reconstructed speech signal, with respect to clean speech, was found to be 2 to 4 dB higher for the Riesz transform technique than for the sinusoidal demodulation technique.
Abstract:
Oversmoothing of speech parameter trajectories is one of the causes of quality degradation in HMM-based speech synthesis. Various methods have been proposed to overcome this effect, the most recent ones being global variance (GV) and modulation-spectrum-based post-filter (MSPF). However, there is still a significant quality gap between natural and synthesized speech. In this paper, we propose a two-fold post-filtering technique to alleviate, to a certain extent, the oversmoothing of spectral and excitation parameter trajectories in HMM-based speech synthesis. For the spectral parameters, we propose a sparse coding-based post-filter to match the trajectories of synthetic speech to those of natural speech, and for the excitation trajectory, we introduce a perceptually motivated post-filter. Experimental evaluations show quality improvement compared with existing methods.
Abstract:
In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. They can be used either to represent different genres or tasks found in diverse text sources, or to capture stochastic properties of different linguistic symbol sequences, for example, syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper, an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, minimal change to decoding tools is needed. A wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history dependently adapted multi-level LM modelling both syllable and word sequences. ©2010 IEEE.
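As background for what "combining multiple n-gram models" can mean, the sketch below shows plain linear interpolation of component LMs stored as probability tables. It assumes every component covers the same histories and that the weights sum to one; the table layout, names, and numbers are hypothetical, and the WFST-based combination from the abstract is not reproduced.

```python
def interpolate(models, weights):
    """Linearly interpolate LMs given as {history: {word: probability}} tables.

    Assumes each component defines a distribution for every history it lists
    and that all components share the same set of histories.
    """
    if len(models) != len(weights):
        raise ValueError("need one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    combined = {}
    for model, weight in zip(models, weights):
        for history, distribution in model.items():
            target = combined.setdefault(history, {})
            for word, prob in distribution.items():
                target[word] = target.get(word, 0.0) + weight * prob
    return combined


# Hypothetical bigram tables for two genres.
news_lm = {"the": {"market": 0.7, "game": 0.3}}
sports_lm = {"the": {"market": 0.1, "game": 0.9}}
mixed_lm = interpolate([news_lm, sports_lm], [0.25, 0.75])
```

With these weights the mixed model assigns probability 0.25 to "market" and 0.75 to "game" after "the", and the result is still a proper distribution.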
Abstract:
We present methods for fixed-lag smoothing using sequential importance sampling (SIS) on a discrete non-linear, non-Gaussian state space system with unknown parameters. Our particular application is in the field of digital communication systems. Each input data point is taken from a finite set of symbols. We represent the transmission medium as a fixed filter with a finite impulse response (FIR), hence a discrete state-space system is formed. Conventional Markov chain Monte Carlo (MCMC) techniques such as the Gibbs sampler are unsuitable for this task because they can only process a batch of data. Data arrives sequentially, so it would seem sensible to process it in this way. In addition, many communication systems are interactive, so there is a maximum level of latency that can be tolerated before a symbol is decoded. We demonstrate this method by simulation and compare its performance to existing techniques.
Abstract:
We develop methods for performing filtering and smoothing in non-linear non-Gaussian dynamical models. The methods rely on a particle cloud representation of the filtering distribution which evolves through time using importance sampling and resampling ideas. In particular, novel techniques are presented for generation of random realisations from the joint smoothing distribution and for MAP estimation of the state sequence. Realisations of the smoothing distribution are generated in a forward-backward procedure, while the MAP estimation procedure can be performed in a single forward pass of the Viterbi algorithm applied to a discretised version of the state space. An application to spectral estimation for time-varying autoregressions is described.
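The particle cloud idea can be sketched with a basic bootstrap particle filter: propagate particles through the dynamics, weight them by the observation likelihood, then resample. The toy model below is a linear Gaussian system chosen only so the result is easy to check; the model, its parameters, and all names are assumptions, and the smoothing and MAP procedures from the abstract are not shown. The callbacks accept non-linear or non-Gaussian models unchanged.

```python
import math
import random


def bootstrap_particle_filter(observations, num_particles, init, transition, likelihood, rng):
    """Return the filtered posterior-mean estimate at each time step."""
    particles = [init(rng) for _ in range(num_particles)]
    means = []
    for y in observations:
        # Propagate each particle through the state dynamics (the proposal).
        particles = [transition(x, rng) for x in particles]
        # Importance weights are proportional to the observation likelihood.
        weights = [likelihood(y, x) for x in particles]
        total = sum(weights)
        if total == 0.0:  # numerical underflow: fall back to uniform weights
            weights = [1.0] * num_particles
            total = float(num_particles)
        means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Multinomial resampling keeps the cloud from degenerating.
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return means


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Hypothetical toy model: x_t = 0.9 * x_{t-1} + v_t, y_t = x_t + w_t.
PROCESS_STD, OBS_STD = 1.0, 2.0
rng = random.Random(0)
states, observations = [], []
x = 0.0
for _ in range(200):
    x = 0.9 * x + rng.gauss(0.0, PROCESS_STD)
    states.append(x)
    observations.append(x + rng.gauss(0.0, OBS_STD))

estimates = bootstrap_particle_filter(
    observations,
    300,
    init=lambda r: r.gauss(0.0, 1.0),
    transition=lambda s, r: 0.9 * s + r.gauss(0.0, PROCESS_STD),
    likelihood=lambda y, s: math.exp(-0.5 * ((y - s) / OBS_STD) ** 2),
    rng=rng,
)
```

Comparing `mse(estimates, states)` with `mse(observations, states)` shows how much the filter reduces the error relative to using the raw observations as state estimates.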
Abstract:
In this paper we address the problem of the separation and recovery of convolutively mixed autoregressive processes in a Bayesian framework. Solving this problem requires the ability to solve integration and/or optimization problems involving complicated posterior distributions. We thus propose efficient stochastic algorithms based on Markov chain Monte Carlo (MCMC) methods. We present three algorithms. The first one is a classical Gibbs sampler that generates samples from the posterior distribution. The other two are stochastic optimization algorithms that optimize either the marginal distribution of the sources, or the marginal distribution of the parameters of the sources and mixing filters, conditional upon the observation. Simulations are presented.
Abstract:
This paper describes the development of the CU-HTK Mandarin Speech-To-Text (STT) system and assesses its performance as part of a transcription-translation pipeline which converts broadcast Mandarin audio into English text. Recent improvements to the STT system are described and these give Character Error Rate (CER) gains of 14.3% absolute for a Broadcast Conversation (BC) task and 5.1% absolute for a Broadcast News (BN) task. The output of these STT systems is then post-processed, so that it consists of sentence-like segments, and translated into English text using a Statistical Machine Translation (SMT) system. The performance of the transcription-translation pipeline is evaluated using the Translation Edit Rate (TER) and BLEU metrics. It is shown that improving both the STT system and the post-STT segmentations can lower the TER scores by up to 5.3% absolute and increase the BLEU scores by up to 2.7% absolute. © 2007 IEEE.
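The Character Error Rate used above is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below assumes that standard definition with unit costs for substitutions, insertions, and deletions; the abstract does not spell out its scoring rules, and the example strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance with unit costs, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference symbol
                    current[j - 1] + 1,  # insertion of a hypothesis symbol
                    previous[j - 1] + (ref_symbol != hyp_symbol),  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted character out of six.
example_cer = character_error_rate("今天天气很好", "今天天汽很好")
```

An "absolute" gain, as reported in the abstract, is a difference between two such rates expressed in percentage points, as opposed to a reduction relative to the baseline rate.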
Abstract:
This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling is presented. These include the use of conversation side based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion network based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. Further details on the derivation of faster systems with moderate performance degradation are discussed in the context of the 2002 CU-HTK 10 × RT conversational speech transcription system. © 2005 IEEE.