48 resultados para Reported speech
Resumo:
Oversmoothing of speech parameter trajectories is one of the causes for quality degradation of HMM-based speech synthesis. Various methods have been proposed to overcome this effect, the most recent ones being global variance (GV) and modulation-spectrum-based post-filter (MSPF). However, there is still a significant quality gap between natural and synthesized speech. In this paper, we propose a two-fold post-filtering technique to alleviate to a certain extent the oversmoothing of spectral and excitation parameter trajectories of HMM-based speech synthesis. For the spectral parameters, we propose a sparse coding-based post-filter to match the trajectories of synthetic speech to that of natural speech, and for the excitation trajectory, we introduce a perceptually motivated post-filter. Experimental evaluations show quality improvement compared with existing methods.
Resumo:
Speech enhancement in stationary noise is addressed using the ideal channel selection framework. In order to estimate the binary mask, we propose to classify each time-frequency (T-F) bin of the noisy signal as speech or noise using Discriminative Random Fields (DRF). The DRF function contains two terms - an enhancement function and a smoothing term. On each T-F bin, we propose to use an enhancement function based on likelihood ratio test for speech presence, while Ising model is used as smoothing function for spectro-temporal continuity in the estimated binary mask. The effect of the smoothing function over successive iterations is found to reduce musical noise as opposed to using only enhancement function. The binary mask is inferred from the noisy signal using Iterated Conditional Modes (ICM) algorithm. Sentences from NOIZEUS corpus are evaluated from 0 dB to 15 dB Signal to Noise Ratio (SNR) in 4 kinds of additive noise settings: additive white Gaussian noise, car noise, street noise and pink noise. The reconstructed speech using the proposed technique is evaluated in terms of average segmental SNR, Perceptual Evaluation of Speech Quality (PESQ) and Mean opinion Score (MOS).
Resumo:
Acoustic feature based speech (syllable) rate estimation and syllable nuclei detection are important problems in automatic speech recognition (ASR), computer assisted language learning (CALL) and fluency analysis. A typical solution for both the problems consists of two stages. The first stage involves computing a short-time feature contour such that most of the peaks of the contour correspond to the syllabic nuclei. In the second stage, the peaks corresponding to the syllable nuclei are detected. In this work, instead of the peak detection, we perform a mode-shape classification, which is formulated as a supervised binary classification problem - mode-shapes representing the syllabic nuclei as one class and remaining as the other. We use the temporal correlation and selected sub-band correlation (TCSSBC) feature contour and the mode-shapes in the TCSSBC feature contour are converted into a set of feature vectors using an interpolation technique. A support vector machine classifier is used for the classification. Experiments are performed separately using Switchboard, TIMIT and CTIMIT corpora in a five-fold cross validation setup. The average correlation coefficients for the syllable rate estimation turn out to be 0.6761, 0.6928 and 0.3604 for three corpora respectively, which outperform those obtained by the best of the existing peak detection techniques. Similarly, the average F-scores (syllable level) for the syllable nuclei detection are 0.8917, 0.8200 and 0.7637 for three corpora respectively. (C) 2016 Elsevier B.V. All rights reserved.