935 resultados para speech bulb
Resumo:
We are addressing a new problem of improving automatic speech recognition performance, given multiple utterances of patterns from the same class. We have formulated the problem of jointly decoding K multiple patterns given a single Hidden Markov Model. It is shown that such a solution is possible by aligning the K patterns using the proposed Multi Pattern Dynamic Time Warping algorithm followed by the Constrained Multi Pattern Viterbi Algorithm The new formulation is tested in the context of speaker independent isolated word recognition for both clean and noisy patterns. When 10 percent of speech is affected by a burst noise at -5 dB Signal to Noise Ratio (local), it is shown that joint decoding using only two noisy patterns reduces the noisy speech recognition error rate to about 51 percent, when compared to the single pattern decoding using the Viterbi Algorithm. In contrast a simple maximization of individual pattern likelihoods, provides only about 7 percent reduction in error rate.
Resumo:
Considering a general linear model of signal degradation, by modeling the probability density function (PDF) of the clean signal using a Gaussian mixture model (GMM) and additive noise by a Gaussian PDF, we derive the minimum mean square error (MMSE) estimator. The derived MMSE estimator is non-linear and the linear MMSE estimator is shown to be a special case. For speech signal corrupted by independent additive noise, by modeling the joint PDF of time-domain speech samples of a speech frame using a GMM, we propose a speech enhancement method based on the derived MMSE estimator. We also show that the same estimator can be used for transform-domain speech enhancement.
New Method for Delexicalization and its Application to Prosodic Tagging for Text-to-Speech Synthesis
Resumo:
This paper describes a new flexible delexicalization method based on glottal excited parametric speech synthesis scheme. The system utilizes inverse filtered glottal flow and all-pole modelling of the vocal tract. The method provides a possibil- ity to retain and manipulate all relevant prosodic features of any kind of speech. Most importantly, the features include voice quality, which has not been properly modeled in earlier delex- icalization methods. The functionality of the new method was tested in a prosodic tagging experiment aimed at providing word prominence data for a text-to-speech synthesis system. The ex- periment confirmed the usefulness of the method and further corroborated earlier evidence that linguistic factors influence the perception of prosodic prominence.
Resumo:
We propose a simple speech music discriminator that uses features based on HILN(Harmonics, Individual Lines and Noise) model. We have been able to test the strength of the feature set on a standard database of 66 files and get an accuracy of around 97%. We also have tested on sung queries and polyphonic music and have got very good results. The current algorithm is being used to discriminate between sung queries and played (using an instrument like flute) queries for a Query by Humming(QBH) system currently under development in the lab.
Resumo:
Non-uniform sampling of a signal is formulated as an optimization problem which minimizes the reconstruction signal error. Dynamic programming (DP) has been used to solve this problem efficiently for a finite duration signal. Further, the optimum samples are quantized to realize a speech coder. The quantizer and the DP based optimum search for non-uniform samples (DP-NUS) can be combined in a closed-loop manner, which provides distinct advantage over the open-loop formulation. The DP-NUS formulation provides a useful control over the trade-off between bitrate and performance (reconstruction error). It is shown that 5-10 dB SNR improvement is possible using DP-NUS compared to extrema sampling approach. In addition, the close-loop DP-NUS gives a 4-5 dB improvement in reconstruction error.
Resumo:
This paper describes a method of automated segmentation of speech assuming the signal is continuously time varying rather than the traditional short time stationary model. It has been shown that this representation gives comparable if not marginally better results than the other techniques for automated segmentation. A formulation of the 'Bach' (music semitonal) frequency scale filter-bank is proposed. A comparative study has been made of the performances using Mel, Bark and Bach scale filter banks considering this model. The preliminary results show up to 80 % matches within 20 ms of the manually segmented data, without any information of the content of the text and without any language dependence. 'Bach' filters are seen to marginally outperform the other filters.
Resumo:
This correspondence describes a method for automated segmentation of speech. The method proposed in this paper uses a specially designed filter-bank called Bach filter-bank which makes use of 'music' related perception criteria. The speech signal is treated as continuously time varying signal as against a short time stationary model. A comparative study has been made of the performances using Mel, Bark and Bach scale filter banks. The preliminary results show up to 80 % matches within 20 ms of the manually segmented data, without any information of the content of the text and without any language dependence. The Bach filters are seen to marginally outperform the other filters.
Resumo:
Joint decoding of multiple speech patterns so as to improve speech recognition performance is important, especially in the presence of noise. In this paper, we propose a Multi-Pattern Viterbi algorithm (MPVA) to jointly decode and recognize multiple speech patterns for automatic speech recognition (ASR). The MPVA is a generalization of the Viterbi Algorithm to jointly decode multiple patterns given a Hidden Markov Model (HMM). Unlike the previously proposed two stage Constrained Multi-Pattern Viterbi Algorithm (CMPVA),the MPVA is a single stage algorithm. MPVA has the advantage that it cart be extended to connected word recognition (CWR) and continuous speech recognition (CSR) problems. MPVA is shown to provide better speech recognition performance than the earlier techniques: using only two repetitions of noisy speech patterns (-5 dB SNR, 10% burst noise), the word error rate using MPVA decreased by 28.5%, when compared to using individual decoding. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
The study analyses the ambivalent relationship republicanism, as a form of self-government free from domination, had with the ideal of participatory oratory and non-dominated speech on the one hand, and with the danger of unhindered demagogy and its possibly fatal consequences to that form of government on the other. Although previous scholarship has delved deeply into republicanism as well as into rhetoric and public speech, the interplay between those aspects has only gathered scattered interest, and there has been no systematic study considering the variety of republican approaches to rhetoric and public speech in 17th-century England. The rare attempts to do so have been studies in English literature, and they have not analysed the political philosophy of republicanism, as the focus has been on republicanism as a literary culture. This study connects the fields of political theory, political history as well as literature in order to make a multidisciplinary contribution to intellectual history. The study shows that, within the tradition of classical republicanism, individual authors could make different choices when addressing the problematic topics of public speech and rhetoric, and the variety of their conclusions often set the authors against each other, resulting in the development of their theories through internal debates within the republican tradition. The authors under study were chosen to reflect this variety and the connections between them: the similarities between James Harrington and John Streater, and between John Milton and John Hall of Durham are shown, as well the controversies between Harrington and Milton, and Streater and Hall, respectively. In addition, by analysing the writings of Marchamont Nedham the study will show that the choices were not limited to more, or less, democratic brands of republicanism. Most significantly, the study provides a thorough analysis of the political philosophies behind the various brands of republicanism, in addition to describing them. By means of this analysis, the study shows that previous attempts to assess the role of free speech and public debate, through the lenses of modern, rights-based liberal political theory have resulted in an inappropriate framework for understanding early modern English republicanism. By approaching the topics through concepts used by the republicans legitimate authority, leadership by oratory, and republican freedom and through the frames of reference available and familiar to them roles of education and institutions the study presents a thorough and systematic analysis of the role and function of rhetoric and public speech in English republicanism. The findings of this analysis have significant consequences to our current understanding of the history and development of republican political theory, and, more generally, of the connections between democratic theory and free speech.
Resumo:
We use parallel weighted finite-state transducers to implement a part-of-speech tagger, which obtains state-of-the-art accuracy when used to tag the Europarl corpora for Finnish, Swedish and English. Our system consists of a weighted lexicon and a guesser combined with a bigram model factored into two weighted transducers. We use both lemmas and tag sequences in the bigram model, which guarantees reliable bigram estimates.
Resumo:
A new method based on unit continuity metric (UCM) is proposed for optimal unit selection in text-to-speech (TTS) synthesis. UCM employs two features, namely, pitch continuity metric and spectral continuity metric. The methods have been implemented and tested on our test bed called MILE-TTS and it is available as web demo. After verification by a self selection test, the algorithms are evaluated on 8 paragraphs each for Kannada and Tamil by native users of the languages. Mean-opinion-score (MOS) shows that naturalness and comprehension are better with UCM based algorithm than the non-UCM based ones. The naturalness of the TTS output is further enhanced by a new rule based algorithm for pause prediction for Tamil language. The pauses between the words are predicted based on parts-of-speech information obtained from the input text.
Resumo:
The paper describes a modular, unit selection based TTS framework, which can be used as a research bed for developing TTS in any new language, as well as studying the effect of changing any parameter during synthesis. Using this framework, TTS has been developed for Tamil. Synthesis database consists of 1027 phonetically rich prerecorded sentences. This framework has already been tested for Kannada. Our TTS synthesizes intelligible and acceptably natural speech, as supported by high mean opinion scores. The framework is further optimized to suit embedded applications like mobiles and PDAs. We compressed the synthesis speech database with standard speech compression algorithms used in commercial GSM phones and evaluated the quality of the resultant synthesized sentences. Even with a highly compressed database, the synthesized output is perceptually close to that with uncompressed database. Through experiments, we explored the ambiguities in human perception when listening to Tamil phones and syllables uttered in isolation,thus proposing to exploit the misperception to substitute for missing phone contexts in the database. Listening experiments have been conducted on sentences synthesized by deliberately replacing phones with their confused ones.
Resumo:
Traditional subspace based speech enhancement (SSE)methods use linear minimum mean square error (LMMSE) estimation that is optimal if the Karhunen Loeve transform (KLT) coefficients of speech and noise are Gaussian distributed. In this paper, we investigate the use of Gaussian mixture (GM) density for modeling the non-Gaussian statistics of the clean speech KLT coefficients. Using Gaussian mixture model (GMM), the optimum minimum mean square error (MMSE) estimator is found to be nonlinear and the traditional LMMSE estimator is shown to be a special case. Experimental results show that the proposed method provides better enhancement performance than the traditional subspace based methods.Index Terms: Subspace based speech enhancement, Gaussian mixture density, MMSE estimation.
Resumo:
We formulate a two-stage Iterative Wiener filtering (IWF) approach to speech enhancement, bettering the performance of constrained IWF, reported in literature. The codebook constrained IWF (CCIWF) has been shown to be effective in achieving convergence of IWF in the presence of both stationary and non-stationary noise. To this, we include a second stage of unconstrained IWF and show that the speech enhancement performance can be improved in terms of average segmental SNR (SSNR), Itakura-Saito (IS) distance and Linear Prediction Coefficients (LPC) parameter coincidence. We also explore the tradeoff between the number of CCIWF iterations and the second stage IWF iterations.