925 results for Compressed speech
New Method for Delexicalization and its Application to Prosodic Tagging for Text-to-Speech Synthesis
Abstract:
This paper describes a new, flexible delexicalization method based on a glottal-excited parametric speech synthesis scheme. The system utilizes inverse-filtered glottal flow and all-pole modelling of the vocal tract. The method makes it possible to retain and manipulate all relevant prosodic features of any kind of speech. Most importantly, the features include voice quality, which has not been properly modeled in earlier delexicalization methods. The functionality of the new method was tested in a prosodic tagging experiment aimed at providing word prominence data for a text-to-speech synthesis system. The experiment confirmed the usefulness of the method and further corroborated earlier evidence that linguistic factors influence the perception of prosodic prominence.
Abstract:
We propose a simple speech/music discriminator that uses features based on the HILN (Harmonics, Individual Lines and Noise) model. We tested the strength of the feature set on a standard database of 66 files and obtained an accuracy of around 97%. We also tested it on sung queries and polyphonic music, with very good results. The current algorithm is being used to discriminate between sung queries and played queries (using an instrument such as a flute) for a Query by Humming (QBH) system currently under development in the lab.
Abstract:
Non-uniform sampling of a signal is formulated as an optimization problem that minimizes the signal reconstruction error. Dynamic programming (DP) is used to solve this problem efficiently for a finite-duration signal. Further, the optimum samples are quantized to realize a speech coder. The quantizer and the DP-based optimum search for non-uniform samples (DP-NUS) can be combined in a closed-loop manner, which provides a distinct advantage over the open-loop formulation. The DP-NUS formulation provides useful control over the trade-off between bitrate and performance (reconstruction error). It is shown that a 5-10 dB SNR improvement is possible using DP-NUS compared to the extrema sampling approach. In addition, the closed-loop DP-NUS gives a 4-5 dB improvement in reconstruction error.
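The dynamic-programming search described in this abstract can be illustrated with a small sketch. It is not the authors' DP-NUS coder: it assumes that the signal is reconstructed by linear interpolation between the kept samples, that the first and last samples are always kept, and that no quantization is applied. The function names `segment_error` and `dp_nonuniform_samples` are hypothetical.

```python
def segment_error(x, i, j):
    """Squared error when the samples strictly between i and j are replaced
    by linear interpolation between x[i] and x[j] (assumed reconstruction)."""
    err = 0.0
    for n in range(i + 1, j):
        interp = x[i] + (x[j] - x[i]) * (n - i) / (j - i)
        err += (x[n] - interp) ** 2
    return err


def dp_nonuniform_samples(x, k):
    """Choose k sample indices (first and last always kept) that minimise
    the total interpolation error, using dynamic programming."""
    n = len(x)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= len(x)")
    inf = float("inf")
    # best[m][j]: minimum error for x[0..j] using m kept samples, the last at j
    best = [[inf] * n for _ in range(k + 1)]
    prev = [[-1] * n for _ in range(k + 1)]
    best[1][0] = 0.0
    for m in range(2, k + 1):
        for j in range(m - 1, n):
            for i in range(m - 2, j):
                if best[m - 1][i] == inf:
                    continue
                cand = best[m - 1][i] + segment_error(x, i, j)
                if cand < best[m][j]:
                    best[m][j] = cand
                    prev[m][j] = i
    # Backtrack from the last sample to recover the chosen indices
    indices = [n - 1]
    j = n - 1
    for m in range(k, 1, -1):
        j = prev[m][j]
        indices.append(j)
    indices.reverse()
    return indices, best[k][n - 1]


# A triangular signal is reconstructed exactly from its two ends and its peak.
indices, error = dp_nonuniform_samples([0, 1, 2, 3, 2, 1, 0], 3)
```

Raising `k` trades a higher sample count (and hence bitrate) for a lower reconstruction error, which is the trade-off the abstract refers to.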
Abstract:
This paper describes a method of automated segmentation of speech that assumes the signal is continuously time-varying, rather than following the traditional short-time stationary model. It is shown that this representation gives comparable, if not marginally better, results than other techniques for automated segmentation. A formulation of the 'Bach' (musical semitone) frequency scale filter-bank is proposed. A comparative study has been made of the performance of Mel, Bark and Bach scale filter banks under this model. The preliminary results show up to 80% matches within 20 ms of the manually segmented data, without any information about the content of the text and without any language dependence. 'Bach' filters are seen to marginally outperform the other filters.
Abstract:
This correspondence describes a method for automated segmentation of speech. The proposed method uses a specially designed filter-bank, called the Bach filter-bank, which makes use of 'music'-related perception criteria. The speech signal is treated as a continuously time-varying signal, as opposed to a short-time stationary model. A comparative study has been made of the performance of Mel, Bark and Bach scale filter banks. The preliminary results show up to 80% matches within 20 ms of the manually segmented data, without any information about the content of the text and without any language dependence. The Bach filters are seen to marginally outperform the other filters.
Abstract:
Joint decoding of multiple speech patterns so as to improve speech recognition performance is important, especially in the presence of noise. In this paper, we propose a Multi-Pattern Viterbi Algorithm (MPVA) to jointly decode and recognize multiple speech patterns for automatic speech recognition (ASR). The MPVA is a generalization of the Viterbi algorithm to jointly decode multiple patterns given a Hidden Markov Model (HMM). Unlike the previously proposed two-stage Constrained Multi-Pattern Viterbi Algorithm (CMPVA), the MPVA is a single-stage algorithm. MPVA has the advantage that it can be extended to connected word recognition (CWR) and continuous speech recognition (CSR) problems. MPVA is shown to provide better speech recognition performance than the earlier techniques: using only two repetitions of noisy speech patterns (-5 dB SNR, 10% burst noise), the word error rate using MPVA decreased by 28.5% when compared to using individual decoding. (C) 2010 Elsevier B.V. All rights reserved.
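For orientation, the sketch below shows the standard single-pattern Viterbi algorithm, which the abstract says MPVA generalizes. It is not the MPVA itself. The toy two-state HMM and all names are assumptions chosen for illustration, and every probability is assumed to be strictly positive so that the logarithms are defined.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_state_path, log_probability) for one observation sequence."""
    # log_delta[s]: best log-probability of any state path that ends in state s
    log_delta = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # Best predecessor of s under the current partial scores
            best_prev = max(
                states, key=lambda r: log_delta[r] + math.log(trans_p[r][s])
            )
            new_delta[s] = (
                log_delta[best_prev]
                + math.log(trans_p[best_prev][s])
                + math.log(emit_p[s][obs])
            )
            pointers[s] = best_prev
        log_delta = new_delta
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state
    last = max(states, key=lambda s: log_delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, log_delta[last]


# Toy HMM (illustrative values only)
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
path, log_p = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

Individual decoding, the baseline mentioned in the abstract, would run a search of this kind separately on each repetition of a word.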
Abstract:
The study analyses the ambivalent relationship that republicanism, as a form of self-government free from domination, had with the ideal of participatory oratory and non-dominated speech on the one hand, and with the danger of unhindered demagogy and its possibly fatal consequences to that form of government on the other. Although previous scholarship has delved deeply into republicanism as well as into rhetoric and public speech, the interplay between those aspects has only gathered scattered interest, and there has been no systematic study considering the variety of republican approaches to rhetoric and public speech in 17th-century England. The rare attempts to do so have been studies in English literature, and they have not analysed the political philosophy of republicanism, as the focus has been on republicanism as a literary culture. This study connects the fields of political theory, political history and literature in order to make a multidisciplinary contribution to intellectual history. The study shows that, within the tradition of classical republicanism, individual authors could make different choices when addressing the problematic topics of public speech and rhetoric, and the variety of their conclusions often set the authors against each other, resulting in the development of their theories through internal debates within the republican tradition. The authors under study were chosen to reflect this variety and the connections between them: the similarities between James Harrington and John Streater, and between John Milton and John Hall of Durham are shown, as well as the controversies between Harrington and Milton, and Streater and Hall, respectively. In addition, by analysing the writings of Marchamont Nedham the study will show that the choices were not limited to more, or less, democratic brands of republicanism.
Most significantly, the study provides a thorough analysis of the political philosophies behind the various brands of republicanism, in addition to describing them. By means of this analysis, the study shows that previous attempts to assess the role of free speech and public debate through the lenses of modern, rights-based liberal political theory have resulted in an inappropriate framework for understanding early modern English republicanism. By approaching the topics through concepts used by the republicans (legitimate authority, leadership by oratory, and republican freedom) and through the frames of reference available and familiar to them (the roles of education and institutions), the study presents a thorough and systematic analysis of the role and function of rhetoric and public speech in English republicanism. The findings of this analysis have significant consequences for our current understanding of the history and development of republican political theory, and, more generally, of the connections between democratic theory and free speech.
Abstract:
We use parallel weighted finite-state transducers to implement a part-of-speech tagger, which obtains state-of-the-art accuracy when used to tag the Europarl corpora for Finnish, Swedish and English. Our system consists of a weighted lexicon and a guesser combined with a bigram model factored into two weighted transducers. We use both lemmas and tag sequences in the bigram model, which guarantees reliable bigram estimates.
Abstract:
A new method based on a unit continuity metric (UCM) is proposed for optimal unit selection in text-to-speech (TTS) synthesis. UCM employs two features, namely, a pitch continuity metric and a spectral continuity metric. The methods have been implemented and tested on our test bed, called MILE-TTS, which is available as a web demo. After verification by a self-selection test, the algorithms are evaluated on 8 paragraphs each for Kannada and Tamil by native users of the languages. The mean opinion score (MOS) shows that naturalness and comprehension are better with the UCM-based algorithm than with the non-UCM-based ones. The naturalness of the TTS output is further enhanced by a new rule-based algorithm for pause prediction for the Tamil language. The pauses between the words are predicted based on part-of-speech information obtained from the input text.
Abstract:
It is possible to sample signals at a sub-Nyquist rate and still be able to reconstruct them with reasonable accuracy, provided they exhibit local Fourier sparsity. Underdetermined systems of equations, which arise out of undersampling, have been solved to yield sparse solutions using compressed sensing algorithms. In this paper, we propose a framework for real-time sampling of multiple analog channels with a single A/D converter, achieving a higher effective sampling rate. Signal reconstruction from noisy measurements on two different synthetic signals is presented. A scheme for implementing the algorithm in hardware is also suggested.
Abstract:
In this paper we propose that the compressive tidal field in the centers of flat-core early-type galaxies and ultraluminous galaxies compresses molecular clouds, producing the dense gas observed in the centers of these galaxies. The effect of galactic tidal fields is usually considered disruptive in the literature. However, for some galaxies, the mass profile flattens toward the center and the resulting galactic tidal field is not disruptive, but instead it is compressive within the flat-core region. We have used the virial theorem to determine the minimum density of a molecular cloud to be stable and gravitationally bound within the tidally compressive region of a galaxy. We have applied the mechanism to determine the mean molecular cloud densities in the centers of a sample of flat-core, early-type galaxies and ultraluminous galaxies. For early-type galaxies with a core-type luminosity profile, the tidal field of the galaxy is compressive within half the core radius. We have calculated the mean gas densities for molecular gas in a sample of early-type galaxies which have already been detected in CO emission, and we obtain mean densities of <n> ~ 10^3-10^6 cm^-3 within the central 100 pc radius. We also use our model to calculate the molecular cloud densities in the inner few hundred parsecs of a sample of ultraluminous galaxies. From the observed rotation curves of these galaxies we show that they have a compressive core within their nuclear region. Our model predicts minimum molecular gas densities in the range 10^2-10^4 cm^-3 in the nuclear gas disks; the smaller values are applicable typically for galaxies with larger core radii. The resulting density values agree well with the observed range. Also, for large core radii, even fairly low-density gas (~ 10^2 cm^-3) can remain bound and stable close to the galactic center.
Abstract:
Traditional subspace-based speech enhancement (SSE) methods use linear minimum mean square error (LMMSE) estimation, which is optimal if the Karhunen-Loève transform (KLT) coefficients of speech and noise are Gaussian distributed. In this paper, we investigate the use of a Gaussian mixture (GM) density for modeling the non-Gaussian statistics of the clean speech KLT coefficients. Using a Gaussian mixture model (GMM), the optimum minimum mean square error (MMSE) estimator is found to be nonlinear, and the traditional LMMSE estimator is shown to be a special case. Experimental results show that the proposed method provides better enhancement performance than the traditional subspace-based methods. Index Terms: Subspace based speech enhancement, Gaussian mixture density, MMSE estimation.
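The relationship between the two estimators can be sketched for a single scalar coefficient. The example assumes an observation y = x + n with zero-mean Gaussian noise of known variance and a Gaussian mixture prior on the clean coefficient x. The paper's estimator operates on KLT coefficients of speech frames and may differ in detail. All function and parameter names are hypothetical.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_mmse_estimate(y, weights, means, variances, noise_var):
    """MMSE estimate of x from y = x + n, with x ~ GMM and n ~ N(0, noise_var).

    Each mixture component contributes its own LMMSE (Wiener-type) estimate,
    weighted by the posterior probability of that component given y.
    """
    # Under component k, y is Gaussian with variance var_k + noise_var
    likelihoods = [
        w * gaussian_pdf(y, m, v + noise_var)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(likelihoods)
    estimate = 0.0
    for like, m, v in zip(likelihoods, means, variances):
        gain = v / (v + noise_var)  # per-component LMMSE gain
        estimate += (like / total) * (m + gain * (y - m))
    return estimate


# One zero-mean component reduces to the linear LMMSE estimator: gain 0.5
single = gmm_mmse_estimate(2.0, [1.0], [0.0], [1.0], 1.0)

# Two components: the effective gain now depends on y, so the estimator is nonlinear
small = gmm_mmse_estimate(0.5, [0.5, 0.5], [0.0, 0.0], [0.1, 10.0], 1.0)
large = gmm_mmse_estimate(4.0, [0.5, 0.5], [0.0, 0.0], [0.1, 10.0], 1.0)
```

In the two-component case a small observation is attributed mostly to the low-variance component and is attenuated strongly, while a large observation is attributed to the high-variance component and is mostly preserved.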
Abstract:
We formulate a two-stage iterative Wiener filtering (IWF) approach to speech enhancement that improves on the performance of the constrained IWF reported in the literature. The codebook-constrained IWF (CCIWF) has been shown to be effective in achieving convergence of IWF in the presence of both stationary and non-stationary noise. To this, we add a second stage of unconstrained IWF and show that the speech enhancement performance can be improved in terms of average segmental SNR (SSNR), Itakura-Saito (IS) distance and Linear Prediction Coefficients (LPC) parameter coincidence. We also explore the trade-off between the number of CCIWF iterations and the number of second-stage IWF iterations.
Abstract:
Effective feature extraction for robust speech recognition is a widely addressed topic, and there is currently much effort to invoke non-stationary signal models instead of the quasi-stationary signal models that lead to standard features such as LPC or MFCC. Joint amplitude modulation and frequency modulation (AM-FM) is a classical non-parametric approach to non-stationary signal modeling, and new feature sets for automatic speech recognition (ASR) have recently been derived based on a multi-band AM-FM representation of the signal. We consider several of these representations and compare their performance for robust speech recognition in noise, using the AURORA-2 database. We show that the proposed FEPSTRUM representation is more effective than the others. We also propose an improvement to FEPSTRUM based on the Teager energy operator (TEO) and show that it can selectively outperform even FEPSTRUM.
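The Teager energy operator named in this abstract has a well-known discrete form, psi[x(n)] = x(n)^2 - x(n-1) x(n+1). The sketch below shows only that operator. How the paper combines it with FEPSTRUM is not stated in the abstract and is not reproduced here. The function name and the test-signal parameters are assumptions.

```python
import math


def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output has two fewer samples than the input because the first and
    last samples lack a neighbour on one side.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# For a pure sinusoid A*cos(w*n + phi) the operator output is the constant
# A^2 * sin(w)^2, so it tracks amplitude and frequency jointly.
amplitude, omega, phase = 2.0, 0.3, 0.5
signal = [amplitude * math.cos(omega * n + phase) for n in range(50)]
energy = teager_energy(signal)
expected = amplitude ** 2 * math.sin(omega) ** 2
```

Because the operator needs only three adjacent samples, it responds to amplitude and frequency changes almost instantaneously, which suits the non-stationary AM-FM signal models discussed in the abstract.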