61 resultados para speech databases


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Latent variable methods, such as PLCA (Probabilistic Latent Component Analysis) have been successfully used for analysis of non-negative signal representations. In this paper, we formulate PLCS (Probabilistic Latent Component Segmentation), which models each time frame of a spectrogram as a spectral distribution. Given the signal spectrogram, the segmentation boundaries are estimated using a maximum-likelihood approach. For an efficient solution, the algorithm imposes a hard constraint that each segment is modelled by a single latent component. The hard constraint facilitates the solution of ML boundary estimation using dynamic programming. The PLCS framework does not impose a parametric assumption unlike earlier ML segmentation techniques. PLCS can be naturally extended to model coarticulation between successive phones. Experiments on the TIMIT corpus show that the proposed technique is promising compared to most state of the art speech segmentation algorithms.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Epoch is defined as the instant of significant excitation within a pitch period of voiced speech. Epoch extraction continues to attract the interest of researchers because of its significance in speech analysis. Existing high performance epoch extraction algorithms require either dynamic programming techniques or a priori information of the average pitch period. An algorithm without such requirements is proposed based on integrated linear prediction residual (ILPR) which resembles the voice source signal. Half wave rectified and negated ILPR (or Hilbert transform of ILPR) is used as the pre-processed signal. A new non-linear temporal measure named the plosion index (PI) has been proposed for detecting `transients' in speech signal. An extension of PI, called the dynamic plosion index (DPI) is applied on pre-processed signal to estimate the epochs. The proposed DPI algorithm is validated using six large databases which provide simultaneous EGG recordings. Creaky and singing voice samples are also analyzed. The algorithm has been tested for its robustness in the presence of additive white and babble noise and on simulated telephone quality speech. The performance of the DPI algorithm is found to be comparable or better than five state-of-the-art techniques for the experiments considered.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Narrowband spectrograms of voiced speech can be modeled as an outcome of two-dimensional (2-D) modulation process. In this paper, we develop a demodulation algorithm to estimate the 2-D amplitude modulation (AM) and carrier of a given spectrogram patch. The demodulation algorithm is based on the Riesz transform, which is a unitary, shift-invariant operator and is obtained as a 2-D extension of the well known 1-D Hilbert transform operator. Existing methods for spectrogram demodulation rely on extension of sinusoidal demodulation method from the communications literature and require precise estimate of the 2-D carrier. On the other hand, the proposed method based on Riesz transform does not require a carrier estimate. The proposed method and the sinusoidal demodulation scheme are tested on real speech data. Experimental results show that the demodulated AM and carrier from Riesz demodulation represent the spectrogram patch more accurately compared with those obtained using the sinusoidal demodulation. The signal-to-reconstruction error ratio was found to be about 2 to 6 dB higher in case of the proposed demodulation approach.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper describes a spatio-temporal registration approach for speech articulation data obtained from electromagnetic articulography (EMA) and real-time Magnetic Resonance Imaging (rtMRI). This is motivated by the potential for combining the complementary advantages of both types of data. The registration method is validated on EMA and rtMRI datasets obtained at different times, but using the same stimuli. The aligned corpus offers the advantages of high temporal resolution (from EMA) and a complete mid-sagittal view (from rtMRI). The co-registration also yields optimum placement of EMA sensors as articulatory landmarks on the magnetic resonance images, thus providing richer spatio-temporal information about articulatory dynamics. (C) 2014 Acoustical Society of America

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We develop noise robust features using Gammatone wavelets derived from the popular Gammatone functions. These wavelets incorporate the characteristics of human peripheral auditory systems, in particular the spatially-varying frequency response of the basilar membrane. We refer to the new features as Gammatone Wavelet Cepstral Coefficients (GWCC). The procedure involved in extracting GWCC from a speech signal is similar to that of the conventional Mel-Frequency Cepstral Coefficients (MFCC) technique, with the difference being in the type of filterbank used. We replace the conventional mel filterbank in MFCC with a Gammatone wavelet filterbank, which we construct using Gammatone wavelets. We also explore the effect of Gammatone filterbank based features (Gammatone Cepstral Coefficients (GCC)) for robust speech recognition. On AURORA 2 database, a comparison of GWCCs and GCCs with MFCCs shows that Gammatone based features yield a better recognition performance at low SNRs.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper proposes an automatic acoustic-phonetic method for estimating voice-onset time of stops. This method requires neither transcription of the utterance nor training of a classifier. It makes use of the plosion index for the automatic detection of burst onsets of stops. Having detected the burst onset, the onset of the voicing following the burst is detected using the epochal information and a temporal measure named the maximum weighted inner product. For validation, several experiments are carried out on the entire TIMIT database and two of the CMU Arctic corpora. The performance of the proposed method compares well with three state-of-the-art techniques. (C) 2014 Acoustical Society of America

Relevância:

20.00% 20.00%

Publicador:

Resumo:

USC-TIMIT is an extensive database of multimodal speech production data, developed to complement existing resources available to the speech research community and with the intention of being continuously refined and augmented. The database currently includes real-time magnetic resonance imaging data from five male and five female speakers of American English. Electromagnetic articulography data have also been presently collected from four of these speakers. The two modalities were recorded in two independent sessions while the subjects produced the same 460 sentence corpus used previously in the MOCHA-TIMIT database. In both cases the audio signal was recorded and synchronized with the articulatory data. The database and companion software are freely available to the research community. (C) 2014 Acoustical Society of America.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

NrichD ( ext-link-type=''uri'' xlink:href=''http://proline.biochem.iisc.ernet.in/NRICHD/'' xlink:type=''simple''>http://proline.biochem.iisc.ernet.in/NRICHD/)< /named-content> is a database of computationally designed protein-like sequences, augmented into natural sequence databases that can perform hops in protein sequence space to assist in the detection of remote relationships. Establishing protein relationships in the absence of structural evidence or natural `intermediately related sequences' is a challenging task. Recently, we have demonstrated that the computational design of artificial intermediary sequences/linkers is an effective approach to fill naturally occurring voids in protein sequence space. Through a large-scale assessment we have demonstrated that such sequences can be plugged into commonly employed search databases to improve the performance of routinely used sequence search methods in detecting remote relationships. Since it is anticipated that such data sets will be employed to establish protein relationships, two databases that have already captured these relationships at the structural and functional domain level, namely, the SCOP database and the Pfam database, have been `enriched' with these artificial intermediary sequences. NrichD database currently contains 3 611 010 artificial sequences that have been generated between 27 882 pairs of families from 374 SCOP folds. The data sets are freely available for download. Additional features include the design of artificial sequences between any two protein families of interest to the user.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a two-dimensional (2-D) multicomponent amplitude-modulation, frequency-modulation (AM-FM) model for a spectrogram patch corresponding to voiced speech, and develop a new demodulation algorithm to effectively separate the AM, which is related to the vocal tract response, and the carrier, which is related to the excitation. The demodulation algorithm is based on the Riesz transform and is developed along the lines of Hilbert-transform-based demodulation for 1-D AM-FM signals. We compare the performance of the Riesz transform technique with that of the sinusoidal demodulation technique on real speech data. Experimental results show that the Riesz-transform-based demodulation technique represents spectrogram patches accurately. The spectrograms reconstructed from the demodulated AM and carrier are inverted and the corresponding speech signal is synthesized. The signal-to-noise ratio (SNR) of the reconstructed speech signal, with respect to clean speech, was found to be 2 to 4 dB higher in case of the Riesz transform technique than the sinusoidal demodulation technique.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We address the problem of separating a speech signal into its excitation and vocal-tract filter components, which falls within the framework of blind deconvolution. Typically, the excitation in case of voiced speech is assumed to be sparse and the vocal-tract filter stable. We develop an alternating l(p) - l(2) projections algorithm (ALPA) to perform deconvolution taking into account these constraints. The algorithm is iterative, and alternates between two solution spaces. The initialization is based on the standard linear prediction decomposition of a speech signal into an autoregressive filter and prediction residue. In every iteration, a sparse excitation is estimated by optimizing an l(p)-norm-based cost and the vocal-tract filter is derived as a solution to a standard least-squares minimization problem. We validate the algorithm on voiced segments of natural speech signals and show applications to epoch estimation. We also present comparisons with state-of-the-art techniques and show that ALPA gives a sparser impulse-like excitation, where the impulses directly denote the epochs or instants of significant excitation.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper we derive an approach for the effective utilization of thermodynamic data in phase-field simulations. While the most widely used methodology for multi-component alloys is following the work by Eiken et al. (2006), wherein, an extrapolative scheme is utilized in conjunction with the TQ interface for deriving the driving force for phase transformation, a corresponding simplistic method based on the formulation of a parabolic free-energy model incorporating all the thermodynamics has been laid out for binary alloys in the work by Folch and Plapp (2005). In the following, we extend this latter approach for multi-component alloys in the framework of the grand-potential formalism. The coupling is applied for the case of the binary eutectic solidification in the Cr-Ni alloy and two-phase solidification in the ternary eutectic alloy (Al-Cr-Ni). A thermodynamic justification entails the basis of the formulation and places it in context of the bigger picture of Integrated Computational Materials Engineering. (C) 2015 Elsevier Ltd. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose apractical, feature-level and score-level fusion approach by combining acoustic and estimated articulatory information for both text independent and text dependent speaker verification. From a practical point of view, we study how to improve speaker verification performance by combining dynamic articulatory information with the conventional acoustic features. On text independent speaker verification, we find that concatenating articulatory features obtained from measured speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves the performance dramatically. However, since directly measuring articulatory data is not feasible in many real world applications, we also experiment with estimated articulatory features obtained through acoustic-to-articulatory inversion. We explore both feature level and score level fusion methods and find that the overall system performance is significantly enhanced even with estimated articulatory features. Such a performance boost could be due to the inter-speaker variation information embedded in the estimated articulatory features. Since the dynamics of articulation contain important information, we included inverted articulatory trajectories in text dependent speaker verification. We demonstrate that the articulatory constraints introduced by inverted articulatory features help to reject wrong password trials and improve the performance after score level fusion. We evaluate the proposed methods on the X-ray Microbeam database and the RSR 2015 database, respectively, for the aforementioned two tasks. Experimental results show that we achieve more than 15% relative equal error rate reduction for both speaker verification tasks. (C) 2015 Elsevier Ltd. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Oversmoothing of speech parameter trajectories is one of the causes for quality degradation of HMM-based speech synthesis. Various methods have been proposed to overcome this effect, the most recent ones being global variance (GV) and modulation-spectrum-based post-filter (MSPF). However, there is still a significant quality gap between natural and synthesized speech. In this paper, we propose a two-fold post-filtering technique to alleviate to a certain extent the oversmoothing of spectral and excitation parameter trajectories of HMM-based speech synthesis. For the spectral parameters, we propose a sparse coding-based post-filter to match the trajectories of synthetic speech to that of natural speech, and for the excitation trajectory, we introduce a perceptually motivated post-filter. Experimental evaluations show quality improvement compared with existing methods.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Speech enhancement in stationary noise is addressed using the ideal channel selection framework. In order to estimate the binary mask, we propose to classify each time-frequency (T-F) bin of the noisy signal as speech or noise using Discriminative Random Fields (DRF). The DRF function contains two terms - an enhancement function and a smoothing term. On each T-F bin, we propose to use an enhancement function based on likelihood ratio test for speech presence, while Ising model is used as smoothing function for spectro-temporal continuity in the estimated binary mask. The effect of the smoothing function over successive iterations is found to reduce musical noise as opposed to using only enhancement function. The binary mask is inferred from the noisy signal using Iterated Conditional Modes (ICM) algorithm. Sentences from NOIZEUS corpus are evaluated from 0 dB to 15 dB Signal to Noise Ratio (SNR) in 4 kinds of additive noise settings: additive white Gaussian noise, car noise, street noise and pink noise. The reconstructed speech using the proposed technique is evaluated in terms of average segmental SNR, Perceptual Evaluation of Speech Quality (PESQ) and Mean opinion Score (MOS).