342 resultados para Speaker Recognition


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Boltzmann machines offer a new and exciting approach to automatic speech recognition, and provide a rigorous mathematical formalism for parallel computing arrays. In this paper we briefly summarize Boltzmann machine theory, and present results showing their ability to recognize both static and time-varying speech patterns. A machine with 2000 units was able to distinguish between the 11 steady-state vowels in English with an accuracy of 85%. The stability of the learning algorithm and methods of preprocessing and coding speech data before feeding it to the machine are also discussed. A new type of unit called a carry input unit, which involves a type of state-feedback, was developed for the processing of time-varying patterns and this was tested on a few short sentences. Use is made of the implications of recent work into associative memory, and the modelling of neural arrays to suggest a good configuration of Boltzmann machines for this sort of pattern recognition.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

VODIS II, a research system in which recognition is based on the conventional one-pass connected-word algorithm extended in two ways, is described. Syntactic constraints can now be applied directly via context-free-grammar rules, and the algorithm generates a lattice of candidate word matches rather than a single globally optimal sequence. This lattice is then processed by a chart parser and an intelligent dialogue controller to obtain the most plausible interpretations of the input. A key feature of the VODIS II architecture is that the concept of an abstract word model allows the system to be used with different pattern-matching technologies and hardware. The current system implements the word models on a real-time dynamic-time-warping recognizer.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The use of variable-width features (prosodics, broad structural information etc.) in large vocabulary speech recognition systems is discussed. Although the value of this sort of information has been recognized in the past, previous approaches have not been widely used in speech systems because either they have not been robust enough for realistic, large vocabulary tasks or they have been limited to certain recognizer architectures. A framework for the use of variable-width features is presented which employs the N-Best algorithm with the features being applied in a post-processing phase. The framework is flexible and widely applicable, giving greater scope for exploitation of the features than previous approaches. Large vocabulary speech recognition experiments using TIMIT show that the application of variable-width features has potential benefits.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper reports our experiences with a phoneme recognition system for the TIMIT database which uses multiple mixture continuous density monophone HMMs trained using MMI. A comprehensive set of results are presented comparing the ML and MMI training criteria for both diagonal and full covariance models. These results using simple monophone HMMs show clear performance gains achieved by MMI training, and are comparable to the best reported by others including those which use context-dependent models. In addition, the paper discusses a number of performance and implementation issues which are crucial to successful MMI training.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A parallel processing network derived from Kanerva's associative memory theory Kanerva 1984 is shown to be able to train rapidly on connected speech data and recognize further speech data with a label error rate of 0·68%. This modified Kanerva model can be trained substantially faster than other networks with comparable pattern discrimination properties. Kanerva presented his theory of a self-propagating search in 1984, and showed theoretically that large-scale versions of his model would have powerful pattern matching properties. This paper describes how the design for the modified Kanerva model is derived from Kanerva's original theory. Several designs are tested to discover which form may be implemented fastest while still maintaining versatile recognition performance. A method is developed to deal with the time varying nature of the speech signal by recognizing static patterns together with a fixed quantity of contextual information. In order to recognize speech features in different contexts it is necessary for a network to be able to model disjoint pattern classes. This type of modelling cannot be performed by a single layer of links. Network research was once held back by the inability of single-layer networks to solve this sort of problem, and the lack of a training algorithm for multi-layer networks. Rumelhart, Hinton & Williams 1985 provided one solution by demonstrating the "back propagation" training algorithm for multi-layer networks. A second alternative is used in the modified Kanerva model. A non-linear fixed transformation maps the pattern space into a space of higher dimensionality in which the speech features are linearly separable. A single-layer network may then be used to perform the recognition. The advantage of this solution over the other using multi-layer networks lies in the greater power and speed of the single-layer network training algorithm. © 1989.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In spite of over two decades of intense research, illumination and pose invariance remain prohibitively challenging aspects of face recognition for most practical applications. The objective of this work is to recognize faces using video sequences both for training and recognition input, in a realistic, unconstrained setup in which lighting, pose and user motion pattern have a wide variability and face images are of low resolution. In particular there are three areas of novelty: (i) we show how a photometric model of image formation can be combined with a statistical model of generic face appearance variation, learnt offline, to generalize in the presence of extreme illumination changes; (ii) we use the smoothness of geodesically local appearance manifold structure and a robust same-identity likelihood to achieve invariance to unseen head poses; and (iii) we introduce an accurate video sequence "reillumination" algorithm to achieve robustness to face motion patterns in video. We describe a fully automatic recognition system based on the proposed method and an extensive evaluation on 171 individuals and over 1300 video sequences with extreme illumination, pose and head motion variation. On this challenging data set our system consistently demonstrated a nearly perfect recognition rate (over 99.7%), significantly outperforming state-of-the-art commercial software and methods from the literature. © Springer-Verlag Berlin Heidelberg 2006.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

As the use of found data increases, more systems are being built using adaptive training. Here transforms are used to represent unwanted acoustic variability, e.g. speaker and acoustic environment changes, allowing a canonical model that models only the "pure" variability of speech to be trained. Adaptive training may be described within a Bayesian framework. By using complexity control approaches to ensure robust parameter estimates, the standard point estimate adaptive training can be justified within this Bayesian framework. However during recognition there is usually no control over the amount of data available. It is therefore preferable to be able to use a full Bayesian approach to applying transforms during recognition rather than the standard point estimates. This paper discusses various approximations to Bayesian approaches including a new variational Bayes approximation. The application of these approaches to state-of-the-art adaptively trained systems using both CAT and MLLR transforms is then described and evaluated on a large vocabulary speech recognition task. © 2005 IEEE.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose an algorithm for semantic segmentation based on 3D point clouds derived from ego-motion. We motivate five simple cues designed to model specific patterns of motion and 3D world structure that vary with object category. We introduce features that project the 3D cues back to the 2D image plane while modeling spatial layout and context. A randomized decision forest combines many such features to achieve a coherent 2D segmentation and recognize the object categories present. Our main contribution is to show how semantic segmentation is possible based solely on motion-derived 3D world structure. Our method works well on sparse, noisy point clouds, and unlike existing approaches, does not need appearance-based descriptors. Experiments were performed on a challenging new video database containing sequences filmed from a moving car in daylight and at dusk. The results confirm that indeed, accurate segmentation and recognition are possible using only motion and 3D world structure. Further, we show that the motion-derived information complements an existing state-of-the-art appearance-based method, improving both qualitative and quantitative performance. © 2008 Springer Berlin Heidelberg.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a novel label processor which can recognize multiple spectral-amplitude-code labels using four-wave-mixing sidebands and selective optical filtering. Ten code-labels x 10 Gbps variable-length packets are transmitted over a 200 km single-hop switched network.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Model compensation is a standard way of improving the robustness of speech recognition systems to noise. A number of popular schemes are based on vector Taylor series (VTS) compensation, which uses a linear approximation to represent the influence of noise on the clean speech. To compensate the dynamic parameters, the continuous time approximation is often used. This approximation uses a point estimate of the gradient, which fails to take into account that dynamic coefficients are a function of a number of consecutive static coefficients. In this paper, the accuracy of dynamic parameter compensation is improved by representing the dynamic features as a linear transformation of a window of static features. A modified version of VTS compensation is applied to the distribution of the window of static features and, importantly, their correlations. These compensated distributions are then transformed to distributions over standard static and dynamic features. With this improved approximation, it is also possible to obtain full-covariance corrupted speech distributions. This addresses the correlation changes that occur in noise. The proposed scheme outperformed the standard VTS scheme by 10% to 20% relative on a range of tasks. © 2006 IEEE.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

For many realistic scenarios, there are multiple factors that affect the clean speech signal. In this work approaches to handling two such factors, speaker and background noise differences, simultaneously are described. A new adaptation scheme is proposed. Here the acoustic models are first adapted to the target speaker via an MLLR transform. This is followed by adaptation to the target noise environment via model-based vector Taylor series (VTS) compensation. These speaker and noise transforms are jointly estimated, using maximum likelihood. Experiments on the AURORA4 task demonstrate that this adaptation scheme provides improved performance over VTS-based noise adaptation. In addition, this framework enables the speech and noise to be factorised, allowing the speaker transform estimated in one noise condition to be successfully used in a different noise condition. © 2011 IEEE.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Discriminative mapping transforms (DMTs) is an approach to robustly adding discriminative training to unsupervised linear adaptation transforms. In unsupervised adaptation DMTs are more robust to unreliable transcriptions than directly estimating adaptation transforms in a discriminative fashion. They were previously proposed for use with MLLR transforms with the associated need to explicitly transform the model parameters. In this work the DMT is extended to CMLLR transforms. As these operate in the feature space, it is only necessary to apply a different linear transform at the front-end rather than modifying the model parameters. This is useful for rapidly changing speakers/environments. The performance of DMTs with CMLLR was evaluated on the WSJ 20k task. Experimental results show that DMTs based on constrained linear transforms yield 3% to 6% relative gain over MLE transforms in unsupervised speaker adaptation. © 2011 IEEE.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Recently there has been interest in structured discriminative models for speech recognition. In these models sentence posteriors are directly modelled, given a set of features extracted from the observation sequence, and hypothesised word sequence. In previous work these discriminative models have been combined with features derived from generative models for noise-robust speech recognition for continuous digits. This paper extends this work to medium to large vocabulary tasks. The form of the score-space extracted using the generative models, and parameter tying of the discriminative model, are both discussed. Update formulae for both conditional maximum likelihood and minimum Bayes' risk training are described. Experimental results are presented on small and medium to large vocabulary noise-corrupted speech recognition tasks: AURORA 2 and 4. © 2011 IEEE.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Structured precision modelling is an important approach to improve the intra-frame correlation modelling of the standard HMM, where Gaussian mixture model with diagonal covariance are used. Previous work has all been focused on direct structured representation of the precision matrices. In this paper, a new framework is proposed, where the structure of the Cholesky square root of the precision matrix is investigated, referred to as Cholesky Basis Superposition (CBS). Each Cholesky matrix associated with a particular Gaussian distribution is represented as a linear combination of a set of Gaussian independent basis upper-triangular matrices. Efficient optimization methods are derived for both combination weights and basis matrices. Experiments on a Chinese dictation task showed that the proposed approach can significantly outperformed the direct structured precision modelling with similar number of parameters as well as full covariance modelling. © 2011 IEEE.