960 results for audiovisual speech perception
Abstract:
Boltzmann machines offer a new and exciting approach to automatic speech recognition, and provide a rigorous mathematical formalism for parallel computing arrays. In this paper we briefly summarize Boltzmann machine theory, and present results showing their ability to recognize both static and time-varying speech patterns. A machine with 2000 units was able to distinguish between the 11 steady-state vowels in English with an accuracy of 85%. The stability of the learning algorithm and methods of preprocessing and coding speech data before feeding it to the machine are also discussed. A new type of unit called a carry input unit, which involves a type of state-feedback, was developed for the processing of time-varying patterns and this was tested on a few short sentences. Use is made of the implications of recent work into associative memory, and the modelling of neural arrays to suggest a good configuration of Boltzmann machines for this sort of pattern recognition.
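As a rough illustration of the theory this abstract summarizes, the sketch below shows the standard stochastic update rule for a single binary Boltzmann machine unit. It is a minimal sketch, not the paper's implementation: the names `energy_gap` and `on_probability`, the weight layout, and the example values are all assumptions made for illustration, and the "carry input unit" described above is not modelled.

```python
import math

def energy_gap(weights, bias, states, i):
    """Energy drop obtained by switching unit i on, given the other units' states."""
    return bias[i] + sum(
        weights[i][j] * states[j] for j in range(len(states)) if j != i
    )

def on_probability(weights, bias, states, i, temperature):
    """Probability that unit i adopts state 1 at the given temperature."""
    gap = energy_gap(weights, bias, states, i)
    return 1.0 / (1.0 + math.exp(-gap / temperature))

# Hypothetical two-unit machine with a symmetric excitatory link.
weights = [[0.0, 2.0], [2.0, 0.0]]
bias = [0.0, 0.0]
p_with_neighbour_on = on_probability(weights, bias, [0, 1], 0, temperature=1.0)
p_with_neighbour_off = on_probability(weights, bias, [0, 0], 0, temperature=1.0)
```

To sample a new state, one would compare a uniform random number with the returned probability; raising the temperature pushes the probability towards 0.5.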
Abstract:
This paper describes a speech coding technique developed to digitise speech at bit rates in the range 4.8 to 8 kb/s while remaining insensitive to acoustic background noise and to bit errors on the digital link. The main aim has been to develop a coding scheme whose speech quality and robustness against noise and errors are similar to those of a 16000 b/s continuously variable slope delta (CVSD) coder, but which operates at half its data rate or less. It was also desirable to keep the complexity of the coding scheme within the scope of what could reasonably be handled by current signal processing chips or by a single custom integrated circuit. Application areas include mobile radio and small Satcomms terminals.
Abstract:
VODIS II, a research system in which recognition is based on the conventional one-pass connected-word algorithm extended in two ways, is described. Syntactic constraints can now be applied directly via context-free-grammar rules, and the algorithm generates a lattice of candidate word matches rather than a single globally optimal sequence. This lattice is then processed by a chart parser and an intelligent dialogue controller to obtain the most plausible interpretations of the input. A key feature of the VODIS II architecture is that the concept of an abstract word model allows the system to be used with different pattern-matching technologies and hardware. The current system implements the word models on a real-time dynamic-time-warping recognizer.
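The dynamic-time-warping matching mentioned in this abstract can be illustrated with a minimal sketch. The code below computes a plain DTW distance between two one-dimensional sequences; it is an assumption-laden toy, not the one-pass connected-word algorithm or the VODIS II recognizer, and the function name `dtw_distance` and the absolute-difference local cost are chosen only for illustration.

```python
import math

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allow a step in a only, in b only, or in both sequences.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]
```

A template that is a time-stretched copy of the input, such as `[1, 2, 3]` against `[1, 2, 2, 3]`, yields a distance of zero, which is the property that makes DTW useful for matching words spoken at different rates.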
Abstract:
Four types of neural networks which have previously been established for speech recognition and tested on a small, seven-speaker, 100-sentence database are applied to the TIMIT database. The networks are a recurrent network phoneme recognizer, a modified Kanerva model morph recognizer, a compositional representation phoneme-to-word recognizer, and a modified Kanerva model morph-to-word recognizer. The major result is for the recurrent net, giving a phoneme recognition accuracy of 57% from the si and sx sentences. The Kanerva morph recognizer achieves 66.2% accuracy for a small subset of the sa and sx sentences. The results for the word recognizers are incomplete.
Abstract:
In recent years there has been growing interest within the speech research community in spectral estimators which circumvent the traditional quasi-stationary assumption and provide greater time-frequency (t-f) resolution than conventional spectral estimators, such as the short time Fourier power spectrum (STFPS). One distribution in particular, the Wigner distribution (WD), has attracted considerable interest. However, experimental studies have indicated that, despite its improved t-f resolution, employing the WD as the front end of a speech recognition system actually reduces recognition performance; only by explicitly re-introducing t-f smoothing into the WD are recognition rates improved. In this paper we provide an explanation for these findings. By treating the spectral estimation problem as the optimization of a bias-variance trade-off, we show why additional t-f smoothing improves recognition rates, despite reducing the t-f resolution of the spectral estimator. A practical adaptive smoothing algorithm is presented, which attempts to match the degree of smoothing introduced into the WD to the time-varying quasi-stationary regions within the speech waveform. The recognition performance of the resulting adaptively smoothed estimator is found to be comparable to that of conventional filterbank estimators, yet the average temporal sampling rate of the resulting spectral vectors is reduced by around a factor of 10. © 1992.
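For reference, the textbook definitions behind the terms used in this abstract are given below. The symbols (signal x, smoothing kernel Φ) are notation assumed here for illustration and are not taken from the paper; the adaptive choice of kernel that the paper proposes is not reproduced.

```latex
% Wigner distribution of a signal x(t)
W_x(t, f) = \int_{-\infty}^{\infty}
    x\!\left(t + \tfrac{\tau}{2}\right)
    x^{*}\!\left(t - \tfrac{\tau}{2}\right)
    e^{-j 2\pi f \tau} \, d\tau

% Time-frequency smoothing with a kernel \Phi
\tilde{W}_x(t, f) = \iint \Phi(t - t', f - f') \, W_x(t', f') \, dt' \, df'
```

A wider kernel Φ lowers the variance of the estimate at the cost of more bias (coarser t-f resolution), which is the trade-off the abstract refers to.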
Abstract:
The use of variable-width features (prosodics, broad structural information etc.) in large vocabulary speech recognition systems is discussed. Although the value of this sort of information has been recognized in the past, previous approaches have not been widely used in speech systems because either they have not been robust enough for realistic, large vocabulary tasks or they have been limited to certain recognizer architectures. A framework for the use of variable-width features is presented which employs the N-Best algorithm with the features being applied in a post-processing phase. The framework is flexible and widely applicable, giving greater scope for exploitation of the features than previous approaches. Large vocabulary speech recognition experiments using TIMIT show that the application of variable-width features has potential benefits.
Abstract:
A parallel processing network derived from Kanerva's associative memory theory (Kanerva, 1984) is shown to be able to train rapidly on connected speech data and recognize further speech data with a label error rate of 0.68%. This modified Kanerva model can be trained substantially faster than other networks with comparable pattern discrimination properties. Kanerva presented his theory of a self-propagating search in 1984, and showed theoretically that large-scale versions of his model would have powerful pattern matching properties. This paper describes how the design for the modified Kanerva model is derived from Kanerva's original theory. Several designs are tested to discover which form may be implemented fastest while still maintaining versatile recognition performance. A method is developed to deal with the time-varying nature of the speech signal by recognizing static patterns together with a fixed quantity of contextual information. In order to recognize speech features in different contexts it is necessary for a network to be able to model disjoint pattern classes. This type of modelling cannot be performed by a single layer of links. Network research was once held back by the inability of single-layer networks to solve this sort of problem, and by the lack of a training algorithm for multi-layer networks. Rumelhart, Hinton & Williams (1985) provided one solution by demonstrating the "back propagation" training algorithm for multi-layer networks. An alternative solution is used in the modified Kanerva model. A non-linear fixed transformation maps the pattern space into a space of higher dimensionality in which the speech features are linearly separable. A single-layer network may then be used to perform the recognition. The advantage of this solution over the multi-layer approach lies in the greater power and speed of the single-layer network training algorithm. © 1989.
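The idea of a fixed non-linear expansion followed by a trainable single layer can be illustrated on XOR, the classic disjoint-class problem that a single layer cannot solve on the raw inputs. This is a toy sketch under stated assumptions, not the modified Kanerva model from the paper: the "locations" are simply the four 2-bit corners, the activation radius is 0, and the single layer is trained with the ordinary perceptron rule. All function names are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def expand(pattern, locations, radius):
    """Fixed non-linear map: one binary feature per location within `radius`."""
    return [1 if hamming(pattern, loc) <= radius else 0 for loc in locations]

def predict(weights, bias, features):
    """Single-layer threshold unit."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def train_perceptron(feature_rows, targets, epochs=20):
    """Perceptron rule applied to the single trainable layer only."""
    weights = [0.0] * len(feature_rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(feature_rows, targets):
            err = target - predict(weights, bias, features)
            weights = [w + err * x for w, x in zip(weights, features)]
            bias += err
    return weights, bias

patterns = list(product([0, 1], repeat=2))   # the four 2-bit inputs
targets = [a ^ b for a, b in patterns]       # XOR labels
locations = patterns                         # assumed fixed locations
hidden = [expand(p, locations, radius=0) for p in patterns]
weights, bias = train_perceptron(hidden, targets)
predictions = [predict(weights, bias, h) for h in hidden]
```

The expansion itself is never trained; only the single output layer is, which is the source of the training-speed advantage the abstract describes.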
Abstract:
In this paper, a Decimative Spectral estimation method based on Eigenanalysis and SVD (Singular Value Decomposition) is presented and applied to speech signals in order to estimate Formant/Bandwidth values. The underlying model decomposes a signal into complex damped sinusoids. The algorithm is applied not only to speech samples but also to a small number of the autocorrelation coefficients of a speech frame, for finer estimation. Correct estimation of Formant/Bandwidth values depends on the model order and thus on the requested number of poles. Overall, experimental results indicate that the proposed methodology successfully estimates formant trajectories and their respective bandwidths.
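Once a damped-sinusoid model has produced its poles, each pole maps to a formant frequency and bandwidth through a standard relation. The sketch below shows only that final conversion, assuming a discrete-time pole and a known sampling rate; the function name `pole_to_formant` and the example values are hypothetical, and the decimative eigenanalysis/SVD estimation step from the paper is not implemented.

```python
import cmath
import math

def pole_to_formant(pole, fs):
    """Convert a discrete-time pole to (frequency, bandwidth) in Hz.

    The pole angle gives the frequency and the pole radius gives the
    damping, from which the 3 dB bandwidth follows.
    """
    freq = abs(cmath.phase(pole)) * fs / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs / math.pi
    return freq, bandwidth

# Hypothetical pole for a 500 Hz resonance with 80 Hz bandwidth at 8 kHz.
fs = 8000.0
pole = cmath.rect(math.exp(-math.pi * 80.0 / fs), 2.0 * math.pi * 500.0 / fs)
freq, bandwidth = pole_to_formant(pole, fs)
```

Poles closer to the unit circle correspond to narrower bandwidths, which is why the choice of model order (the number of poles requested) affects which resonances are resolved.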