373 results for speech features
Abstract:
The task in keyword spotting (KWS) is to hypothesise times at which any of a set of key terms occurs in audio. An important aspect of such systems is the scores assigned to these hypotheses, the accuracy of which has a significant impact on performance. Estimating these scores may be formulated as a confidence estimation problem, where a measure of confidence is assigned to each key term hypothesis. In this work, a set of discriminative features is defined and combined using a conditional random field (CRF) model for improved confidence estimation. An extension to this model that directly addresses the problem of score normalisation across key terms is also introduced. The implicit score normalisation which results from applying this approach to separate systems in a hybrid configuration yields further benefits. Results are presented which show notable improvements in KWS performance using the techniques presented in this work. © 2013 IEEE.
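To make the idea of score normalisation across key terms concrete, here is a minimal sketch of a simple sum-to-one normalisation. This is a swapped-in baseline chosen for illustration, not the CRF-based extension the abstract describes. The function name, the `gamma` exponent, and the example hypotheses are all assumptions.

```python
from collections import defaultdict


def sum_to_one_normalise(hypotheses, gamma=1.0):
    """Rescale detection scores so that, for each key term, they sum to one.

    hypotheses: list of (term, start_time, score) tuples with score > 0.
    gamma: hypothetical sharpening exponent applied before normalising.
    """
    totals = defaultdict(float)
    for term, _, score in hypotheses:
        totals[term] += score ** gamma
    return [
        (term, start, (score ** gamma) / totals[term])
        for term, start, score in hypotheses
    ]


# Illustrative hypotheses only: (key term, start time in seconds, raw score).
hyps = [
    ("budget", 12.4, 0.30),
    ("budget", 57.1, 0.10),
    ("parliament", 3.8, 0.02),
]
normalised = sum_to_one_normalise(hyps)
```

After normalisation, the single low-scoring hypothesis for a rarely detected term is no longer penalised relative to terms with many detections, which is the kind of cross-term comparability that score normalisation aims for.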
Abstract:
A partially observable Markov decision process has been proposed as a dialogue model that enables robustness to speech recognition errors and automatic policy optimisation using reinforcement learning (RL). However, conventional RL algorithms require a very large number of dialogues, necessitating a user simulator. Recently, Gaussian processes have been shown to substantially speed up the optimisation, making it possible to learn directly from interaction with human users. However, early studies have been limited to very low dimensional spaces and the learning has exhibited convergence problems. Here we investigate learning from human interaction using the Bayesian Update of Dialogue State system. This dynamic Bayesian network based system has an optimisation space covering more than one hundred features, allowing a wide range of behaviours to be learned. Using an improved policy model and a more robust reward function, we show that stable learning can be achieved that significantly outperforms a simulator trained policy. © 2013 IEEE.
Abstract:
Within the spectrum of extratesticular mesenchymal tumors in the scrotum and perineum lies cellular angiofibroma, also known as angiomyofibroblastoma-like tumor, a rare lesion originally described as occurring almost exclusively in the vulva, perineum, and pelvis of women. We report a case of this tumor, with an adjacent scrotal lipoma, occurring in a 60-year-old male who presented to our department with a firm palpable scrotal mass. To our knowledge, the MRI findings of this entity have yet to be described in the radiological literature. We present the MRI features of cellular angiofibroma that are consistent with the pathological characteristics of this entity: a benign cellular and fibrous tumor with prominent vascularity.
Abstract:
We experimentally demonstrate the planar focusing of Surface Plasmon Polaritons using space variant PMMA subwavelength features on top of a metallic film. Focusing is obtained by creating an effective graded refractive index profile. © 2012 OSA.
Abstract:
Adaptation to speaker and environment changes is an essential part of current automatic speech recognition (ASR) systems. In recent years the use of multi-layer perceptrons (MLPs) has become increasingly common in ASR systems. A standard approach to handling speaker differences when using MLPs is to apply a global speaker-specific constrained MLLR (CMLLR) transform to the features prior to training or using the MLP. This paper considers the situation when there are both speaker and channel (communication link) differences in the data. A more powerful transform, front-end CMLLR (FE-CMLLR), is applied to the inputs to the MLP to represent the channel differences. Though global, these FE-CMLLR transforms vary from time instance to time instance. Experiments on a channel-distorted dialect Arabic conversational speech recognition task indicate the usefulness of adapting MLP features using both CMLLR and FE-CMLLR transforms. © 2013 IEEE.
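A CMLLR transform acts on the features as an affine map, x' = Ax + b, with one (A, b) pair per speaker. The sketch below applies such a map to a sequence of feature vectors in pure Python. The function name and the two-dimensional example values are assumptions. Estimating A and b, and the time-varying FE-CMLLR case, are not shown.

```python
def apply_affine_transform(frames, A, b):
    """Return A x + b for every feature vector x in frames."""
    transformed = []
    for x in frames:
        y = [
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)
        ]
        transformed.append(y)
    return transformed


# Hypothetical 2-dimensional features and transform, for illustration only.
frames = [[1.0, 2.0], [0.5, -1.0]]
A = [[2.0, 0.0], [0.0, 0.5]]
b = [1.0, -1.0]
adapted = apply_affine_transform(frames, A, b)
```

In a setup like the one the abstract describes, the adapted frames would then be passed to the MLP in place of the raw features.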
Abstract:
State-of-the-art speech recognisers are usually based on hidden Markov models (HMMs). They model a hidden symbol sequence with a Markov process, with the observations independent given that sequence. These assumptions yield efficient algorithms, but limit the power of the model. An alternative model that allows a wide range of features, including word- and phone-level features, is a log-linear model. To handle, for example, word-level variable-length features, the original feature vectors must be segmented into words. Thus, decoding must find the optimal combination of segmentation of the utterance into words and word sequence. Features must therefore be extracted for each possible segment of audio. For many types of features, this becomes slow. In this paper, long-span features are derived from the likelihoods of word HMMs. Derivatives of the log-likelihoods, which break the Markov assumption, are appended. Previously, decoding with this model took cubic time in the length of the sequence, and longer for higher-order derivatives. This paper shows how to decode in quadratic time. © 2013 IEEE.
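The joint search over segmentation and word sequence mentioned above can be illustrated with a generic segmental (semi-Markov) Viterbi sketch. This is not the paper's quadratic-time algorithm. The `segment_score` callback is a hypothetical stand-in for the model's segment-level features, and the toy scorer and frame labels are invented for illustration.

```python
def segmental_decode(num_frames, words, segment_score, max_len=None):
    """Jointly pick a segmentation and word sequence maximising the total score.

    segment_score(start, end, word) scores frames [start, end) labelled `word`.
    max_len optionally caps the segment length considered.
    """
    best = [float("-inf")] * (num_frames + 1)
    back = [None] * (num_frames + 1)
    best[0] = 0.0
    for end in range(1, num_frames + 1):
        first = 0 if max_len is None else max(0, end - max_len)
        for start in range(first, end):
            for word in words:
                cand = best[start] + segment_score(start, end, word)
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, word)
    # Follow the back-pointers from the final frame to recover the segments.
    segments = []
    end = num_frames
    while end > 0:
        start, word = back[end]
        segments.append((start, end, word))
        end = start
    segments.reverse()
    return best[num_frames], segments


# Toy data: each frame carries the label it most resembles (an assumption).
frames = ["a", "a", "b", "b", "b"]


def toy_score(start, end, word):
    """+1 per matching frame, -1 per mismatch, minus 0.5 per segment."""
    return sum(1.0 if f == word else -1.0 for f in frames[start:end]) - 0.5


score, segments = segmental_decode(len(frames), ["a", "b"], toy_score)
```

The search visits a number of candidate segments that grows quadratically with the number of frames, so the total cost depends on how expensive each segment score is. The toy scorer rescans its segment on every call, which is one way a naive implementation ends up cubic overall.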
Abstract:
Large margin criteria and discriminative models are two effective improvements for HMM-based speech recognition. This paper proposes a large-margin-trained log-linear model with kernels for continuous speech recognition (CSR). To avoid explicit computation in the high-dimensional feature space and to achieve nonlinear decision boundaries, a kernel-based training and decoding framework is proposed. To make the system robust to noise, a kernel adaptation scheme is also presented. Previous work in this area is extended in two directions. First, most kernels for CSR focus on measuring the similarity between two observation sequences. The proposed joint kernels define a similarity between two observation-label sequence pairs at the sentence level. Second, this paper addresses how to efficiently employ kernels in large margin training and decoding with lattices. To the best of our knowledge, this is the first attempt at using large margin kernel-based log-linear models for CSR. The model is evaluated on a noise-corrupted continuous digit task: AURORA 2.0. © 2013 IEEE.
Abstract:
The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (∼10 hours/language) with 4 languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language-independent acoustic model test on the target language showed that, at a minimum, retraining or adapting the acoustic models to the target language is currently needed to achieve reasonable performance. © 2013 IEEE.
Abstract:
This paper presents an overview of the Text-to-Speech synthesis system developed at the Institute for Language and Speech Processing (ILSP). It focuses on the key issues regarding the design of the system components. The system currently fully supports three languages (Greek, English, Bulgarian) and is designed to be as language- and speaker-independent as possible. Experimental results are also presented which show that the system produces high-quality synthetic speech in terms of naturalness and intelligibility. The system was recently ranked among the first three systems worldwide in terms of achieved quality for the English language at the international Blizzard Challenge 2013 workshop. © 2014 Springer International Publishing.