14 results for speakers
in Cambridge University Engineering Department Publications Database
Abstract:
The separation of independent sources from mixed observed data is a fundamental and challenging problem. In many practical situations, observations may be modelled as linear mixtures of a number of source signals, i.e. a linear multi-input multi-output system. A typical example is speech recordings made in an acoustic environment in the presence of background noise and/or competing speakers. Other examples include EEG signals, passive sonar applications and cross-talk in data communications. In this paper, we propose iterative algorithms to solve the n × n linear time invariant system under two different constraints. Some existing solutions for 2 × 2 systems are reviewed and compared.
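The linear-mixture model described above can be illustrated with a minimal sketch. It assumes an instantaneous (memoryless) 2 × 2 mixture with a known mixing matrix, which is far simpler than the blind, iterative setting the paper addresses; the function names `mix` and `invert_2x2` and all numeric values are hypothetical.

```python
def mix(A, signals):
    """Apply a 2x2 matrix A sample-by-sample to a pair of signals."""
    s1, s2 = signals
    x1 = [A[0][0] * a + A[0][1] * b for a, b in zip(s1, s2)]
    x2 = [A[1][0] * a + A[1][1] * b for a, b in zip(s1, s2)]
    return x1, x2


def invert_2x2(A):
    """Return the inverse of a 2x2 matrix, or raise if it is singular."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("mixing matrix is singular; sources cannot be separated")
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


# Two illustrative source signals (assumed values).
sources = ([1.0, 0.0, -1.0, 0.5], [0.2, 0.4, 0.6, 0.8])
# Assumed mixing matrix: each observation is a weighted sum of both sources.
A = [[1.0, 0.5], [0.3, 1.0]]

observed = mix(A, sources)                 # x = A s
recovered = mix(invert_2x2(A), observed)   # s = A^{-1} x
```

In blind separation the matrix `A` is unknown and must be estimated from the observations alone, which is where iterative algorithms such as those proposed in the paper come in.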
Abstract:
This paper describes results obtained using the modified Kanerva model to perform word recognition in continuous speech after being trained on the multi-speaker Alvey 'Hotel' speech corpus. Theoretical discoveries have recently enabled us to increase the speed of execution of part of the model by two orders of magnitude over that previously reported by Prager & Fallside. The memory required for the operation of the model has been similarly reduced. The recognition accuracy reaches 95% without syntactic constraints when tested on different data from seven trained speakers. Real time simulation of a model with 9,734 active units is now possible in both training and recognition modes using the Alvey PARSIFAL transputer array. The modified Kanerva model is a static network consisting of a fixed nonlinear mapping (location matching) followed by a single layer of conventional adaptive links. A section of preprocessed speech is transformed by the non-linear mapping to a high dimensional representation. From this intermediate representation a simple linear mapping is able to perform complex pattern discrimination to form the output, indicating the nature of the speech features present in the input window.
Abstract:
This paper investigates a method of automatic pronunciation scoring for use in computer-assisted language learning (CALL) systems. The method utilizes a likelihood-based 'Goodness of Pronunciation' (GOP) measure which is extended to include individual thresholds for each phone based on both averaged native confidence scores and on rejection statistics provided by human judges. Further improvements are obtained by incorporating models of the subject's native language and by augmenting the recognition networks to include expected pronunciation errors. The various GOP measures are assessed using a specially recorded database of non-native speakers which has been annotated to mark phone-level pronunciation errors. Since pronunciation assessment is highly subjective, a set of four performance measures has been designed, each of them measuring different aspects of how well computer-derived phone-level scores agree with human scores. These performance measures are used to cross-validate the reference annotations and to assess the basic GOP algorithm and its refinements. The experimental results suggest that a likelihood-based pronunciation scoring metric can achieve usable performance, especially after applying the various enhancements.
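A likelihood-based score with per-phone thresholds might be sketched as follows. The abstract does not give the formula, so this assumes one common formulation: the absolute difference between the log-likelihood of the target phone and that of the best-scoring phone, normalized by the number of frames. The function names, threshold values, and log-likelihoods are all hypothetical.

```python
def gop_score(target_loglik, all_phone_logliks, num_frames):
    """Duration-normalized log-likelihood gap between target and best phone.

    Assumed formulation: |log p(O|target) - max_q log p(O|q)| / frames.
    A score of 0 means the target phone was also the best-matching phone.
    """
    best = max(all_phone_logliks)
    return abs(target_loglik - best) / num_frames


def is_rejected(phone, score, thresholds, default_threshold):
    """Flag a phone as mispronounced if its score exceeds its own threshold."""
    return score > thresholds.get(phone, default_threshold)


# Illustrative values only: log-likelihoods over a 10-frame segment.
score = gop_score(target_loglik=-120.0,
                  all_phone_logliks=[-120.0, -100.0, -150.0],
                  num_frames=10)

# Hypothetical per-phone thresholds; unlisted phones use the default.
thresholds = {"ae": 1.5, "th": 3.0}
rejected_ae = is_rejected("ae", score, thresholds, default_threshold=2.5)
rejected_th = is_rejected("th", score, thresholds, default_threshold=2.5)
```

With a single global threshold the same score would be accepted or rejected regardless of the phone; per-phone thresholds let harder phones be judged more leniently.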
Abstract:
Discriminative mapping transforms (DMTs) are an approach to robustly adding discriminative training to unsupervised linear adaptation transforms. In unsupervised adaptation, DMTs are more robust to unreliable transcriptions than directly estimating adaptation transforms in a discriminative fashion. They were previously proposed for use with MLLR transforms, with the associated need to explicitly transform the model parameters. In this work the DMT is extended to CMLLR transforms. As these operate in the feature space, it is only necessary to apply a different linear transform at the front-end rather than modifying the model parameters. This is useful for rapidly changing speakers/environments. The performance of DMTs with CMLLR was evaluated on the WSJ 20k task. Experimental results show that DMTs based on constrained linear transforms yield 3% to 6% relative gain over MLE transforms in unsupervised speaker adaptation. © 2011 IEEE.
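The point about feature-space transforms can be illustrated with a minimal sketch: a CMLLR-style transform is an affine map applied to each feature vector at the front-end, so switching speakers means swapping the pair (A, b) and leaving the acoustic model untouched. How the transform is estimated (by maximum likelihood or discriminatively) is not shown, and the function name and values below are hypothetical.

```python
def apply_feature_transform(A, b, x):
    """Return A @ x + b for a square matrix A and vectors b, x (plain lists)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]


# Illustrative 2-dimensional feature vector and per-speaker transform.
x = [3.0, 4.0]
A = [[2.0, 0.0],
     [0.0, 0.5]]
b = [1.0, -1.0]

adapted = apply_feature_transform(A, b, x)
```

An identity matrix with a zero bias leaves the features unchanged, which is a convenient starting point before any adaptation data has been seen.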
Abstract:
There is increasing adoption of computer-based tools to support the product development process. Tools include computer-aided design, computer-aided manufacture, systems engineering and product data management systems. The fact that companies choose to invest in tools might be regarded as evidence that tools, in aggregate, are perceived to possess business value through their application to engineering activities. Yet the ways in which value accrues from tool technology are poorly understood.
This report records the proceedings of an international workshop during which some novel approaches to improving our understanding of this problem of tool valuation were presented and debated. The value of methods and processes was also discussed. The workshop brought together British, Dutch, German and Italian researchers. The presenters included speakers from industry and academia (the University of Cambridge, the University of Magdeburg and the Politecnico di Torino).
The work presented showed great variety. Research methods include case studies, questionnaires, statistical analysis, semi-structured interviews, deduction, inductive reasoning, the recording of anecdotes and analogies. The presentations drew on financial investment theory, the industrial experience of workshop participants, discussions with students developing tools, modern economic theories and speculation on the effects of company capabilities.
Abstract:
Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to estimate the transcription of the adaptation data. This paper first presents an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for such supplementary acoustic models. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models, via a two-pass decision tree construction process. Second, it is shown that this mapping also enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data. Third, this paper demonstrates how this technique lends itself to the task of unsupervised cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of such an approach. Finally, listener evaluations reveal that the proposed unsupervised adaptation methods deliver performance approaching that of supervised adaptation.
Abstract:
An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize speaker-/language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language. © 2012 IEEE.
Abstract:
For many applications, it is necessary to produce speech transcriptions in a causal fashion. To produce high quality transcripts, speaker adaptation is often used. This requires online speaker clustering and incremental adaptation techniques to be developed. This paper presents an integrated approach to online speaker clustering and adaptation which allows efficient clustering of speakers using the same accumulated statistics that are normally used for adaptation. Using a consistent criterion for both clustering and adaptation should yield gains for both stages. The proposed approach is evaluated on a meetings transcription task using audio from multiple distant microphones. Consistent gains over standard clustering and adaptation were obtained. Copyright © 2011 ISCA.
Abstract:
Increasing demand for energy and the continuing rise in the environmental and financial costs of fossil fuel use drive the need to use fuels from sustainable sources for power generation. Development of fuel-flexible combustion systems is vital in enabling the use of sustainable fuels. It is also important that these sustainable combustion systems meet strict governmental emissions legislation. Biogas is considered one of the viable sustainable fuels that can be used to power modern gas turbines. However, the change in chemical, thermal and transport properties, as well as the change in Wobbe index due to the variation of the fuel constituents, can have a significant effect on the performance of the combustor. It is known that the fuel properties have a strong influence on the dynamic flame response; however, there is a lack of detailed information regarding the effect of fuel composition on the sensitivity of flames subjected to flow perturbations. In this study, we describe an experimental effort investigating the response of premixed biogas-air turbulent flames with varying proportions of CH4 and CO2 to velocity perturbations. The flame was stabilized using a centrally placed conical bluff body. Acoustic perturbations were imposed on the flow using loudspeakers. The flame dynamics and the local heat release rate of these acoustically excited biogas flames were studied using simultaneous measurements of OH and H2CO planar laser induced fluorescence. OH* chemiluminescence along with acoustic pressure measurements were also recorded to estimate the total flame heat release modulation and the velocity fluctuations. The measurements were carried out by keeping the theoretical laminar flame speed constant while varying the bulk velocity and the fuel composition.
The results indicate that the flame sensitivity to perturbations increased with increased dilution of CH4 by CO2 at low amplitude forcing, while at high amplitude forcing conditions the magnitude of the flame response was independent of dilution.
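The dependence of the Wobbe index on fuel composition, mentioned in the abstract above, can be sketched with the standard definition: volumetric higher heating value divided by the square root of the specific gravity relative to air. This is an illustration only, not the paper's calculation. It assumes an ideal-gas CH4/CO2 mixture, treats CO2 as inert, and uses approximate property values (a methane heating value of roughly 39.8 MJ/m^3 at 0 °C and 1 atm); the function names are hypothetical.

```python
import math

# Approximate molar masses in g/mol.
M_AIR = 28.96
M_CH4 = 16.04
M_CO2 = 44.01
# Approximate volumetric higher heating value of methane in MJ/m^3
# (assumed reference conditions: 0 degrees C, 1 atm).
HHV_CH4 = 39.8


def specific_gravity(x_ch4):
    """Density of a CH4/CO2 ideal-gas mixture relative to air."""
    x_co2 = 1.0 - x_ch4
    return (x_ch4 * M_CH4 + x_co2 * M_CO2) / M_AIR


def wobbe_index(x_ch4):
    """Wobbe index in MJ/m^3: heating value over sqrt(specific gravity)."""
    hhv_mixture = x_ch4 * HHV_CH4  # CO2 contributes no heating value
    return hhv_mixture / math.sqrt(specific_gravity(x_ch4))


# Compare pure methane with a biogas-like 60/40 CH4/CO2 blend.
wobbe_pure = wobbe_index(1.0)
wobbe_biogas = wobbe_index(0.6)
```

CO2 dilution lowers the heating value and raises the density at the same time, so the Wobbe index falls faster than the methane fraction alone would suggest.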