112 results for Speech journalistic unified
Abstract:
Keyword Spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability. This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting.
The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high-level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, in particular for the problematic short-duration target word class.
The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains significant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds.
The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.
The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.
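The details of Dynamic Match Lattice Spotting go beyond this abstract, but the dynamic sequence matching it builds on can be illustrated with a minimal, hypothetical sketch: a Levenshtein-style edit distance between a target phone sequence and a possibly erroneous lattice realisation. All names, costs, and the threshold idea below are illustrative assumptions, not the thesis's implementation.

```python
# Minimal dynamic-programming match between a target phone sequence and an
# observed (possibly erroneous) phone sequence. Costs are illustrative.

def match_cost(target, observed, sub=1, ins=1, dele=1):
    """Levenshtein-style edit distance between two phone sequences."""
    m, n = len(target), len(observed)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = 0 if target[i - 1] == observed[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + same,  # substitution or match
                          d[i - 1][j] + dele,      # phone missing from lattice
                          d[i][j - 1] + ins)       # spurious phone in lattice
    return d[m][n]

# A keyword would be "spotted" when the cost falls below a threshold.
target = ["k", "iy", "w", "er", "d"]
observed = ["k", "iy", "w", "eh", "d"]   # one recognition error
cost = match_cost(target, observed)      # one substitution -> cost 1
```

Allowing a small non-zero cost is what provides the robustness to erroneous lattice realisations that the abstract describes: an exact-match search would reject the example above outright.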
Abstract:
Automatic spoken Language Identification (LID) is the process of identifying the language spoken within an utterance. The challenge that this task presents is that no prior information is available indicating the content of the utterance or the identity of the speaker. The trend of globalization and the pervasive popularity of the Internet will amplify the need for the capabilities spoken language identification systems provide. A prominent application arises in call centres dealing with speakers speaking different languages. Another important application is to index or search huge speech data archives and corpora that contain multiple languages. The aim of this research is to develop techniques targeted at producing a faster and more accurate automatic spoken LID system compared to those of the previous National Institute of Standards and Technology (NIST) Language Recognition Evaluation. Acoustic and phonetic speech information are targeted as the most suitable features for representing the characteristics of a language. To model the acoustic speech features, a Gaussian Mixture Model based approach is employed. Phonetic speech information is extracted using existing speech recognition technology. Various techniques to improve LID accuracy are also studied. One approach examined is the employment of Vocal Tract Length Normalization to reduce the speech variation caused by different speakers. A linear data fusion technique is adopted to combine the various aspects of information extracted from speech. As a result of this research, a LID system was implemented and presented for evaluation in the 2003 Language Recognition Evaluation conducted by NIST.
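The GMM-based acoustic scoring mentioned above can be sketched as follows. The models, parameter values, and feature vectors are toy assumptions, not values from the thesis: each candidate language is scored by the total log-likelihood of the observed feature frames under that language's mixture model, and the highest-scoring language wins.

```python
import numpy as np

# Illustrative diagonal-covariance GMM scoring for language identification.

def gmm_loglik(x, weights, means, variances):
    """Total log-likelihood of feature vectors x (list of frames) under a diagonal GMM."""
    total = 0.0
    for frame in np.asarray(x, dtype=float):
        comp = []
        for w, mu, var in zip(weights, means, variances):
            mu, var = np.asarray(mu, float), np.asarray(var, float)
            # log of a diagonal Gaussian density, weighted by the mixture weight
            log_pdf = -0.5 * np.sum(np.log(2 * np.pi * var) + (frame - mu) ** 2 / var)
            comp.append(np.log(w) + log_pdf)
        total += np.logaddexp.reduce(comp)  # log-sum-exp over components
    return total

# Two toy "language" models over 2-D features (hypothetical parameters).
model_a = dict(weights=[0.5, 0.5], means=[[0, 0], [1, 1]], variances=[[1, 1], [1, 1]])
model_b = dict(weights=[1.0], means=[[5, 5]], variances=[[1, 1]])

features = [[0.1, -0.2], [0.9, 1.1]]   # frames lying near model_a's means
score_a = gmm_loglik(features, **model_a)
score_b = gmm_loglik(features, **model_b)
best = "A" if score_a > score_b else "B"
```

In a real LID system the per-language GMMs would be trained on large corpora and the frames would be acoustic features such as cepstral coefficients; the decision rule, however, is exactly this likelihood comparison.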
Abstract:
In this paper we propose a new method for utilising phase information by complementing traditional magnitude-only spectral subtraction speech enhancement with Complex Spectrum Subtraction (CSS). The proposed approach has the following advantages over traditional magnitude-only spectral subtraction: (a) it introduces complementary information to the enhancement algorithm; (b) it reduces the total number of algorithmic parameters; and (c) it is designed for improving clean speech magnitude spectra and is therefore suitable for both automatic speech recognition (ASR) and speech perception applications. Oracle-based ASR experiments verify this approach, showing an average of 20% relative word accuracy improvement when accurate estimates of the phase spectrum are available. Based on sinusoidal analysis and assuming stationarity between observations (which is shown to be better approximated as the frame rate is increased), this paper also proposes a novel method for acquiring the phase information, called Phase Estimation via Delay Projection (PEDEP). Further oracle ASR experiments validate the potential of the proposed PEDEP technique in ideal conditions. A realistic implementation of CSS with PEDEP shows performance comparable to state-of-the-art spectral subtraction techniques in a range of 15-20 dB signal-to-noise ratio environments. These results clearly demonstrate the potential for using phase spectra in spectral subtractive enhancement applications, and at the same time highlight the need for deriving more accurate phase estimates in a wider range of noise conditions.
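As a rough illustration of the contrast drawn above (not the authors' implementation), magnitude-only spectral subtraction discards phase and reuses the noisy phase, while subtraction in the complex domain can exploit a phase estimate. With an exact noise estimate, the complex form recovers the clean spectrum; the noise model, flooring constant, and signals below are all assumed for the sketch.

```python
import numpy as np

# Contrast magnitude-only spectral subtraction with complex-domain subtraction.

def magnitude_subtraction(noisy_fft, noise_mag, floor=0.01):
    """Subtract a noise magnitude estimate; keep the noisy phase (classic approach)."""
    mag = np.maximum(np.abs(noisy_fft) - noise_mag, floor * np.abs(noisy_fft))
    return mag * np.exp(1j * np.angle(noisy_fft))

def complex_subtraction(noisy_fft, noise_mag, noise_phase):
    """Subtract the full complex noise estimate (requires a phase estimate)."""
    return noisy_fft - noise_mag * np.exp(1j * noise_phase)

# One synthetic frame: a sinusoid plus additive noise, in the STFT domain.
clean = np.fft.rfft(np.sin(2 * np.pi * 5 * np.arange(64) / 64))
noise = np.fft.rfft(0.3 * np.random.default_rng(0).standard_normal(64))
noisy = clean + noise

# With an oracle (exact) noise magnitude and phase, CSS-style subtraction
# reconstructs the clean frame exactly; magnitude subtraction cannot.
recovered = complex_subtraction(noisy, np.abs(noise), np.angle(noise))
```

This mirrors the oracle experiments described in the abstract: the benefit of the complex form hinges entirely on the quality of the phase estimate, which is why PEDEP is needed in realistic conditions.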
Abstract:
In this paper, we present a microphone array beamforming approach to blind speech separation. Unlike previous beamforming approaches, our system does not require a priori knowledge of the microphone placement and speaker location, making the system directly comparable to other blind source separation methods which require no prior knowledge of recording conditions. Microphone locations are automatically estimated using an assumed noise field model, and speaker locations are estimated using cross-correlation based methods. The system is evaluated on the data provided for the PASCAL Speech Separation Challenge 2 (SSC2), achieving a word error rate of 58% on the evaluation set.
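The cross-correlation based location estimation mentioned above rests on estimating the time delay between channels. A minimal sketch of that ingredient, on synthetic signals (standard method, not the paper's code):

```python
import numpy as np

# Cross-correlation based delay estimation between two microphone channels.

def estimate_delay(ref, delayed):
    """Return the lag (in samples) at which `delayed` best matches `ref`."""
    corr = np.correlate(delayed, ref, mode="full")
    # The 'full' output places zero lag at index len(ref) - 1.
    return int(np.argmax(corr) - (len(ref) - 1))

rng = np.random.default_rng(1)
sig = rng.standard_normal(256)
shift = 7
delayed = np.concatenate([np.zeros(shift), sig])[: len(sig)]  # sig delayed by 7
lag = estimate_delay(sig, delayed)
```

From such pairwise delays across the array, source bearings (and with enough microphones, positions) can be triangulated, which is the role this step plays in the localisation pipeline.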
Abstract:
Voice recognition is one of the key enablers to reduce driver distraction as in-vehicle systems become more and more complex. With the integration of voice recognition in vehicles, safety and usability are improved as the driver's eyes and hands are not required to operate system controls. Whilst speaker-independent voice recognition is well developed, performance in high-noise environments (e.g. vehicles) is still limited. La Trobe University and Queensland University of Technology have developed a low-cost hardware-based speech enhancement system for automotive environments based on spectral subtraction and delay–sum beamforming techniques. The enhancement algorithms have been optimised using authentic Australian English collected under typical driving conditions. Performance tests conducted using speech data collected under a variety of vehicle noise conditions demonstrate a word recognition rate improvement in the order of 10% or more under the noisiest conditions. Currently developed to a proof-of-concept stage, the system has potential for even greater performance improvement.
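Delay–sum beamforming, one of the two techniques named above, can be sketched in a few lines: align each channel to compensate its propagation delay, then average, so the speech adds coherently while uncorrelated noise partially cancels. The delays are assumed known here (a real system derives them from the array geometry), and all signals are synthetic.

```python
import numpy as np

# Minimal delay-and-sum beamformer for a two-microphone array.

def delay_and_sum(channels, delays):
    """Shift each channel back by its known delay (in samples) and average."""
    n = len(channels[0])
    aligned = []
    for ch, d in zip(channels, delays):
        # Advance the channel by d samples, zero-padding the tail.
        aligned.append(np.concatenate([ch[d:], np.zeros(d)])[:n])
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(2)
speech = np.sin(2 * np.pi * np.arange(200) / 20)            # toy "speech"
mic1 = speech + 0.5 * rng.standard_normal(200)              # direct channel
mic2 = np.concatenate([np.zeros(3), speech])[:200] + 0.5 * rng.standard_normal(200)

out = delay_and_sum([mic1, mic2], delays=[0, 3])
```

Because the two noise realisations are independent, averaging halves the noise power while leaving the aligned speech intact, which is the enhancement mechanism the hardware system exploits.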
Abstract:
This paper proposes a generic decoupled image-based control scheme for cameras obeying the unified projection model. The scheme is based on the spherical projection model. Invariants to rotational motion are computed from this projection and used to control the translational degrees of freedom. Importantly, we form invariants which decrease the sensitivity of the interaction matrix to object depth variation. Finally, the proposed results are validated with experiments using a classical perspective camera as well as a fisheye camera mounted on a 6-DOF robotic platform.
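The spherical projection underlying the scheme, and the kind of rotation invariant it enables, can be illustrated with a small sketch. The specific points and rotation below are arbitrary assumptions: the inner product (hence the angle) between the spherical projections of two points is unchanged by a pure camera rotation, which is what makes such features usable for controlling translation independently of rotation.

```python
import numpy as np

# Spherical projection and a simple invariant to camera rotation.

def spherical_projection(p):
    """Project a 3-D point onto the unit sphere centred at the camera."""
    p = np.asarray(p, dtype=float)
    return p / np.linalg.norm(p)

def rotation_z(theta):
    """Rotation matrix about the camera's z-axis (example rotation)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

p1, p2 = np.array([1.0, 0.5, 2.0]), np.array([-0.3, 1.0, 1.5])
angle_before = np.dot(spherical_projection(p1), spherical_projection(p2))

# Rotate the scene (equivalently, the camera) and recompute: the value is unchanged.
R = rotation_z(0.7)
angle_after = np.dot(spherical_projection(R @ p1), spherical_projection(R @ p2))
```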
Abstract:
Purpose – The growing debate in the literature indicates that the initiative to implement Knowledge Based Urban Development (KBUD) approaches in the urban development process is neither simple nor quick. Many research efforts have therefore been devoted to the development of appropriate KBUD frameworks and practical KBUD approaches, but this has led to a fragmented and incoherent methodological approach. This paper outlines and compares some of the most popular KBUD frameworks selected from the literature. It aims to identify key and common features in the effort to achieve a unified method of KBUD.
Design/methodology/approach – This paper reviews, examines and identifies various popular KBUD frameworks discussed in the literature from urban planners' viewpoint. It employs a content analysis technique, i.e. a research tool used to determine the presence of certain words or concepts within texts or sets of texts.
Originality/value – The paper reports on the key and common features of selected popular KBUD frameworks. The synthesis of the results is presented from the perspective of urban planners. The findings, which encompass a new KBUD framework incorporating the key and common features, will be valuable in setting a platform to achieve a unified method of KBUD.
Practical implications – The discussion and results presented in this paper should be significant to researchers and practitioners and to any cities and countries that are aiming for KBUD.
Keywords – Knowledge based urban development, Knowledge based urban development framework, Urban development and knowledge economy
Abstract:
Interacting with technology within a vehicle environment using a voice interface can greatly reduce the effects of driver distraction. Most current approaches to this problem only utilise the audio signal, making them susceptible to acoustic noise. An obvious way to circumvent this is to use the visual modality in addition. However, capturing, storing and distributing audio-visual data in a vehicle environment is very costly and difficult. One dataset currently available for such research is the AVICAR [1] database. Unfortunately, this database has been largely unusable due to a timing mismatch between the two streams; in addition, no protocol is available for it. We have overcome these problems by re-synchronising the streams on the phone-number portion of the dataset and establishing a protocol for further research. This paper presents the first audio-visual results on this dataset for speaker-independent speech recognition. We hope this will serve as a catalyst for future research in this area.
Abstract:
A number of recent books on ethics (Hirst and Patching 2005; Tanner et al. 2005; Ward 2006) have indicated that traditional understandings of journalism "objectivity" are in need of renovation if they are to sustain the claim as a guide to ethical action. Ward argues for the recasting of the notions of traditional objectivity to offer a "pragmatic objectivity" as an alternative and plausible underpinning to ethical journalism practice. He argues that a recast or "pragmatic objectivity" should respond to the changing rhetorical relationship between journalists and their audiences and, in so doing, should take inspiration from attempts to be objective in other domains, such as the professions of law and public relations, in seeking models. This paper seeks to take a step in that direction by illustrating how journalism interviews do "objectivity" through an adaptation of the principles of the "Fourth Estate" to political interviews. It turns such analysis to the ends of establishing the particular "pragmatic ethic" underpinning such practices and how journalism interviewing techniques have allowed proactive journalists to strike a workable balance between pursuing the public interest and observing the restraining protocols of modern journalistic practice.
Abstract:
In a recent journal article, Luke Jaaniste and I identified an emergent model of exegesis. From a content analysis of submitted exegeses within a local archive, we identified an approach that is quite different from the traditional thesis, but is also distinct from previously identified forms of exegesis, which Milech and Schilo have described as a ‘context model’ (which assumes the voice of academic objectivity and provides an historical or theoretical context for the creative practice) and a ‘commentary model’ (which takes the form of a first-person reflection on the challenges, insights and achievements of the practice). The model we identified combines these dichotomous forms and assumes a dual orientation: looking outwards to the established field of research, exemplars and theories, and inwards to the methodologies, processes and outcomes of the practice. We went on to argue that this ‘connective’ exegesis offers clear benefits to the researcher in connecting the practice to an established field while allowing the researcher to demonstrate how the methods have led to outcomes that advance the field in some way. And, while it helps the candidate to articulate objective claims for research innovation, it enables them to retain a voiced, personal relationship with their practice. However, it also poses considerable complexities and challenges in the writing. It requires a reconciliation of multi-perspectival subject positions: the disinterested perspective and academic objectivity of an observer/ethnographer/analyst/theorist at times, and the invested perspective of the practitioner/producer at others. The author must also contend with a range of writing styles, speech genres and voices: from the formal, polemical voice of the theorist to the personal, questioning and sometimes emotive voice of reflexivity.
Moreover, the connective exegesis requires the researcher to synthesize these various perspectives, subject positions, writing styles and voices into a unified and coherent text. In this paper I consider strategies for writing a hybrid, connective exegesis. I first ground the discussion of polyvocality and alternate textual structures through reference to recent discussions in philosophy and critical theory, and point to examples of emergent approaches to texts and practices in related fields. I then return to the collection of archived exegeses to investigate the strategies that postgraduate candidates have adopted to resolve the problems that arise from a polyvocal, connective exegesis.
Abstract:
Insensitivity to visual noise is important for audio-visual speech recognition (AVSR). Visual noise can take a number of forms, such as varying frame rate, occlusion, lighting or speaker variability. The use of a high-dimensional secondary classifier on the word likelihood scores from both the audio and video modalities is investigated for the purposes of adaptive fusion. Preliminary results are presented demonstrating performance above the catastrophic fusion boundary for our confidence measure, irrespective of the type of visual noise presented to it. Our experiments were restricted to small-vocabulary applications.