289 resultados para word recognition

em Queensland University of Technology - ePrints Archive


Relevância:

70.00% 70.00%

Publicador:

Resumo:

Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In “noise-free” environments, word recognition performance of these systems has been shown to approach 100%, however this performance degrades rapidly as the level of background noise is increased. Speech enhancement is a popular method for making ASR systems more ro- bust. Single-channel spectral subtraction was originally designed to improve hu- man speech intelligibility and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics are used to assess en- hancement performance for intelligibility not speech recognition, therefore mak- ing them sub-optimal ASR applications. This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally-efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) in- troducing phase spectrum information to enable spectral subtraction in the com- plex frequency domain. Likelihood-maximisation uses gradient-descent to optimise parameters of the enhancement algorithm to best fit the acoustic speech model given a word se- quence known a priori. Whilst this technique is shown to improve the ASR word accuracy performance, it is also identified to be particularly sensitive to non-noise mismatches between the training and testing data. Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of clean speech magnitudes which are typically used in ASR feature extraction. Phase Estimation via Delay Projection is proposed based on the stationarity of sinusoidal signals, and demonstrates the potential to produce improvements in ASR word accuracy in a wide range of SNR. Throughout the dissertation, consideration is given to practical implemen- tation in vehicular environments which resulted in two novel contributions – a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware. The techniques proposed in this dissertation were evaluated using the Aus- tralian English In-Car Speech Corpus which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

In recent times, the improved levels of accuracy obtained by Automatic Speech Recognition (ASR) technology has made it viable for use in a number of commercial products. Unfortunately, these types of applications are limited to only a few of the world’s languages, primarily because ASR development is reliant on the availability of large amounts of language specific resources. This motivates the need for techniques which reduce this language-specific, resource dependency. Ideally, these approaches should generalise across languages, thereby providing scope for rapid creation of ASR capabilities for resource poor languages. Cross Lingual ASR emerges as a means for addressing this need. Underpinning this approach is the observation that sound production is largely influenced by the physiological construction of the vocal tract, and accordingly, is human, and not language specific. As a result, a common inventory of sounds exists across languages; a property which is exploitable, as sounds from a resource poor, target language can be recognised using models trained on resource rich, source languages. One of the initial impediments to the commercial uptake of ASR technology was its fragility in more challenging environments, such as conversational telephone speech. Subsequent improvements in these environments has gained consumer confidence. Pragmatically, if cross lingual techniques are to considered a viable alternative when resources are limited, they need to perform under the same types of conditions. Accordingly, this thesis evaluates cross lingual techniques using two speech environments; clean read speech and conversational telephone speech. Languages used in evaluations are German, Mandarin, Japanese and Spanish. Results highlight that previously proposed approaches provide respectable results for simpler environments such as read speech, but degrade significantly when in the more taxing conversational environment. Two separate approaches for addressing this degradation are proposed. The first is based on deriving better target language lexical representation, in terms of the source language model set. The second, and ultimately more successful approach, focuses on improving the classification accuracy of context-dependent (CD) models, by catering for the adverse influence of languages specific phonotactic properties. Whilst the primary research goal in this thesis is directed towards improving cross lingual techniques, the catalyst for investigating its use was based on expressed interest from several organisations for an Indonesian ASR capability. In Indonesia alone, there are over 200 million speakers of some Malay variant, provides further impetus and commercial justification for speech related research on this language. Unfortunately, at the beginning of the candidature, limited research had been conducted on the Indonesian language in the field of speech science, and virtually no resources existed. This thesis details the investigative and development work dedicated towards obtaining an ASR system with a 10000 word recognition vocabulary for the Indonesian language.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Voice recognition is one of the key enablers to reduce driver distraction as in-vehicle systems become more and more complex. With the integration of voice recognition in vehicles, safety and usability are improved as the driver’s eyes and hands are not required to operate system controls. Whilst speaker independent voice recognition is well developed, performance in high noise environments (e.g. vehicles) is still limited. La Trobe University and Queensland University of Technology have developed a low-cost hardware-based speech enhancement system for automotive environments based on spectral subtraction and delay–sum beamforming techniques. The enhancement algorithms have been optimised using authentic Australian English collected under typical driving conditions. Performance tests conducted using speech data collected under variety of vehicle noise conditions demonstrate a word recognition rate improvement in the order of 10% or more under the noisiest conditions. Currently developed to a proof of concept stage there is potential for even greater performance improvement.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Audio-visualspeechrecognition, or the combination of visual lip-reading with traditional acoustic speechrecognition, has been previously shown to provide a considerable improvement over acoustic-only approaches in noisy environments, such as that present in an automotive cabin. The research presented in this paper will extend upon the established audio-visualspeechrecognition literature to show that further improvements in speechrecognition accuracy can be obtained when multiple frontal or near-frontal views of a speaker's face are available. A series of visualspeechrecognition experiments using a four-stream visual synchronous hidden Markov model (SHMM) are conducted on the four-camera AVICAR automotiveaudio-visualspeech database. We study the relative contribution between the side and central orientated cameras in improving visualspeechrecognition accuracy. Finally combination of the four visual streams with a single audio stream in a five-stream SHMM demonstrates a relative improvement of over 56% in word recognition accuracy when compared to the acoustic-only approach in the noisiest conditions of the AVICAR database.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Free association norms indicate that words are organized into semantic/associative neighborhoods within a larger network of words and links that bind the net together. We present evidence indicating that memory for a recent word event can depend on implicitly and simultaneously activating related words in its neighborhood. Processing a word during encoding primes its network representation as a function of the density of the links in its neighborhood. Such priming increases recall and recognition and can have long lasting effects when the word is processed in working memory. Evidence for this phenomenon is reviewed in extralist cuing, primed free association, intralist cuing, and single-item recognition tasks. The findings also show that when a related word is presented to cue the recall of a studied word, the cue activates it in an array of related words that distract and reduce the probability of its selection. The activation of the semantic network produces priming benefits during encoding and search costs during retrieval. In extralist cuing recall is a negative function of cue-to-distracter strength and a positive function of neighborhood density, cue-to-target strength, and target-to cue strength. We show how four measures derived from the network can be combined and used to predict memory performance. These measures play different roles in different tasks indicating that the contribution of the semantic network varies with the context provided by the task. We evaluate spreading activation and quantum-like entanglement explanations for the priming effect produced by neighborhood density.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Digital devices like smart phones and tablet computers are becoming commonplace in young children’s lives for play, entertainment, learning and communication. Recently, there has been a great deal of focus on the educational potential of devices like iPads in both formal and informal educational settings. There is now an abundance of educational ‘apps’ available to children, parents, and kindergarten and pre-school teachers that claim to enhance children’s early literacy and numeracy development and creativity. To date, though, there has been very little formal investigation of the educational potential of these devices. This book discusses the impact on children’s learning when iPads were introduced in three very different kindergartens in Brisbane, Australia. Chapters outline how researchers worked with pre-school teachers and parents to explore how iPads can assist with letter and word recognition, the development of oral literacy and talk around play. The book also considers the possibilities for using iPads for creativity and arts education through photography, storytelling, drawing, music creation and audio recording.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper describes our participation in the Chinese word segmentation task of CIPS-SIGHAN 2010. We implemented an n-gram mutual information (NGMI) based segmentation algorithm with the mixed-up features from unsupervised, supervised and dictionarybased segmentation methods. This algorithm is also combined with a simple strategy for out-of-vocabulary (OOV) word recognition. The evaluation for both open and closed training shows encouraging results of our system. The results for OOV word recognition in closed training evaluation were however found unsatisfactory.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

What helps us determine whether a word is a noun or a verb, without conscious awareness? We report on cues in the way individual English words are spelled, and, for the first time, identify their neural correlates via functional magnetic resonance imaging (fMRI). We used a lexical decision task with trisyllabic nouns and verbs containing orthographic cues that are either consistent or inconsistent with the spelling patterns of words from that grammatical category. Significant linear increases in response times and error rates were observed as orthography became less consistent, paralleled by significant linear decreases in blood oxygen level dependent (BOLD) signal in the left supramarginal gyrus of the left inferior parietal lobule, a brain region implicated in visual word recognition. A similar pattern was observed in the left superior parietal lobule. These findings align with an emergentist view of grammatical category processing which results from sensitivity to multiple probabilistic cues.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We investigated the neural correlates of semantic priming by using event-related fMRI to record blood oxygen level dependent (BOLD) responses while participants performed speeded lexical decisions (word/nonword) on visually presented related versus unrelated prime-target pairs. A long stimulus onset asynchrony of 1000 ms was employed, which allowed for increased controlled processing and selective frequency-based ambiguity priming. Conditions included an ambiguous word prime (e.g. bank) and a target related to its dominant (e.g. money) or subordinate meaning (e.g. river). Compared to an unrelated condition, primed dominant targets were associated with increased activity in the LIFG, the right anterior cingulate and superior temporal gyrus, suggesting postlexical semantic integrative mechanisms, while increased right supramarginal activity for the unrelated condition was consistent with expectancy based priming. Subordinate targets were not primed and were associated with reduced activity primarily in occipitotemporal regions associated with word recognition, which may be consistent with frequency-based meaning suppression. These findings provide new insights into the neural substrates of semantic priming and the functional-anatomic correlates of lexical ambiguity suppression mechanisms.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

We used event-related fMRI to investigate the neural correlates of encoding strength and word frequency effects in recognition memory. At test, participants made Old/New decisions to intermixed low (LF) and high frequency (HF) words that had been presented once or twice at study and to new, unstudied words. The Old/New effect for all hits vs. correctly rejected unstudied words was associated with differential activity in multiple cortical regions, including the anterior medial temporal lobe (MTL), hippocampus, left lateral parietal cortex and anterior left inferior prefrontal cortex (LIPC). Items repeated at study had superior hit rates (HR) compared to items presented once and were associated with reduced activity in the right anterior MTL. By contrast, other regions that had shown conventional Old/New effects did not demonstrate modulation according to memory strength. A mirror effect for word frequency was demonstrated, with the LF word HR advantage associated with increased activity in the left lateral temporal cortex. However, none of the regions that had demonstrated Old/New item retrieval effects showed modulation according to word frequency. These findings are interpreted as supporting single-process memory models proposing a unitary strength-like memory signal and models attributing the LF word HR advantage to the greater lexico-semantic context-noise associated with HF words due to their being experienced in many pre-experimental contexts.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this paper we propose a new method for utilising phase information by complementing it with traditional magnitude-only spectral subtraction speech enhancement through Complex Spectrum Subtraction (CSS). The proposed approach has the following advantages over traditional magnitude-only spectral subtraction: (a) it introduces complementary information to the enhancement algorithm; (b) it reduces the total number of algorithmic parameters, and; (c) is designed for improving clean speech magnitude spectra and is therefore suitable for both automatic speech recognition (ASR) and speech perception applications. Oracle-based ASR experiments verify this approach, showing an average of 20% relative word accuracy improvements when accurate estimates of the phase spectrum are available. Based on sinusoidal analysis and assuming stationarity between observations (which is shown to be better approximated as the frame rate is increased), this paper also proposes a novel method for acquiring the phase information called Phase Estimation via Delay Projection (PEDEP). Further oracle ASR experiments validate the potential for the proposed PEDEP technique in ideal conditions. Realistic implementation of CSS with PEDEP shows performance comparable to state of the art spectral subtraction techniques in a range of 15-20 dB signal-to-noise ratio environments. These results clearly demonstrate the potential for using phase spectra in spectral subtractive enhancement applications, and at the same time highlight the need for deriving more accurate phase estimates in a wider range of noise conditions.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Visual noise insensitivity is important to audio visual speech recognition (AVSR). Visual noise can take on a number of forms such as varying frame rate, occlusion, lighting or speaker variabilities. The use of a high dimensional secondary classifier on the word likelihood scores from both the audio and video modalities is investigated for the purposes of adaptive fusion. Preliminary results are presented demonstrating performance above the catastrophic fusion boundary for our confidence measure irrespective of the type of visual noise presented to it. Our experiments were restricted to small vocabulary applications.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Word frequency (WF) and strength effects are two important phenomena associated with episodic memory. The former refers to the superior hit-rate (HR) for low (LF) compared to high frequency (HF) words in recognition memory, while the latter describes the incremental effect(s) upon HRs associated with repeating an item at study. Using the "subsequent memory" method with event-related fMRI, we tested the attention-at-encoding (AE) [M. Glanzer, J.K. Adams, The mirror effect in recognition memory: data and theory, J. Exp. Psychol.: Learn Mem. Cogn. 16 (1990) 5-16] explanation of the WF effect. In addition to investigating encoding strength, we addressed if study involves accessing prior representations of repeated items via the same mechanism as that at test [J.L. McClelland, M. Chappell, Familiarity breeds differentiation: a subjective-likelihood approach to the effects of experience in recognition memory, Psychol. Rev. 105 (1998) 724-760], entailing recollection [K.J. Malmberg, J.E. Holden, R.M. Shiffrin, Modeling the effects of repetitions, similarity, and normative word frequency on judgments of frequency and recognition memory, J. Exp. Psychol.: Learn Mem. Cogn. 30 (2004) 319-331] and whether less processing effort is entailed for encoding each repetition [M. Cary, L.M. Reder, A dual-process account of the list-length and strength-based mirror effects in recognition, J. Mem. Lang. 49 (2003) 231-248]. The increased BOLD responses observed in the left inferior prefrontal cortex (LIPC) for the WF effect provide support for an AE account. Less effort does appear to be required for encoding each repetition of an item, as reduced BOLD responses were observed in the LIPC and left lateral temporal cortex; both regions demonstrated increased responses in the conventional subsequent memory analysis. At test, a left lateral parietal BOLD response was observed for studied versus unstudied items, while only medial parietal activity was observed for repeated items at study, indicating that accessing prior representations at encoding does not necessarily occur via the same mechanism as that at test, and is unlikely to involve a conscious recall-like process such as recollection. This information may prove useful for constraining cognitive theories of episodic memory.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Automatic speech recognition from multiple distant micro- phones poses significant challenges because of noise and reverberations. The quality of speech acquisition may vary between microphones because of movements of speakers and channel distortions. This paper proposes a channel selection approach for selecting reliable channels based on selection criterion operating in the short-term modulation spectrum domain. The proposed approach quantifies the relative strength of speech from each microphone and speech obtained from beamforming modulations. The new technique is compared experimentally in the real reverb conditions in terms of perceptual evaluation of speech quality (PESQ) measures and word error rate (WER). Overall improvement in recognition rate is observed using delay-sum and superdirective beamformers compared to the case when the channel is selected randomly using circular microphone arrays.