242 results for automatic speech recognition


Relevance:

40.00%

Publisher:

Abstract:

This paper investigates the effects of limited speech data in the context of speaker verification using a probabilistic linear discriminant analysis (PLDA) approach. Being able to reduce the length of required speech data is important to the development of automatic speaker verification systems in real-world applications. When sufficient speech is available, previous research has shown that heavy-tailed PLDA (HTPLDA) modeling of speakers in the i-vector space provides state-of-the-art performance; however, the robustness of HTPLDA to limited speech resources in development, enrolment and verification is an important issue that has not yet been investigated. In this paper, we analyze the speaker verification performance with regard to the duration of utterances used for both speaker evaluation (enrolment and verification) and score normalization and PLDA modeling during development. Two different approaches to total-variability representation are analyzed within the PLDA approach to show improved performance in short-utterance mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development. The results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset suggest that the HTPLDA system can continue to achieve better performance than Gaussian PLDA (GPLDA) as evaluation utterance lengths are decreased. We also highlight the importance of matching durations for score normalization and PLDA modeling to the expected evaluation conditions. Finally, we found that a pooled total-variability approach to PLDA modeling can achieve better performance than the traditional concatenated total-variability approach for short utterances in mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development.

Relevance:

40.00%

Publisher:

Abstract:

This work aims to develop a planetary rover capable of acting as an assistant astrobiologist: making a preliminary analysis of the collected visual images that will help to make better use of the scientists' time by pointing out the most interesting pieces of data. This paper focuses on the problem of detecting and recognising particular types of stromatolites. Inspired by the processes actual astrobiologists go through in the field when identifying stromatolites, the processes we investigate focus on recognising characteristics associated with biogenicity. The extraction of these characteristics is based on the analysis of geometrical structure, enhanced by passing the images of stromatolites through an edge-detection filter and taking the Fourier Transform of the result, revealing typical spatial frequency patterns. The proposed analysis is performed on both simulated images of stromatolite structures and images of real stromatolites taken in the field by astrobiologists.
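The edge-filter-then-Fourier idea in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the central-difference edge filter, the wrap-around border handling, the naive 2-D DFT and the function names `vertical_edges`, `dft2_magnitude` and `dominant_frequency` are all assumptions chosen to keep the example short and dependency-free.

```python
import cmath
import math

def vertical_edges(img):
    """Central-difference gradient along y (assumed edge filter; borders wrap around)."""
    h, w = len(img), len(img[0])
    return [[img[(y + 1) % h][x] - img[(y - 1) % h][x] for x in range(w)]
            for y in range(h)]

def dft2_magnitude(img):
    """Magnitude of a naive 2-D DFT (row transform, then column transform)."""
    h, w = len(img), len(img[0])
    rows = [[sum(img[y][x] * cmath.exp(-2j * math.pi * kx * x / w) for x in range(w))
             for kx in range(w)] for y in range(h)]
    return [[abs(sum(rows[y][kx] * cmath.exp(-2j * math.pi * ky * y / h) for y in range(h)))
             for kx in range(w)] for ky in range(h)]

def dominant_frequency(img):
    """Return (ky, kx) of the strongest non-DC component of the edge image.

    Only ky in 0..h/2 is searched, because the spectrum of a real image is
    conjugate-symmetric and the other half carries the same information.
    """
    mag = dft2_magnitude(vertical_edges(img))
    h, w = len(mag), len(mag[0])
    candidates = [(ky, kx) for ky in range(h // 2 + 1) for kx in range(w)
                  if (ky, kx) != (0, 0)]
    return max(candidates, key=lambda p: mag[p[0]][p[1]])

# Synthetic "laminated" image: 3 horizontal bands repeating over 16 rows.
layers = [[math.sin(2 * math.pi * 3 * y / 16) for _ in range(16)] for y in range(16)]
peak = dominant_frequency(layers)
```

On a synthetic layered image such as `layers`, the peak lands at the spatial frequency of the layering, which is the kind of regular pattern the abstract describes looking for.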

Relevance:

40.00%

Publisher:

Abstract:

To identify and categorize complex stimuli such as familiar objects or speech, the human brain integrates information that is abstracted at multiple levels from its sensory inputs. Using cross-modal priming for spoken words and sounds, this functional magnetic resonance imaging study identified 3 distinct classes of visuoauditory incongruency effects: visuoauditory incongruency effects were selective for 1) spoken words in the left superior temporal sulcus (STS), 2) environmental sounds in the left angular gyrus (AG), and 3) both words and sounds in the lateral and medial prefrontal cortices (IFS/mPFC). From a cognitive perspective, these incongruency effects suggest that prior visual information influences the neural processes underlying speech and sound recognition at multiple levels, with the STS being involved in phonological, AG in semantic, and mPFC/IFS in higher conceptual processing. In terms of neural mechanisms, effective connectivity analyses (dynamic causal modeling) suggest that these incongruency effects may emerge via greater bottom-up effects from early auditory regions to intermediate multisensory integration areas (i.e., STS and AG). This is consistent with a predictive coding perspective on hierarchical Bayesian inference in the cortex where the domain of the prediction error (phonological vs. semantic) determines its regional expression (middle temporal gyrus/STS vs. AG/intraparietal sulcus).

Relevance:

40.00%

Publisher:

Abstract:

Robustness to variations in environmental conditions and camera viewpoint is essential for long-term place recognition, navigation and SLAM. Existing systems typically solve only one of these problems, but invariance to both remains a challenge. This paper presents a training-free approach to lateral viewpoint- and condition-invariant, vision-based place recognition. Our successive frame patch-tracking technique infers average scene depth along traverses and automatically rescales views of the same place at different depths to increase their similarity. We combine our system with the condition-invariant SMART algorithm and demonstrate place recognition between day and night, across entire 4-lane-plus-median-strip roads, where current algorithms fail.

Relevance:

30.00%

Publisher:

Abstract:

For several reasons, the Fourier phase domain is less favored than the magnitude domain in signal processing and modeling of speech. To correctly analyze the phase, several factors must be considered and compensated for, including the effect of the step size, windowing function and other processing parameters. Building on a review of these factors, this paper investigates a spectral representation based on the Instantaneous Frequency Deviation, but in which the step size between processing frames is used in calculating phase changes, rather than the traditional single sample interval. Reflecting these longer intervals, the term delta-phase spectrum is used to distinguish this from instantaneous derivatives. Experiments show that mel-frequency cepstral coefficient features derived from the delta-phase spectrum (termed Mel-Frequency delta-phase features) can produce broadly similar performance to equivalent magnitude domain features for both voice activity detection and speaker recognition tasks. Further, it is shown that the fusion of the magnitude and phase representations yields performance benefits over either in isolation.
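One plausible reading of a frame-to-frame phase change can be sketched as follows. The abstract does not give the exact definition, so this is an assumption: for each frequency bin, take the phase difference between consecutive frames, subtract the phase advance expected for that bin over one step, and wrap the remainder. The Hann window, the naive DFT and the names `dft`, `wrap_phase` and `delta_phase` are illustrative choices, not the paper's.

```python
import cmath
import math

def dft(frame):
    """Naive DFT of a real frame, bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n // 2 + 1)]

def wrap_phase(phi):
    """Wrap an angle into [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi

def delta_phase(signal, frame_len, hop):
    """Per-bin phase change between consecutive frames, taken over the hop size.

    The expected advance 2*pi*k*hop/frame_len of bin k is removed, so a
    sinusoid sitting exactly on a bin centre gives a deviation of about zero.
    """
    # Periodic Hann window (an assumed choice of windowing function).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len)
              for t in range(frame_len)]
    phases = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = dft([signal[start + t] * window[t] for t in range(frame_len)])
        phases.append([cmath.phase(c) for c in spectrum])
    result = []
    for prev, cur in zip(phases, phases[1:]):
        result.append([wrap_phase(p1 - p0 - 2 * math.pi * k * hop / frame_len)
                       for k, (p0, p1) in enumerate(zip(prev, cur))])
    return result
```

Mel-frequency features would then be derived from this spectrum in the same way as from a magnitude spectrum; that step is omitted here.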

Relevance:

30.00%

Publisher:

Abstract:

Inspection of solder joints has been a critical process in the electronic manufacturing industry to reduce manufacturing cost, improve yield, and ensure product quality and reliability. This paper proposes two inspection modules for an automatic solder joint classification system. The “front-end” inspection system includes illumination normalisation, localisation and segmentation. The “back-end” inspection involves the classification of solder joints using the Log Gabor filter and classifier fusion. Five different levels of solder quality with respect to the amount of solder paste have been defined. The Log Gabor filter has been demonstrated to achieve high recognition rates and is resistant to misalignment. The proposed system does not need any special illumination system, and the images are acquired by an ordinary digital camera. This system could contribute to the development of automated non-contact, non-destructive and low-cost solder joint quality inspection systems.
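For readers unfamiliar with the Log Gabor filter, its standard radial transfer function is a Gaussian on a logarithmic frequency axis. The sketch below shows only that textbook form; the function name `log_gabor`, the parameter names and the default bandwidth value are assumptions, and the abstract does not state which parameters or orientations the authors used.

```python
import math

def log_gabor(f, f0, sigma_ratio=0.55):
    """Radial log-Gabor transfer function.

    f           -- spatial frequency (for example cycles/pixel), f >= 0
    f0          -- centre frequency of the filter
    sigma_ratio -- sigma / f0, which sets the bandwidth (illustrative default)
    """
    if f <= 0.0:
        return 0.0  # the filter has no DC component by construction
    return math.exp(-(math.log(f / f0) ** 2) / (2.0 * math.log(sigma_ratio) ** 2))

# Response sampled at a few frequencies around an assumed centre of 0.1 cycles/pixel.
responses = [log_gabor(f, 0.1) for f in (0.0, 0.05, 0.1, 0.2)]
```

The response is 1 at the centre frequency, 0 at DC, and symmetric on a log axis, so one octave above and one octave below the centre are attenuated equally.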