926 resultados para Signal Processing, EMD, Thresholding, Acceleration, Displacement, Structural Identification
Resumo:
Visual noise insensitivity is important to audio visual speech recognition (AVSR). Visual noise can take on a number of forms such as varying frame rate, occlusion, lighting or speaker variabilities. The use of a high dimensional secondary classifier on the word likelihood scores from both the audio and video modalities is investigated for the purposes of adaptive fusion. Preliminary results are presented demonstrating performance above the catastrophic fusion boundary for our confidence measure irrespective of the type of visual noise presented to it. Our experiments were restricted to small vocabulary applications.
Resumo:
The performance of automatic speech recognition systems deteriorates in the presence of noise. One known solution is to incorporate video information with an existing acoustic speech recognition system. We investigate the performance of the individual acoustic and visual sub-systems and then examine different ways in which the integration of the two systems may be performed. The system is to be implemented in real time on a Texas Instruments' TMS320C80 DSP.
Resumo:
This paper investigates the use of lip information, in conjunction with speech information, for robust speaker verification in the presence of background noise. It has been previously shown in our own work, and in the work of others, that features extracted from a speaker's moving lips hold speaker dependencies which are complementary with speech features. We demonstrate that the fusion of lip and speech information allows for a highly robust speaker verification system which outperforms the performance of either sub-system. We present a new technique for determining the weighting to be applied to each modality so as to optimize the performance of the fused system. Given a correct weighting, lip information is shown to be highly effective for reducing the false acceptance and false rejection error rates in the presence of background noise
Resumo:
The use of visual features in the form of lip movements to improve the performance of acoustic speech recognition has been shown to work well, particularly in noisy acoustic conditions. However, whether this technique can outperform speech recognition incorporating well-known acoustic enhancement techniques, such as spectral subtraction, or multi-channel beamforming is not known. This is an important question to be answered especially in an automotive environment, for the design of an efficient human-vehicle computer interface. We perform a variety of speech recognition experiments on a challenging automotive speech dataset and results show that synchronous HMM-based audio-visual fusion can outperform traditional single as well as multi-channel acoustic speech enhancement techniques. We also show that further improvement in recognition performance can be obtained by fusing speech-enhanced audio with the visual modality, demonstrating the complementary nature of the two robust speech recognition approaches.
Resumo:
CCTV and surveillance networks are increasingly being used for operational as well as security tasks. One emerging area of technology that lends itself to operational analytics is soft biometrics. Soft biometrics can be used to describe a person and detect them throughout a sparse multi-camera network. This enables them to be used to perform tasks such as determining the time taken to get from point to point, and the paths taken through an environment by detecting and matching people across disjoint views. However, in a busy environment where there are 100's if not 1000's of people such as an airport, attempting to monitor everyone is highly unrealistic. In this paper we propose an average soft biometric, that can be used to identity people who look distinct, and are thus suitable for monitoring through a large, sparse camera network. We demonstrate how an average soft biometric can be used to identify unique people to calculate operational measures such as the time taken to travel from point to point.
Resumo:
Unusual event detection in crowded scenes remains challenging because of the diversity of events and noise. In this paper, we present a novel approach for unusual event detection via sparse reconstruction of dynamic textures over an overcomplete basis set, with the dynamic texture described by local binary patterns from three orthogonal planes (LBPTOP). The overcomplete basis set is learnt from the training data where only the normal items observed. In the detection process, given a new observation, we compute the sparse coefficients using the Dantzig Selector algorithm which was proposed in the literature of compressed sensing. Then the reconstruction errors are computed, based on which we detect the abnormal items. Our application can be used to detect both local and global abnormal events. We evaluate our algorithm on UCSD Abnormality Datasets for local anomaly detection, which is shown to outperform current state-of-the-art approaches, and we also get promising results for rapid escape detection using the PETS2009 dataset.
Resumo:
This paper describes a scene invariant crowd counting algorithm that uses local features to monitor crowd size. Unlike previous algorithms that require each camera to be trained separately, the proposed method uses camera calibration to scale between viewpoints, allowing a system to be trained and tested on different scenes. A pre-trained system could therefore be used as a turn-key solution for crowd counting across a wide range of environments. The use of local features allows the proposed algorithm to calculate local occupancy statistics, and Gaussian process regression is used to scale to conditions which are unseen in the training data, also providing confidence intervals for the crowd size estimate. A new crowd counting database is introduced to the computer vision community to enable a wider evaluation over multiple scenes, and the proposed algorithm is tested on seven datasets to demonstrate scene invariance and high accuracy. To the authors' knowledge this is the first system of its kind due to its ability to scale between different scenes and viewpoints.
Resumo:
This paper proposes the use of eigenvoice modeling techniques with the Cross Likelihood Ratio (CLR) as a criterion for speaker clustering within a speaker diarization system. The CLR has previously been shown to be a robust decision criterion for speaker clustering using Gaussian Mixture Models. Recently, eigenvoice modeling techniques have become increasingly popular, due to its ability to adequately represent a speaker based on sparse training data, as well as an improved capture of differences in speaker characteristics. This paper hence proposes that it would be beneficial to capitalize on the advantages of eigenvoice modeling in a CLR framework. Results obtained on the 2002 Rich Transcription (RT-02) Evaluation dataset show an improved clustering performance, resulting in a 35.1% relative improvement in the overall Diarization Error Rate (DER) compared to the baseline system.
Resumo:
Compressive Sensing (CS) is a popular signal processing technique, that can exactly reconstruct a signal given a small number of random projections of the original signal, provided that the signal is sufficiently sparse. We demonstrate the applicability of CS in the field of gait recognition as a very effective dimensionality reduction technique, using the gait energy image (GEI) as the feature extraction process. We compare the CS based approach to the principal component analysis (PCA) and show that the proposed method outperforms this baseline, particularly under situations where there are appearance changes in the subject. Applying CS to the gait features also avoids the need to train the models, by using a generalised random projection.
Resumo:
A new approach to pattern recognition using invariant parameters based on higher order spectra is presented. In particular, invariant parameters derived from the bispectrum are used to classify one-dimensional shapes. The bispectrum, which is translation invariant, is integrated along straight lines passing through the origin in bifrequency space. The phase of the integrated bispectrum is shown to be scale and amplification invariant, as well. A minimal set of these invariants is selected as the feature vector for pattern classification, and a minimum distance classifier using a statistical distance measure is used to classify test patterns. The classification technique is shown to distinguish two similar, but different bolts given their one-dimensional profiles. Pattern recognition using higher order spectral invariants is fast, suited for parallel implementation, and has high immunity to additive Gaussian noise. Simulation results show very high classification accuracy, even for low signal-to-noise ratios.
Resumo:
Statistics of the estimates of tricoherence are obtained analytically for nonlinear harmonic random processes with known true tricoherence. Expressions are presented for the bias, variance, and probability distributions of estimates of tricoherence as functions of the true tricoherence and the number of realizations averaged in the estimates. The expressions are applicable to arbitrary higher order coherence and arbitrary degree of interaction between modes. Theoretical results are compared with those obtained from numerical simulations of nonlinear harmonic random processes. Estimation of true values of tricoherence given observed values is also discussed
Resumo:
This paper presents results on the robustness of higher-order spectral features to Gaussian, Rayleigh, and uniform distributed noise. Based on cluster plots and accuracy results for various signal to noise conditions, the higher-order spectral features are shown to be better than moment invariant features.
Resumo:
We consider a robust filtering problem for uncertain discrete-time, homogeneous, first-order, finite-state hidden Markov models (HMMs). The class of uncertain HMMs considered is described by a conditional relative entropy constraint on measures perturbed from a nominal regular conditional probability distribution given the previous posterior state distribution and the latest measurement. Under this class of perturbations, a robust infinite horizon filtering problem is first formulated as a constrained optimization problem before being transformed via variational results into an unconstrained optimization problem; the latter can be elegantly solved using a risk-sensitive information-state based filtering.
Resumo:
In this paper, we seek to expand the use of direct methods in real-time applications by proposing a vision-based strategy for pose estimation of aerial vehicles. The vast majority of approaches make use of features to estimate motion. Conversely, the strategy we propose is based on a MR (Multi- Resolution) implementation of an image registration technique (Inverse Compositional Image Alignment ICIA) using direct methods. An on-board camera in a downwards-looking configuration, and the assumption of planar scenes, are the bases of the algorithm. The motion between frames (rotation and translation) is recovered by decomposing the frame-to-frame homography obtained by the ICIA algorithm applied to a patch that covers around the 80% of the image. When the visual estimation is required (e.g. GPS drop-out), this motion is integrated with the previous known estimation of the vehicles’ state, obtained from the on-board sensors (GPS/IMU), and the subsequent estimations are based only on the vision-based motion estimations. The proposed strategy is tested with real flight data in representative stages of a flight: cruise, landing, and take-off, being two of those stages considered critical: take-off and landing. The performance of the pose estimation strategy is analyzed by comparing it with the GPS/IMU estimations. Results show correlation between the visual estimation obtained with the MR-ICIA and the GPS/IMU data, that demonstrate that the visual estimation can be used to provide a good approximation of the vehicle’s state when it is required (e.g. GPS drop-outs). In terms of performance, the proposed strategy is able to maintain an estimation of the vehicle’s state for more than one minute, at real-time frame rates based, only on visual information.