944 resultados para Visual Speech Recognition, Multiple Views, Frontal View, Profile View


Relevância:

50.00% 50.00%

Publicador:

Resumo:

While researchers in computer vision and pattern recognition have worked on automatic techniques for recognizing faces for the last 20 years, most systems specialize on frontal views of the face. We present a face recognizer that works under varying pose, the difficult part of which is to handle face rotations in depth. Building on successful template-based systems, our basic approach is to represent faces with templates from multiple model views that cover different poses from the viewing sphere. Our system has achieved a recognition rate of 98% on a data base of 62 people containing 10 testing and 15 modelling views per person.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

We consider the problem of detecting a large number of different classes of objects in cluttered scenes. Traditional approaches require applying a battery of different classifiers to the image, at multiple locations and scales. This can be slow and can require a lot of training data, since each classifier requires the computation of many different image features. In particular, for independently trained detectors, the (run-time) computational complexity, and the (training-time) sample complexity, scales linearly with the number of classes to be detected. It seems unlikely that such an approach will scale up to allow recognition of hundreds or thousands of objects. We present a multi-class boosting procedure (joint boosting) that reduces the computational and sample complexity, by finding common features that can be shared across the classes (and/or views). The detectors for each class are trained jointly, rather than independently. For a given performance level, the total number of features required, and therefore the computational cost, is observed to scale approximately logarithmically with the number of classes. The features selected jointly are closer to edges and generic features typical of many natural structures instead of finding specific object parts. Those generic features generalize better and reduce considerably the computational cost of an algorithm for multi-class object detection.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

A mechanism is proposed that integrates low-level (image processing), mid-level (recursive 3D trajectory estimation), and high-level (action recognition) processes. It is assumed that the system observes multiple moving objects via a single, uncalibrated video camera. A novel extended Kalman filter formulation is used in estimating the relative 3D motion trajectories up to a scale factor. The recursive estimation process provides a prediction and error measure that is exploited in higher-level stages of action recognition. Conversely, higher-level mechanisms provide feedback that allows the system to reliably segment and maintain the tracking of moving objects before, during, and after occlusion. The 3D trajectory, occlusion, and segmentation information are utilized in extracting stabilized views of the moving object. Trajectory-guided recognition (TGR) is proposed as a new and efficient method for adaptive classification of action. The TGR approach is demonstrated using "motion history images" that are then recognized via a mixture of Gaussian classifier. The system was tested in recognizing various dynamic human outdoor activities; e.g., running, walking, roller blading, and cycling. Experiments with synthetic data sets are used to evaluate stability of the trajectory estimator with respect to noise.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Neuronal receptive fields (RFs) provide the foundation for understanding systems-level sensory processing. In early visual areas, investigators have mapped RFs in detail using stochastic stimuli and sophisticated analytical approaches. Much less is known about RFs in prefrontal cortex. Visual stimuli used for mapping RFs in prefrontal cortex tend to cover a small range of spatial and temporal parameters, making it difficult to understand their role in visual processing. To address these shortcomings, we implemented a generalized linear model to measure the RFs of neurons in the macaque frontal eye field (FEF) in response to sparse, full-field stimuli. Our high-resolution, probabilistic approach tracked the evolution of RFs during passive fixation, and we validated our results against conventional measures. We found that FEF neurons exhibited a surprising level of sensitivity to stimuli presented as briefly as 10 ms or to multiple dots presented simultaneously, suggesting that FEF visual responses are more precise than previously appreciated. FEF RF spatial structures were largely maintained over time and between stimulus conditions. Our results demonstrate that the application of probabilistic RF mapping to FEF and similar association areas is an important tool for clarifying the neuronal mechanisms of cognition.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

For the first time in this paper we present results showing the effect of speaker head pose angle on automatic lip-reading performance over a wide range of closely spaced angles. We analyse the effect head pose has upon the features themselves and show that by selecting coefficients with minimum variance w.r.t. pose angle, recognition performance can be improved when train-test pose angles differ. Experiments are conducted using the initial phase of a unique multi view Audio-Visual database designed specifically for research and development of pose-invariant lip-reading systems. We firstly show that it is the higher order horizontal spatial frequency components that become most detrimental as the pose deviates. Secondly we assess the performance of different feature selection masks across a range of pose angles including a new mask based on Minimum Cross-Pose Variance coefficients. We report a relative improvement of 50% in Word Error Rate when using our selection mask over a common energy based selection during profile view lip-reading.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Empirical studies concerning face recognition suggest that faces may be stored in memory by a few canonical representations. Models of visual perception are based on image representations in cortical area V1 and beyond, which contain many cell layers for feature extraction. Simple, complex and end-stopped cells provide input for line, edge and keypoint detection. Detected events provide a rich, multi-scale object representation, and this representation can be stored in memory in order to identify objects. In this paper, the above context is applied to face recognition. The multi-scale line/edge representation is explored in conjunction with keypoint-based saliency maps for Focus-of-Attention. Recognition rates of up to 96% were achieved by combining frontal and 3/4 views, and recognition was quite robust against partial occlusions.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

The report addresses the problem of visual recognition under two sources of variability: geometric and photometric. The geometric deals with the relation between 3D objects and their views under orthographic and perspective projection. The photometric deals with the relation between 3D matte objects and their images under changing illumination conditions. Taken together, an alignment-based method is presented for recognizing objects viewed from arbitrary viewing positions and illuminated by arbitrary settings of light sources.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

We present a statistical image-based shape + structure model for Bayesian visual hull reconstruction and 3D structure inference. The 3D shape of a class of objects is represented by sets of contours from silhouette views simultaneously observed from multiple calibrated cameras. Bayesian reconstructions of new shapes are then estimated using a prior density constructed with a mixture model and probabilistic principal components analysis. We show how the use of a class-specific prior in a visual hull reconstruction can reduce the effect of segmentation errors from the silhouette extraction process. The proposed method is applied to a data set of pedestrian images, and improvements in the approximate 3D models under various noise conditions are shown. We further augment the shape model to incorporate structural features of interest; unknown structural parameters for a novel set of contours are then inferred via the Bayesian reconstruction process. Model matching and parameter inference are done entirely in the image domain and require no explicit 3D construction. Our shape model enables accurate estimation of structure despite segmentation errors or missing views in the input silhouettes, and works even with only a single input view. Using a data set of thousands of pedestrian images generated from a synthetic model, we can accurately infer the 3D locations of 19 joints on the body based on observed silhouette contours from real images.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

We present an example-based learning approach for locating vertical frontal views of human faces in complex scenes. The technique models the distribution of human face patterns by means of a few view-based "face'' and "non-face'' prototype clusters. At each image location, the local pattern is matched against the distribution-based model, and a trained classifier determines, based on the local difference measurements, whether or not a human face exists at the current image location. We provide an analysis that helps identify the critical components of our system.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

We investigate the differences --- conceptually and algorithmically --- between affine and projective frameworks for the tasks of visual recognition and reconstruction from perspective views. It is shown that an affine invariant exists between any view and a fixed view chosen as a reference view. This implies that for tasks for which a reference view can be chosen, such as in alignment schemes for visual recognition, projective invariants are not really necessary. We then use the affine invariant to derive new algebraic connections between perspective views. It is shown that three perspective views of an object are connected by certain algebraic functions of image coordinates alone (no structure or camera geometry needs to be involved).

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Human object recognition is generally considered to tolerate changes of the stimulus position in the visual field. A number of recent studies, however, have cast doubt on the completeness of translation invariance. In a new series of experiments we tried to investigate whether positional specificity of short-term memory is a general property of visual perception. We tested same/different discrimination of computer graphics models that were displayed at the same or at different locations of the visual field, and found complete translation invariance, regardless of the similarity of the animals and irrespective of direction and size of the displacement (Exp. 1 and 2). Decisions were strongly biased towards same decisions if stimuli appeared at a constant location, while after translation subjects displayed a tendency towards different decisions. Even if the spatial order of animal limbs was randomized ("scrambled animals"), no deteriorating effect of shifts in the field of view could be detected (Exp. 3). However, if the influence of single features was reduced (Exp. 4 and 5) small but significant effects of translation could be obtained. Under conditions that do not reveal an influence of translation, rotation in depth strongly interferes with recognition (Exp. 6). Changes of stimulus size did not reduce performance (Exp. 7). Tolerance to these object transformations seems to rely on different brain mechanisms, with translation and scale invariance being achieved in principle, while rotation invariance is not.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Motivation: Intrinsic protein disorder is functionally implicated in numerous biological roles and is, therefore, ubiquitous in proteins from all three kingdoms of life. Determining the disordered regions in proteins presents a challenge for experimental methods and so recently there has been much focus on the development of improved predictive methods. In this article, a novel technique for disorder prediction, called DISOclust, is described, which is based on the analysis of multiple protein fold recognition models. The DISOclust method is rigorously benchmarked against the top.ve methods from the CASP7 experiment. In addition, the optimal consensus of the tested methods is determined and the added value from each method is quantified. Results: The DISOclust method is shown to add the most value to a simple consensus of methods, even in the absence of target sequence homology to known structures. A simple consensus of methods that includes DISOclust can significantly outperform all of the previous individual methods tested.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Perceptual effects of room reverberation on a "sir" or "stir" test-word can be observed when the level of reverberation in the word is increased, while the reverberation in a surrounding 'context I utterance remains at a minimal level. The result is that listeners make more "sit" identifications. When the context's reverberation is also increased, to approach the level in the test word, extrinsic perceptual compensation is observed, so that the number of listeners' "sir" identifications reduces to a value similar to that found with minimal reverberation. Thus far, compensation effects have only been observed with speech or speech-like contexts in which the short-term spectrum changes as the speaker's articulators move. The results reported here show that some noise contexts with static short-term spectra can also give rise to compensation. From these experiments it would appear that compensation requires a context with a temporal envelope that fluctuates to some extent, so that parts of it resemble offsets. These findings are consistent with a rather general kind of perceptual compensation mechanism; one that is informed by the 'tails' that reverberation adds at offsets. Other results reported here show that narrow-band contexts do not bring about compensation, even when their temporal-envelopes are the same as those of the more effective wideband contexts. These results suggest that compensation is confined to the frequency range occupied by the context, and that in a wideband sound it might operate in a 'band by band' manner.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

This paper describes a real-time multi-camera surveillance system that can be applied to a range of application domains. This integrated system is designed to observe crowded scenes and has mechanisms to improve tracking of objects that are in close proximity. The four component modules described in this paper are (i) motion detection using a layered background model, (ii) object tracking based on local appearance, (iii) hierarchical object recognition, and (iv) fused multisensor object tracking using multiple features and geometric constraints. This integrated approach to complex scene tracking is validated against a number of representative real-world scenarios to show that robust, real-time analysis can be performed. Copyright (C) 2007 Hindawi Publishing Corporation. All rights reserved.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

When speech is in competition with interfering sources in rooms, monaural indicators of intelligibility fail to take account of the listener’s abilities to separate target speech from interfering sounds using the binaural system. In order to incorporate these segregation abilities and their susceptibility to reverberation, Lavandier and Culling [J. Acoust. Soc. Am. 127, 387–399 (2010)] proposed a model which combines effects of better-ear listening and binaural unmasking. A computationally efficient version of this model is evaluated here under more realistic conditions that include head shadow, multiple stationary noise sources, and real-room acoustics. Three experiments are presented in which speech reception thresholds were measured in the presence of one to three interferers using real-room listening over headphones, simulated by convolving anechoic stimuli with binaural room impulse-responses measured with dummy-head transducers in five rooms. Without fitting any parameter of the model, there was close correspondence between measured and predicted differences in threshold across all tested conditions. The model’s components of better-ear listening and binaural unmasking were validated both in isolation and in combination. The computational efficiency of this prediction method allows the generation of complex “intelligibility maps” from room designs. © 2012 Acoustical Society of America