19 results for Visual Speech Recognition, Multiple Views, Frontal View, Profile View





Over the course of the last decade, infrared (IR) and particularly thermal IR imaging based face recognition has emerged as a promising complement to conventional, visible spectrum based approaches which continue to struggle when applied in practice. While inherently insensitive to visible spectrum illumination changes, IR data introduces specific challenges of its own, most notably sensitivity to factors which affect facial heat emission patterns, e.g. emotional state, ambient temperature, and alcohol intake. In addition, facial expression and pose changes are more difficult to correct in IR images because they are less rich in high frequency detail which is an important cue for fitting any deformable model. In this paper we describe a novel method which addresses these major challenges. Specifically, when comparing two thermal IR images of faces, we mutually normalize their poses and facial expressions by using an active appearance model (AAM) to generate synthetic images of the two faces with a neutral facial expression and in the same view (the average of the two input views). This is achieved by piecewise affine warping which follows AAM fitting. A major contribution of our work is the use of an AAM ensemble in which each AAM is specialized to a particular range of poses and a particular region of the thermal IR face space. Combined with the contributions from our previous work which addressed the problem of reliable AAM fitting in the thermal IR spectrum, and the development of a person-specific representation robust to transient changes in the pattern of facial temperature emissions, the proposed ensemble framework accurately matches faces across the full range of yaw from frontal to profile, even in the presence of scale variation (e.g. due to the varying distance of a subject from the camera). 
The effectiveness of the proposed approach is demonstrated on the largest public database of thermal IR images of faces and a newly acquired data set of thermal IR motion videos. Our approach achieved perfect recognition performance on both data sets, significantly outperforming the current state-of-the-art methods even when they are trained with multiple images spanning a range of head views.





Face recognition with multiple views is a challenging research problem. Most existing work has focused on extracting information shared among multiple views to improve recognition. However, when the pose variation is too large or some poses are missing, this 'shared information' may not be properly extracted, leading to poor recognition results. In this paper, we propose a novel method for face recognition with multiple view images that overcomes the large pose variation and missing pose issues. By introducing a novel mixed norm, the proposed method automatically selects candidates from the gallery that best represent a group of highly correlated face images in a query set, improving classification accuracy. This mixed norm combines the advantages of both sparse representation based classification (SRC) and joint sparse representation based classification (JSRC): it introduces a trade-off between the ℓ1-norm from SRC and the ℓ2,1-norm from JSRC. Due to this property, the proposed method reduces the influence of a face image that is unseen and has large pose variation in the recognition process. When some face images with a certain degree of unseen pose variation appear, the mixed norm finds an optimal representation for these query images based on the shared information induced from multiple views. Moreover, we also address an open problem in robust sparse representation and classification, namely using the ℓ1-norm on the loss function to achieve a robust solution. To solve this formulation, we derive a simple yet provably convergent algorithm based on the alternating direction method of multipliers (ADMM) framework. We provide extensive comparisons which demonstrate that our method outperforms other state-of-the-art algorithms on the CMU-PIE, Yale B and Multi-PIE databases for multi-view face recognition.
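
The abstract above describes the mixed norm only as a trade-off between the ℓ1-norm and the ℓ2,1-norm. As a rough illustration, the sketch below evaluates one plausible form of such a penalty, a weighted sum of the two norms, on a coefficient matrix whose rows correspond to gallery atoms and whose columns correspond to query views. The weighted-sum form, the function names, and the weights `lam1` and `lam2` are assumptions made for illustration and are not taken from the paper.

```python
import math

def l1_norm(X):
    """Sum of absolute values of all entries (element-wise sparsity, as in SRC)."""
    return sum(abs(v) for row in X for v in row)

def l21_norm(X):
    """Sum of the Euclidean norms of the rows (row-wise joint sparsity, as in JSRC)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in X)

def mixed_norm(X, lam1, lam2):
    """Hypothetical mixed penalty: a weighted sum of the two norms.

    lam1 and lam2 are illustrative trade-off weights, not values from the paper.
    """
    return lam1 * l1_norm(X) + lam2 * l21_norm(X)

# Rows are gallery atoms, columns are query views (toy values).
coefficients = [
    [3.0, 4.0],   # atom used by both views
    [0.0, 0.0],   # atom used by neither view
]
penalty = mixed_norm(coefficients, lam1=1.0, lam2=1.0)
```

Setting `lam2` to zero leaves a purely element-wise penalty, while setting `lam1` to zero leaves a penalty that forces all views to share the same active rows; intermediate values sit between these two behaviours.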





This paper outlines a methodology to generate a distinctive object representation offline, using short-baseline stereo fundamentals to triangulate highly descriptive object features in multiple pairs of stereo images. A group of sparse 2.5D perspective views is built, and the multiple views are then fused into a single sparse 3D model using a common 3D shape registration technique. Having prior knowledge, such as the proposed sparse feature model, is useful when detecting an object and estimating its pose for real-time systems like augmented reality.
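
The triangulation step mentioned above can be illustrated with the textbook relation for a rectified stereo pair, where depth equals focal length times baseline divided by disparity. The sketch below is a minimal example under that assumption; the function name, parameter names, and the rectified pinhole camera model are illustrative choices and are not details from the paper.

```python
def triangulate_rectified(u_left, v_left, u_right, focal_px, baseline_m, cx, cy):
    """Triangulate one feature matched across a rectified stereo pair.

    Assumes an ideal pinhole model with identical cameras, horizontal baseline,
    focal length in pixels, and principal point (cx, cy). Returns (x, y, z) in
    metres in the left camera frame.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity   # depth from disparity
    x = (u_left - cx) * z / focal_px        # back-project the left pixel
    y = (v_left - cy) * z / focal_px
    return (x, y, z)

# A feature seen 20 px further right in the left image than in the right image.
point = triangulate_rectified(340.0, 240.0, 320.0,
                              focal_px=500.0, baseline_m=0.1,
                              cx=320.0, cy=240.0)
```

Because depth varies inversely with disparity, a short baseline gives small disparities for distant points, so pixel-level matching errors translate into larger depth errors as distance grows.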





In this paper, we present a method for recognising an agent's behaviour in dynamic, noisy, uncertain domains, and across multiple levels of abstraction. We term this problem on-line plan recognition under uncertainty and view it generally as probabilistic inference on the stochastic process representing the execution of the agent's plan. Our contributions in this paper are twofold. In terms of probabilistic inference, we introduce the Abstract Hidden Markov Model (AHMM), a novel type of stochastic process, provide its dynamic Bayesian network (DBN) structure and analyse the properties of this network. We then describe an application of the Rao-Blackwellised Particle Filter to the AHMM which allows us to construct an efficient, hybrid inference method for this model. In terms of plan recognition, we propose a novel plan recognition framework based on the AHMM as the plan execution model. The Rao-Blackwellised hybrid inference for the AHMM can take advantage of the independence properties inherent in a model of plan execution, leading to an algorithm for online probabilistic plan recognition that scales well with the number of levels in the plan hierarchy. This illustrates that while stochastic models for plan execution can be complex, they exhibit special structures which, if exploited, can lead to efficient plan recognition algorithms. We demonstrate the usefulness of the AHMM framework via a behaviour recognition system in a complex spatial environment using distributed video surveillance data.





In this paper, we present our system for online context recognition of multimodal sequences acquired from multiple sensors. The system uses Dynamic Time Warping (DTW) to recognize multimodal sequences of different lengths, embedded in continuous data streams. We evaluate the performance of our system on two real-world datasets: 1) accelerometer data acquired from performing two hand gestures and 2) NOKIA's benchmark dataset for context recognition. The results from both datasets demonstrate that the system can perform online context recognition efficiently and achieve high recognition accuracy.
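
For readers unfamiliar with DTW, the sketch below shows the classic dynamic-programming form of the distance between two multi-channel sequences of different lengths. It compares two complete sequences offline; the online matching inside continuous streams described in the abstract is not reproduced here. The function name and the choice of Euclidean distance between samples are assumptions for illustration.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of equal-dimension samples.

    Each sample is a tuple of sensor readings; the local cost is the Euclidean
    distance between samples. Runs in O(len(seq_a) * len(seq_b)) time.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] is the best alignment cost of the first i and first j samples.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # sample of seq_a repeated
                                     cost[i][j - 1],      # sample of seq_b repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Two-channel toy sequences: the second one lingers on the middle sample.
template = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
observed = [(1.0, 0.0), (2.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
distance = dtw_distance(template, observed)
```

The warping allows one sample to align with several samples of the other sequence, which is what lets sequences performed at different speeds be compared directly.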





In many real-world computer vision applications, such as multi-camera surveillance, the objects of interest are captured by visual sensors concurrently, resulting in multi-view data. These views usually provide complementary information to each other. One recent and powerful computer vision method for clustering is sparse subspace clustering (SSC); however, it was not designed for multi-view data, which violate its linear separability assumption. To integrate complementary information between views, multi-view clustering algorithms are required to improve the clustering performance. In this paper, we propose a novel multi-view subspace clustering method that searches for a unified latent structure in the form of a global affinity matrix for subspace clustering. Because it integrates the affinity matrices of the individual views, this global affinity matrix can best represent the relationship between clusters, which helps achieve better performance on face clustering. To solve the formulation, we derive a provably convergent and computationally efficient algorithm based on the alternating direction method of multipliers (ADMM) framework. We demonstrate that this formulation outperforms state-of-the-art alternatives on challenging multi-view face datasets.





This paper maps the current debates surrounding school-based and university-based teacher education models, and presents a ‘multiple-space’ model of teacher education that both explores and values the many ‘forgotten’ spaces that teachers work in. It draws from a variety of research studies, including my own doctoral work, to argue for a new approach to teacher education programs. I suggest that in order for teacher education to move beyond separatist, binary models, we need to adopt a ‘multiple-space’ view of learning to be a teacher that embraces the notion that teachers do not learn about theory in a university space, nor do they simply work in a classroom space.





Speaker recognition is the process of automatically recognizing the speaker by analyzing individual information contained in the speech waves. In this paper, we discuss the development of an intelligent system for text-dependent speaker recognition. The system comprises two main modules: a wavelet-based signal-processing module for feature extraction from speech waves, and an artificial-neural-network-based classifier module to identify and categorize the speakers. Wavelets from the Daubechies family are used to de-noise and compress the speech signals. The features extracted from the speech waves are then fed to the neural-network-based classifier to identify the speakers. We have implemented the Fuzzy ARTMAP (FAM) network in the classifier module to categorize the de-noised and compressed signals. The proposed intelligent learning system has been applied to a case study of the text-dependent speaker recognition problem.





Super-resolution is a method of post-processing image enhancement that increases the spatial resolution of video or images. Existing super-resolution techniques apply only to images captured of a planar scene. This paper aims to extend super-resolution concepts from the 2D domain to the 3D domain, drawing on ideas from both super-resolution and multi-view geometry, two fields of research that until now have predominantly been studied in isolation. 2D super-resolution methods are not without their complexities and limitations. However, once multiple views of a scene are considered within a super-resolution framework, a new range of issues arises that must also be resolved. For example, when input images of a scene with variation in depth are considered, it is no longer clear how and where the images should be registered. This paper describes the use of sparse 3D reconstruction in order to ‘register’ the input images, which are then transferred to a novel image plane and combined to increase the perceived detail in the scene. Experimental results using real images captured from generally positioned input cameras are presented.





This paper reports a single case of ipsilesional left neglect dyslexia and interprets it according to the three-level model of visual word recognition proposed by Caramazza and Hillis (1990). The three levels reflect a progression from the physical stimulus to an abstract representation of a word. RR was not impaired at the first, retinocentric, level, which represents the individual features of letters within a word according to the location of the word in the visual field: She made the same number of errors to words presented in her left visual field as in her right visual field. A deficit at this level should also mean the patient neglects all stimuli. This did not occur with RR: She did not neglect when naming the items in rows of objects and rows of geometric symbols. In addition, although she displayed significant neglect dyslexia when making visual matching judgements on pairs of words and nonwords, she did not do so to pairs of nonsense letter shapes, shapes which display the same level of visual complexity as letters in words. RR was not impaired at the third, graphemic, level, which represents the ordinal positions of letters within a word: She continued to neglect the leftmost (spatial) letter of words presented in mirror-reversed orientation and she did not neglect in oral spelling. By elimination, these results suggest RR's deficit affects a spatial reference frame where the representational space is bounded by the stimulus: A stimulus-centred level of representation. We define five characteristics of a stimulus-centred deficit, as manifest in RR. First, it is not the case that neglect dyslexia occurs because the remaining letters in a string attract or capture attention away from the leftmost letter(s). Second, the deficit is continuous across the letter string. Third, perceptually significant features, such as spaces, define potential words. Fourth, the whole, rather than part, of a letter is neglected. Fifth, category information is preserved. 
It is concluded that the Caramazza-Hillis model accounts well for RR's data, although we conclude that neglect dyslexia can be present when a more general visuospatial neglect is absent.





Multimedia information is now routinely available in the forms of text, pictures, animation and sound. Although text objects are relatively easy to deal with (in terms of information search and retrieval), other information bearing objects (such as sound, images, animation) are more difficult to index. Our research is aimed at developing better ways of representing multimedia objects by using a conceptual representation based on Schank's conceptual dependencies. Moreover, the representation allows for users' individual interpretations to be embedded in the system. This will alleviate the problems associated with traditional semantic networks by allowing for coexistence of multiple views of the same information. The viability of the approach is tested, and the preliminary results reported.





We develop an algorithm for the detection and classification of affective sound events underscored by specific patterns of sound energy dynamics. We relate the portrayal of these events to the proposed high-level affect or emotional coloring of the events. In this paper, four possible characteristic sound energy events are identified that convey well-established meanings through their dynamics to portray and deliver a certain affect or sentiment related to the horror film genre. Our algorithm is developed with the ultimate aim of automatically structuring sections of films that contain distinct shades of emotion related to horror themes for nonlinear media access and navigation. An average of 82% of the energy events, obtained from the analysis of the audio tracks of sections of four sample films, corresponded correctly to the proposed affect. While the discrimination between certain sound energy event types was low, the algorithm correctly detected 71% of the occurrences of the sound energy events within the audio tracks of the films analyzed, and thus forms a useful basis for determining affective scenes characteristic of horror in movies.

