907 resultados para Visual Speaker Recognition, Visual Speech Recognition, Cascading Appearance-Based Features


Relevância:

50.00% 50.00%

Publicador:

Resumo:

How do visual form and motion processes cooperate to compute object motion when each process separately is insufficient? A 3D FORMOTION model specifies how 3D boundary representations, which separate figures from backgrounds within cortical area V2, capture motion signals at the appropriate depths in MT; how motion signals in MT disambiguate boundaries in V2 via MT-to-Vl-to-V2 feedback; how sparse feature tracking signals are amplified; and how a spatially anisotropic motion grouping process propagates across perceptual space via MT-MST feedback to integrate feature-tracking and ambiguous motion signals to determine a global object motion percept. Simulated data include: the degree of motion coherence of rotating shapes observed through apertures, the coherent vs. element motion percepts separated in depth during the chopsticks illusion, and the rigid vs. non-rigid appearance of rotating ellipses.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

How do visual form and motion processes cooperate to compute object motion when each process separately is insufficient? Consider, for example, a deer moving behind a bush. Here the partially occluded fragments of motion signals available to an observer must be coherently grouped into the motion of a single object. A 3D FORMOTION model comprises five important functional interactions involving the brain’s form and motion systems that address such situations. Because the model’s stages are analogous to areas of the primate visual system, we refer to the stages by corresponding anatomical names. In one of these functional interactions, 3D boundary representations, in which figures are separated from their backgrounds, are formed in cortical area V2. These depth-selective V2 boundaries select motion signals at the appropriate depths in MT via V2-to-MT signals. In another, motion signals in MT disambiguate locally incomplete or ambiguous boundary signals in V2 via MT-to-V1-to-V2 feedback. The third functional property concerns resolution of the aperture problem along straight moving contours by propagating the influence of unambiguous motion signals generated at contour terminators or corners. Here, sparse “feature tracking signals” from, e.g., line ends, are amplified to overwhelm numerically superior ambiguous motion signals along line segment interiors. In the fourth, a spatially anisotropic motion grouping process takes place across perceptual space via MT-MST feedback to integrate veridical feature-tracking and ambiguous motion signals to determine a global object motion percept. The fifth property uses the MT-MST feedback loop to convey an attentional priming signal from higher brain areas back to V1 and V2. The model's use of mechanisms such as divisive normalization, endstopping, cross-orientation inhibition, and longrange cooperation is described. Simulated data include: the degree of motion coherence of rotating shapes observed through apertures, the coherent vs. element motion percepts separated in depth during the chopsticks illusion, and the rigid vs. non-rigid appearance of rotating ellipses.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Existing work in Computer Science and Electronic Engineering demonstrates that Digital Signal Processing techniques can effectively identify the presence of stress in the speech signal. These techniques use datasets containing real or actual stress samples i.e. real-life stress such as 911 calls and so on. Studies that use simulated or laboratory-induced stress have been less successful and inconsistent. Pervasive, ubiquitous computing is increasingly moving towards voice-activated and voice-controlled systems and devices. Speech recognition and speaker identification algorithms will have to improve and take emotional speech into account. Modelling the influence of stress on speech and voice is of interest to researchers from many different disciplines including security, telecommunications, psychology, speech science, forensics and Human Computer Interaction (HCI). The aim of this work is to assess the impact of moderate stress on the speech signal. In order to do this, a dataset of laboratory-induced stress is required. While attempting to build this dataset it became apparent that reliably inducing measurable stress in a controlled environment, when speech is a requirement, is a challenging task. This work focuses on the use of a variety of stressors to elicit a stress response during tasks that involve speech content. Biosignal analysis (commercial Brain Computer Interfaces, eye tracking and skin resistance) is used to verify and quantify the stress response, if any. This thesis explains the basis of the author’s hypotheses on the elicitation of affectively-toned speech and presents the results of several studies carried out throughout the PhD research period. These results show that the elicitation of stress, particularly the induction of affectively-toned speech, is not a simple matter and that many modulating factors influence the stress response process. A model is proposed to reflect the author’s hypothesis on the emotional response pathways relating to the elicitation of stress with a required speech content. Finally the author provides guidelines and recommendations for future research on speech under stress. Further research paths are identified and a roadmap for future research in this area is defined.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

The authors are concerned with the development of computer systems that are capable of using information from faces and voices to recognise people's emotions in real-life situations. The paper addresses the nature of the challenges that lie ahead, and provides an assessment of the progress that has been made in the areas of signal processing and analysis techniques (with regard to speech and face), and the psychological and linguistic analyses of emotion. Ongoing developmental work by the authors in each of these areas is described.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Handling appearance variations is a very challenging problem for visual tracking. Existing methods usually solve this problem by relying on an effective appearance model with two features: (1) being capable of discriminating the tracked target from its background, (2) being robust to the target's appearance variations during tracking. Instead of integrating the two requirements into the appearance model, in this paper, we propose a tracking method that deals with these problems separately based on sparse representation in a particle filter framework. Each target candidate defined by a particle is linearly represented by the target and background templates with an additive representation error. Discriminating the target from its background is achieved by activating the target templates or the background templates in the linear system in a competitive manner. The target's appearance variations are directly modeled as the representation error. An online algorithm is used to learn the basis functions that sparsely span the representation error. The linear system is solved via ℓ1 minimization. The candidate with the smallest reconstruction error using the target templates is selected as the tracking result. We test the proposed approach using four sequences with heavy occlusions, large pose variations, drastic illumination changes and low foreground-background contrast. The proposed approach shows excellent performance in comparison with two latest state-of-the-art trackers.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

In this paper we demonstrate a simple and novel illumination model that can be used for illumination invariant facial recognition. This model requires no prior knowledge of the illumination conditions and can be used when there is only a single training image per-person. The proposed illumination model separates the effects of illumination over a small area of the face into two components; an additive component modelling the mean illumination and a multiplicative component, modelling the variance within the facial area. Illumination invariant facial recognition is performed in a piecewise manner, by splitting the face image into blocks, then normalizing the illumination within each block based on the new lighting model. The assumptions underlying this novel lighting model have been verified on the YaleB face database. We show that magnitude 2D Fourier features can be used as robust facial descriptors within the new lighting model. Using only a single training image per-person, our new method achieves high (in most cases 100%) identification accuracy on the YaleB, extended YaleB and CMU-PIE face databases.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

For the first time in this paper we present results showing the effect of speaker head pose angle on automatic lip-reading performance over a wide range of closely spaced angles. We analyse the effect head pose has upon the features themselves and show that by selecting coefficients with minimum variance w.r.t. pose angle, recognition performance can be improved when train-test pose angles differ. Experiments are conducted using the initial phase of a unique multi view Audio-Visual database designed specifically for research and development of pose-invariant lip-reading systems. We firstly show that it is the higher order horizontal spatial frequency components that become most detrimental as the pose deviates. Secondly we assess the performance of different feature selection masks across a range of pose angles including a new mask based on Minimum Cross-Pose Variance coefficients. We report a relative improvement of 50% in Word Error Rate when using our selection mask over a common energy based selection during profile view lip-reading.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Sparse representation based visual tracking approaches have attracted increasing interests in the community in recent years. The main idea is to linearly represent each target candidate using a set of target and trivial templates while imposing a sparsity constraint onto the representation coefficients. After we obtain the coefficients using L1-norm minimization methods, the candidate with the lowest error, when it is reconstructed using only the target templates and the associated coefficients, is considered as the tracking result. In spite of promising system performance widely reported, it is unclear if the performance of these trackers can be maximised. In addition, computational complexity caused by the dimensionality of the feature space limits these algorithms in real-time applications. In this paper, we propose a real-time visual tracking method based on structurally random projection and weighted least squares techniques. In particular, to enhance the discriminative capability of the tracker, we introduce background templates to the linear representation framework. To handle appearance variations over time, we relax the sparsity constraint using a weighed least squares (WLS) method to obtain the representation coefficients. To further reduce the computational complexity, structurally random projection is used to reduce the dimensionality of the feature space while preserving the pairwise distances between the data points in the feature space. Experimental results show that the proposed approach outperforms several state-of-the-art tracking methods.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Adults' expert face recognition is limited to the kinds of faces they encounter on a daily basis (typically upright human faces of the same race). Adults process own-race faces holistically (Le., as a gestalt) and are exquisitely sensitive to small differences among faces in the spacing of features, the shape of individual features and the outline or contour of the face (Maurer, Le Grand, & Mondloch, 2002), however this expertise does not seem to extend to faces from other races. The goal of the current study was to investigate the extent to which the mechanisms that underlie expert face processing of own-race faces extend to other-race faces. Participants from rural Pennsylvania that had minimal exposure to other-race faces were tested on a battery of tasks. They were tested on a memory task, two measures of holistic processing (the composite task and the part/whole task), two measures of spatial and featural processing (the JanelLing task and the scrambledlblurred faces task) and a test of contour processing (JanelLing task) for both own-and other-race faces. No study to date has tested the same participants on all of these tasks. Participants had minimal experience with other-race faces; they had no Chinese family members, friends or had ever traveled to an Asian country. Results from the memory task did not reveal an other-race effect. In the present study, participants also demonstrated holistic processing of both own- and other-race faces on both the composite task and the part/whole task. These findings contradict previous findings that Caucasian adults process own-race faces more holistically than other-race faces. However participants did demonstrate an own-race advantage for processing the spacing among features, consistent with two recent studies that used different manipulations of spacing cues (Hayward et al. 2007; Rhodes et al. 2006). They also demonstrated an other-race effect for the processing of individual features for the Jane/Ling task (a direct measure of featural processing) consistent with previous findings (Rhodes, Hayward, & Winkler, 2006), but not for the scrambled faces task (an indirect measure offeatural processing). There was no own-race advantage for contour processing. Thus, these results lead to the conclusion that individuals may show less sensitivity to the appearance of individual features and the spacing among them in other-race faces, despite processing other-race faces holistically.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

A primary medium for the human beings to communicate through language is Speech. Automatic Speech Recognition is wide spread today. Recognizing single digits is vital to a number of applications such as voice dialling of telephone numbers, automatic data entry, credit card entry, PIN (personal identification number) entry, entry of access codes for transactions, etc. In this paper we present a comparative study of SVM (Support Vector Machine) and HMM (Hidden Markov Model) to recognize and identify the digits used in Malayalam speech.