66 results for Visual Speaker Recognition, Visual Speech Recognition, Cascading Appearance-Based Features


Relevance:

100.00%

Publisher:

Abstract:

In this paper, a novel video-based multimodal biometric verification scheme using subspace-based low-level feature fusion of face and speech is developed for specific speaker recognition in perceptual human-computer interaction (HCI). In the proposed scheme, the human face is tracked and face pose is estimated to weight the detected face-like regions in successive frames, where ill-posed faces and false-positive detections are assigned lower credit to enhance accuracy. In the audio modality, mel-frequency cepstral coefficients are extracted for voice-based biometric verification. In the fusion step, features from both modalities are projected into a nonlinear Laplacian Eigenmap subspace for multimodal speaker recognition and combined at low level. The proposed approach is tested on a video database of ten human subjects, and the results show that the proposed scheme attains better accuracy than both conventional multimodal fusion using latent semantic analysis and the single-modality verifications. Experiments in MATLAB show the potential of the proposed scheme to attain real-time performance for perceptual HCI applications.
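
As a rough illustration of this kind of low-level fusion, the Python sketch below weights hypothetical per-frame face features by pose confidence, concatenates them with MFCCs, and projects the result into a nonlinear subspace using scikit-learn's SpectralEmbedding (a Laplacian Eigenmap implementation); the array shapes, weights and parameters are illustrative assumptions, not the paper's actual configuration.

import numpy as np
from sklearn.manifold import SpectralEmbedding

# Hypothetical per-frame features, already aligned frame by frame.
face_feats = np.random.rand(200, 64)    # 200 frames x 64-D face descriptor
mfcc_feats = np.random.rand(200, 13)    # 200 frames x 13 MFCCs
pose_weights = np.random.rand(200)      # lower credit for ill-posed faces

# Low-level fusion: weight and concatenate the modalities, then project
# into a nonlinear Laplacian Eigenmap subspace.
fused = np.hstack([face_feats * pose_weights[:, None], mfcc_feats])
fused_subspace = SpectralEmbedding(n_components=10, n_neighbors=15).fit_transform(fused)

# A verification decision could then compare statistics of fused_subspace
# (e.g. its mean vector) against enrolled speaker templates.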

Relevance:

100.00%

Publisher:

Abstract:

This paper presents a novel method of audio-visual feature-level fusion for person identification where both the speech and facial modalities may be corrupted, and there is a lack of prior knowledge about the corruption. Furthermore, we assume there is a limited amount of training data for each modality (e.g., a short training speech segment and a single training facial image for each person). A new multimodal feature representation and a modified cosine similarity are introduced to combine and compare bimodal features with limited training data, as well as vastly differing data rates and feature sizes. Optimal feature selection and multicondition training are used to reduce the mismatch between training and testing, thereby making the system robust to unknown bimodal corruption. Experiments have been carried out on a bimodal dataset created from the SPIDRE speaker recognition database and the AR face recognition database with variable noise corruption of speech and occlusion in the face images. The system's speaker identification performance on the SPIDRE database and facial identification performance on the AR database are comparable with the literature. Combining both modalities using the new method of multimodal fusion leads to significantly improved accuracy over the unimodal systems, even when both modalities have been corrupted. The new method also shows improved identification accuracy compared with bimodal systems based on multicondition model training or missing-feature decoding alone.
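
A minimal sketch of how two modalities with very different feature sizes might be combined with a cosine-style score; this is a plain weighted cosine similarity with per-modality normalisation, not the paper's exact "modified cosine similarity", and all vectors and dimensions are hypothetical.

import numpy as np

def bimodal_cosine_score(speech_vec, face_vec, speech_ref, face_ref, w=0.5):
    # Each modality is compared separately so that differing feature sizes
    # and dynamic ranges do not dominate, then the scores are mixed by w.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return w * cos(speech_vec, speech_ref) + (1.0 - w) * cos(face_vec, face_ref)

# Hypothetical test and enrolled feature vectors (speech: 600-D, face: 4096-D).
score = bimodal_cosine_score(np.random.rand(600), np.random.rand(4096),
                             np.random.rand(600), np.random.rand(4096))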

Relevance:

100.00%

Publisher:

Abstract:

For many applications of emotion recognition, such as virtual agents, the system must select responses while the user is speaking. This requires reliable on-line recognition of the user's affect. However, most emotion recognition systems are based on turn-wise processing. We present a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a two-dimensional valence-activation continuum. In contrast to current state-of-the-art approaches, recognition is performed on low-level signal frames, similar to those used for speech recognition. No statistical functionals are applied to low-level feature contours. Framing at a higher level is therefore unnecessary, and regression outputs can be produced in real time for every low-level input frame. We also investigate the benefits of including linguistic features on the signal-frame level obtained by a keyword spotter.
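
The sketch below is a minimal frame-wise LSTM regressor in PyTorch, an assumed architecture only (layer sizes, feature dimension and framework are not taken from the paper): it maps every low-level input frame to a two-dimensional (valence, activation) output, so no turn-level framing or statistical functionals are needed.

import torch
import torch.nn as nn

class FramewiseEmotionLSTM(nn.Module):
    def __init__(self, feat_dim=39, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)   # valence, activation

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return self.out(h)                # (batch, frames, 2): one output per frame

model = FramewiseEmotionLSTM()
frames = torch.randn(1, 500, 39)          # e.g. 500 low-level signal frames
valence_activation = model(frames)        # frame-by-frame regression output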

Relevance:

100.00%

Publisher:

Abstract:

Temporal dynamics and speaker characteristics are two important features of speech that distinguish speech from noise. In this paper, we propose a method to maximally extract these two features of speech for speech enhancement. We demonstrate that this can reduce the requirement for prior information about the noise, which can be difficult to estimate for fast-varying noise. Given noisy speech, the new approach estimates clean speech by recognizing long segments of the clean speech as whole units. In the recognition, clean speech sentences, taken from a speech corpus, are used as examples. Matching segments are identified between the noisy sentence and the corpus sentences. The estimate is formed by using the longest matching segments found in the corpus sentences. Longer speech segments as whole units contain more distinct dynamics and richer speaker characteristics, and can be identified more accurately from noise than shorter speech segments. Therefore, estimation based on the longest recognized segments increases the noise immunity and hence the estimation accuracy. The new approach consists of a statistical model to represent up to sentence-long temporal dynamics in the corpus speech, and an algorithm to identify the longest matching segments between the noisy sentence and the corpus sentences. The algorithm is made more robust to noise uncertainty by introducing missing-feature based noise compensation into the corpus sentences. Experiments have been conducted on the TIMIT database for speech enhancement from various types of nonstationary noise including song, music, and crosstalk speech. The new approach has shown improved performance over conventional enhancement algorithms in both objective and subjective evaluations.
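
As a simplified stand-in for the matching step (the paper uses a statistical model with missing-feature noise compensation, which is not reproduced here), the sketch below finds the longest common contiguous run between a noisy sentence and one corpus sentence, assuming both have been quantised to frame labels such as vector-quantisation indices.

import numpy as np

def longest_matching_segment(noisy_labels, corpus_labels):
    # Dynamic programming for the longest common contiguous run.
    # Returns (length, start_in_noisy, start_in_corpus).
    n, m = len(noisy_labels), len(corpus_labels)
    dp = np.zeros((n + 1, m + 1), dtype=int)
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if noisy_labels[i - 1] == corpus_labels[j - 1]:
                dp[i, j] = dp[i - 1, j - 1] + 1
                if dp[i, j] > best[0]:
                    best = (dp[i, j], i - dp[i, j], j - dp[i, j])
    return best

# Hypothetical quantised frame labels for a noisy sentence and a corpus sentence.
noisy = np.random.randint(0, 64, size=300)
corpus_sentence = np.random.randint(0, 64, size=400)
length, i0, j0 = longest_matching_segment(noisy, corpus_sentence)
# The clean estimate for noisy frames i0:i0+length would then be drawn from
# corpus frames j0:j0+length, with shorter matches filling the remaining gaps.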

Relevance:

100.00%

Publisher:

Abstract:

Automatic gender classification has many security and commercial applications. Various modalities have been investigated for gender classification, with face-based classification being the most popular. In some real-world scenarios the face may be partially occluded. In these circumstances a classification based on individual parts of the face, known as local features, must be adopted. We investigate gender classification using lip movements. We show for the first time that important gender-specific information can be obtained from the way in which a person moves their lips during speech. Furthermore, our study indicates that the lip dynamics during speech provide greater gender-discriminative information than lip appearance alone. We also show that the lip dynamics and appearance contain complementary gender information, such that a model which captures both traits gives the highest overall classification result. We use Discrete Cosine Transform based features and Gaussian Mixture Modelling to model lip appearance and dynamics and employ the XM2VTS database for our experiments. Our experiments show that a model which captures lip dynamics along with appearance can improve gender classification rates by 16-21% compared to models of lip appearance alone.
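
A minimal sketch of the kind of DCT-plus-GMM modelling described, with illustrative image sizes, coefficient counts and mixture orders (none taken from the paper): low-frequency 2-D DCT coefficients capture lip appearance, frame-to-frame deltas capture dynamics, and one Gaussian mixture per gender scores a test sequence.

import numpy as np
from scipy.fftpack import dct
from sklearn.mixture import GaussianMixture

def lip_features(lip_frames, n_coeffs=30):
    # Static features: low-frequency 2-D DCT coefficients of each lip image.
    static = []
    for frame in lip_frames:
        c = dct(dct(frame, axis=0, norm='ortho'), axis=1, norm='ortho')
        static.append(c[:6, :5].ravel()[:n_coeffs])
    static = np.array(static)
    deltas = np.diff(static, axis=0, prepend=static[:1])   # lip dynamics
    return np.hstack([static, deltas])

# Hypothetical training sequences of grey-level lip regions (frames x 32 x 48).
male_gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(
    lip_features(np.random.rand(200, 32, 48)))
female_gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(
    lip_features(np.random.rand(200, 32, 48)))

test = lip_features(np.random.rand(50, 32, 48))
predicted = 'male' if male_gmm.score(test) > female_gmm.score(test) else 'female'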

Relevance:

100.00%

Publisher:

Abstract:

We propose a novel skeleton-based approach to gait recognition using our Skeleton Variance Image. The core of our approach consists of employing the screened Poisson equation to construct a family of smooth distance functions associated with a given shape. The screened Poisson distance function approximation nicely absorbs and is relatively stable to shape boundary perturbations, which allows us to define a rough shape skeleton. We demonstrate how our Skeleton Variance Image is a powerful gait cycle descriptor, leading to a significant improvement over the existing state-of-the-art gait recognition rate.
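
The sketch below illustrates only the descriptor idea, computing a per-pixel variance of skeleton maps across one gait cycle; a morphological skeleton from scikit-image stands in for the paper's screened-Poisson-based skeleton, and the silhouette sizes are assumptions.

import numpy as np
from skimage.morphology import skeletonize

def skeleton_variance_image(silhouettes):
    # Per-pixel variance of the frame skeletons over one gait cycle.
    skeletons = np.array([skeletonize(s > 0).astype(float) for s in silhouettes])
    return skeletons.var(axis=0)

# Hypothetical gait cycle: a stack of 30 binary silhouette frames (128 x 88).
cycle = np.random.rand(30, 128, 88) > 0.7
svi = skeleton_variance_image(cycle)   # one descriptor image per gait cycle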

Relevance:

100.00%

Publisher:

Abstract:

Purpose. To evaluate the intrafamilial phenotypic variation in Stargardt macular dystrophy-Fundus flavimaculatus (SMD-FFM).

Methods. Thirty-one siblings from 15 families with SMD-FFM were examined. Age of onset, visual acuity, and clinical features on fundus examination and fundus autofluorescence images, including presence or absence of central and peripheral atrophy and distribution of flecks, were recorded. In addition, electrophysiological studies were undertaken.

Results. Large differences between siblings in age of onset (median, 12 years; range, 5-23 years) were observed in six of the 15 families studied, whereas in 9 families differences in age of onset between siblings were small (median, 1 year; range, 0-3 years). Visual acuity varied two or more lines among siblings in nine families. In 10 families (67%) siblings were found to have different clinical appearance on fundus examination and fundus autofluorescence images, whereas in 5 families (33%), affected siblings had similar clinical features. Electrodiagnostic tests were performed on affected members of 12 families and disclosed similar qualitative findings among siblings. In nine families there was loss of central function only; in two, global loss of cone function; and in one, global loss of cone and rod function.

Conclusions. In this series, although differences in age of onset, visual acuity, and fundus appearance were observed between siblings, electrophysiological studies demonstrated intrafamilial homogeneity in retinal function. The findings are difficult to reconcile with expression studies showing ABCR transcripts in rod photoreceptors but not in cones.

Relevance:

100.00%

Publisher:

Abstract:

PURPOSE:

To investigate the heritability of intraocular pressure (IOP) and cup-to-disc ratio (CDR) in an older well-defined population.

DESIGN:

Family-based cohort study.

PARTICIPANTS:

Through the population-based Salisbury Eye Evaluation study, we recruited 726 siblings (mean age, 74.7 years) in 284 sibships.

METHODS:

Intraocular pressure and CDR were measured bilaterally for all participants. The presence or absence of glaucoma was determined by a glaucoma specialist for all probands on the basis of visual field, optic nerve appearance, and history. The heritability of IOP was calculated as twice the residual between-sibling correlation of IOP using linear regression and generalized estimating equations after adjusting for age, gender, mean arterial pressure, race, self-reported diabetes status, and history of systemic steroid use. The heritability of CDR was calculated using the same model and adjustments as above, while also adjusting for IOP.
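
A simplified numerical sketch of the stated estimator (twice the residual between-sibling correlation), using ordinary least squares residuals and a Pearson correlation over sibling pairs rather than the study's generalized estimating equations; the file and column names are hypothetical.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per participant with a sibship identifier.
df = pd.read_csv("see_siblings.csv")   # sibship_id, person_id, iop, age, gender, map, race, diabetes, steroids

# Residualise IOP on the adjustment covariates.
df["resid"] = smf.ols("iop ~ age + gender + map + race + diabetes + steroids", data=df).fit().resid

# Pair up siblings within each sibship and correlate their residuals.
pairs = df.merge(df, on="sibship_id", suffixes=("_1", "_2"))
pairs = pairs[pairs["person_id_1"] < pairs["person_id_2"]]
r = np.corrcoef(pairs["resid_1"], pairs["resid_2"])[0, 1]

h2_iop = 2 * r   # heritability estimate of IOP; CDR is handled the same way, additionally adjusting for IOP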

MAIN OUTCOME MEASURES:

Heritability and determinants of IOP and CDR, and impact of siblings' glaucoma status on IOP and CDR.

RESULTS:

We estimated the heritability to be 0.29 (95% confidence interval [CI], 0.12-0.46) for IOP and 0.56 (95% CI, 0.35-0.76) for CDR in this population. Mean IOP in siblings of glaucomatous probands was statistically significantly higher than in siblings of normal probands (mean difference, 1.02 mmHg; P = 0.017). The mean CDR in siblings of glaucomatous probands was 0.07 (or 19%) larger than in siblings of glaucoma suspect referrals (P = 0.045) and siblings of normal probands (P = 0.004).

CONCLUSIONS:

In this elderly population, we found CDR to be highly heritable and IOP to be moderately heritable. On average, siblings of glaucoma patients had higher IOPs and larger CDRs than siblings of nonglaucomatous probands.

Relevance:

100.00%

Publisher:

Abstract:

The grading of crushed aggregate is usually carried out by sieving. We describe a new image-based approach to the automatic grading of such materials. The operational problem addressed is where the camera is located directly over a conveyor belt. Our approach characterizes the information content of each image, taking into account relative variation in the pixel data and resolution scale. In feature space, we find very good class separation using a multidimensional linear classifier. The innovation in this work includes (i) introducing an effective image-based approach into this application area, and (ii) our supervised classification using wavelet entropy-based features.
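
As a hedged sketch of wavelet entropy-based features feeding a linear classifier (the wavelet family, decomposition depth and classifier choice here are assumptions, not the paper's settings), each image is decomposed with PyWavelets, the Shannon entropy of the normalised coefficient energies is taken per scale, and linear discriminant analysis separates the grading classes.

import numpy as np
import pywt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def wavelet_entropy_features(image, wavelet="db2", levels=4):
    # Shannon entropy of normalised wavelet-coefficient energies at each scale.
    coeffs = pywt.wavedec2(image, wavelet, level=levels)
    feats = []
    for detail in coeffs[1:]:                        # (cH, cV, cD) per scale
        energy = np.concatenate([np.abs(c).ravel() ** 2 for c in detail])
        p = energy / (energy.sum() + 1e-12)
        feats.append(-np.sum(p * np.log2(p + 1e-12)))
    return np.array(feats)

# Hypothetical labelled conveyor-belt images, ten per grading class.
X = np.array([wavelet_entropy_features(np.random.rand(256, 256)) for _ in range(40)])
y = np.repeat([0, 1, 2, 3], 10)                      # four grading classes
clf = LinearDiscriminantAnalysis().fit(X, y)         # multidimensional linear classifier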

Relevance:

100.00%

Publisher:

Abstract:

The contemporary dominance of visuality has turned our understanding of space into a mode of unidirectional experience that externalizes the other sensual capacities of the body while perceiving the built environment. This affects not only architectural practice but also architectural education, where an introduction to the concept of space is often challenging, especially for students who have limited spatial and sensual training. Considering that an architectural work is not perceived as a series of retinal pictures but as a repeated multi-sensory experience, the problem definitions in the design studio need to be disengaged from the dominance of a ‘focused vision’ and re-constructed in a holistic manner. One method to address this is to enable the students to refer to their own sensual experiences of the built environment as a part of their design processes. This paper focuses on a particular approach to second-year architectural design teaching which has been followed in the Department of Architecture at Izmir University of Economics for the last three years. The very first architectural project of the studio and the program, entitled ‘Sensing Spaces’, is conducted as a multi-staged design process including ‘sense games, analyses of organs and their interpretations into space’. The objectives of this four-week project are to explore the sense of space through the design of a three-dimensional assembly, to create an awareness of the significance of the senses in the design process and to experiment with re-interpreted forms of bodily parts. Hence, the students are encouraged to explore architectural space through their ‘tactile, olfactory, auditory, gustatory and visual stimuli’. In this paper, based on a series of examples, architectural space is examined beyond its boundaries of structure, form and function, and spatial design is considered as an activity of re-constructing the built environment through the awareness of bodily senses.

Relevance:

100.00%

Publisher:

Abstract:

The World Health Organization estimates that 13 million children aged 5-15 years worldwide are visually impaired from uncorrected refractive error. School vision screening programs can identify and treat or refer children with refractive error. We concentrate on the findings of various screening studies and attempt to identify key factors in the success and sustainability of such programs in the developing world. We reviewed original and review articles describing children's vision and refractive error screening programs published in English and listed in PubMed, Medline OVID, Google Scholar, and Oxford University Electronic Resources databases. Data were abstracted on study objective, design, setting, participants, and outcomes, including accuracy of screening, quality of refractive services, barriers to uptake, impact on quality of life, and cost-effectiveness of programs. Inadequately corrected refractive error is an important global cause of visual impairment in childhood. School-based vision screening carried out by teachers and other ancillary personnel may be an effective means of detecting affected children and improving their visual function with spectacles. The need for services and potential impact of school-based programs varies widely between areas, depending on prevalence of refractive error and competing conditions and rates of school attendance. Barriers to acceptance of services include the cost and quality of available refractive care and mistaken beliefs that glasses will harm children's eyes. Further research is needed in areas such as the cost-effectiveness of different screening approaches and impact of education to promote acceptance of spectacle-wear. School vision programs should be integrated into comprehensive efforts to promote healthy children and their families.

Relevance:

80.00%

Publisher:

Abstract:

In this work, we propose a biologically inspired appearance model for robust visual tracking. Motivated in part by the success of the hierarchical organization of the primary visual cortex (area V1), we establish an architecture consisting of five layers: whitening, rectification, normalization, coding and pooling. The first three layers stem from models developed for object recognition. In this paper, our attention focuses on the coding and pooling layers. In particular, we use a discriminative sparse coding method in the coding layer along with a spatial pyramid representation in the pooling layer, which makes it easier to distinguish the target to be tracked from its background in the presence of appearance variations. An extensive experimental study shows that the proposed method has higher tracking accuracy than several state-of-the-art trackers.
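
A minimal sketch of the coding and pooling idea, not the authors' implementation: local patch features from the tracked region are sparse-coded against a dictionary (scikit-learn's sparse_encode) and max-pooled over a spatial pyramid; the dictionary, patch geometry and pyramid levels are all illustrative assumptions.

import numpy as np
from sklearn.decomposition import sparse_encode

def spatial_pyramid_pool(codes, positions, grid=(32, 32), levels=(1, 2, 4)):
    # Max-pool the sparse codes inside each cell of a 1x1, 2x2 and 4x4 pyramid.
    pooled = []
    for L in levels:
        for gy in range(L):
            for gx in range(L):
                mask = ((positions[:, 0] * L // grid[0]) == gy) & \
                       ((positions[:, 1] * L // grid[1]) == gx)
                cell = codes[mask]
                pooled.append(cell.max(axis=0) if len(cell) else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)

# Hypothetical dictionary of 128 atoms and 200 local 64-D patch features.
dictionary = np.random.rand(128, 64)
patches = np.random.rand(200, 64)
positions = np.random.randint(0, 32, size=(200, 2))   # patch coordinates on a 32x32 grid
codes = sparse_encode(patches, dictionary, algorithm="lasso_lars", alpha=0.1)
descriptor = spatial_pyramid_pool(np.abs(codes), positions)
# The tracker would compare such descriptors for candidate windows against
# the target template to separate the target from its background.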

Relevance:

70.00%

Publisher:

Abstract:

A scalable, large-vocabulary, speaker-independent speech recognition system is being developed using Hidden Markov Models (HMMs) for acoustic modeling and a Weighted Finite State Transducer (WFST) to compile sentence, word, and phoneme models. The system comprises a software backend search and an FPGA-based Gaussian calculation, which are covered here. In this paper, we present an efficient pipelined design implemented both as an embedded peripheral and as a scalable, parallel hardware accelerator. Both architectures have been implemented on an Alpha Data XRC-5T1 reconfigurable computer housing a Virtex 5 SX95T FPGA. The core has been tested and is capable of calculating a full set of Gaussian results from 3825 acoustic models in 9.03 ms, which, coupled with a backend search of 5000 words, has provided an accuracy of over 80%. Parallel implementations have been designed with up to 32 cores and have been successfully implemented at a clock frequency of 133 MHz.
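
For reference, the sketch below is a plain NumPy version of the kind of computation the FPGA core accelerates: log-likelihoods of one feature vector under a bank of diagonal-covariance Gaussians. The feature dimension and the random parameters are assumptions; only the model count (3825) comes from the text.

import numpy as np

N_MODELS, DIM = 3825, 39                      # 3825 acoustic models; 39-D features assumed
means = np.random.randn(N_MODELS, DIM)
variances = np.random.rand(N_MODELS, DIM) + 0.1
log_const = -0.5 * (DIM * np.log(2 * np.pi) + np.log(variances).sum(axis=1))

def gaussian_log_likelihoods(x):
    # Vectorised log N(x; mu_k, diag(var_k)) for every acoustic model k.
    diff = x - means
    return log_const - 0.5 * np.sum(diff * diff / variances, axis=1)

scores = gaussian_log_likelihoods(np.random.randn(DIM))
# The backend WFST search then combines these scores with language-model costs.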