907 resultados para Visual Speaker Recognition, Visual Speech Recognition, Cascading Appearance-Based Features
Resumo:
Sensing the mental, physical and emotional demand of a driving task is of primary importance in road safety research and for effectively designing in-vehicle information systems (IVIS). Particularly, the need of cars capable of sensing and reacting to the emotional state of the driver has been repeatedly advocated in the literature. Algorithms and sensors to identify patterns of human behavior, such as gestures, speech, eye gaze and facial expression, are becoming available by using low cost hardware: This paper presents a new system which uses surrogate measures such as facial expression (emotion) and head pose and movements (intention) to infer task difficulty in a driving situation. 11 drivers were recruited and observed in a simulated driving task that involved several pre-programmed events aimed at eliciting emotive reactions, such as being stuck behind slower vehicles, intersections and roundabouts, and potentially dangerous situations. The resulting system, combining face expressions and head pose classification, is capable of recognizing dangerous events (such as crashes and near misses) and stressful situations (e.g. intersections and way giving) that occur during the simulated drive.
Resumo:
Local spatio-temporal features with a Bag-of-visual words model is a popular approach used in human action recognition. Bag-of-features methods suffer from several challenges such as extracting appropriate appearance and motion features from videos, converting extracted features appropriate for classification and designing a suitable classification framework. In this paper we address the problem of efficiently representing the extracted features for classification to improve the overall performance. We introduce two generative supervised topic models, maximum entropy discrimination LDA (MedLDA) and class- specific simplex LDA (css-LDA), to encode the raw features suitable for discriminative SVM based classification. Unsupervised LDA models disconnect topic discovery from the classification task, hence yield poor results compared to the baseline Bag-of-words framework. On the other hand supervised LDA techniques learn the topic structure by considering the class labels and improve the recognition accuracy significantly. MedLDA maximizes likelihood and within class margins using max-margin techniques and yields a sparse highly discriminative topic structure; while in css-LDA separate class specific topics are learned instead of common set of topics across the entire dataset. In our representation first topics are learned and then each video is represented as a topic proportion vector, i.e. it can be comparable to a histogram of topics. Finally SVM classification is done on the learned topic proportion vector. We demonstrate the efficiency of the above two representation techniques through the experiments carried out in two popular datasets. Experimental results demonstrate significantly improved performance compared to the baseline Bag-of-features framework which uses kmeans to construct histogram of words from the feature vectors.
Resumo:
Place recognition has long been an incompletely solved problem in that all approaches involve significant compromises. Current methods address many but never all of the critical challenges of place recognition – viewpoint-invariance, condition-invariance and minimizing training requirements. Here we present an approach that adapts state-of-the-art object proposal techniques to identify potential landmarks within an image for place recognition. We use the astonishing power of convolutional neural network features to identify matching landmark proposals between images to perform place recognition over extreme appearance and viewpoint variations. Our system does not require any form of training, all components are generic enough to be used off-the-shelf. We present a range of challenging experiments in varied viewpoint and environmental conditions. We demonstrate superior performance to current state-of-the- art techniques. Furthermore, by building on existing and widely used recognition frameworks, this approach provides a highly compatible place recognition system with the potential for easy integration of other techniques such as object detection and semantic scene interpretation.
Resumo:
2,4,6-trinitrotoluene (TNT) is one of the most commonly used nitro aromatic explosives in landmine, military and mining industry. This article demonstrates rapid and selective identification of TNT by surface-enhanced Raman spectroscopy (SERS) using 6-aminohexanethiol (AHT) as a new recognition molecule. First, Meisenheimer complex formation between AHT and TNT is confirmed by the development of pink colour and appearance of new band around 500 nm in UV-visible spectrum. Solution Raman spectroscopy study also supported the AHT:TNT complex formation by demonstrating changes in the vibrational stretching of AHT molecule between 2800-3000 cm−1. For surface enhanced Raman spectroscopy analysis, a self-assembled monolayer (SAM) of AHT is formed over the gold nanostructure (AuNS) SERS substrate in order to selectively capture TNT onto the surface. Electrochemical desorption and X-ray photoelectron studies are performed over AHT SAM modified surface to examine the presence of free amine groups with appropriate orientation for complex formation. Further, AHT and butanethiol (BT) mixed monolayer system is explored to improve the AHT:TNT complex formation efficiency. Using a 9:1 AHT:BT mixed monolayer, a very low detection limit (LOD) of 100 fM TNT was realized. The new method delivers high selectivity towards TNT over 2,4 DNT and picric acid. Finally, real sample analysis is demonstrated by the extraction and SERS detection of 302 pM of TNT from spiked.
Resumo:
The minimum cost classifier when general cost functionsare associated with the tasks of feature measurement and classification is formulated as a decision graph which does not reject class labels at intermediate stages. Noting its complexities, a heuristic procedure to simplify this scheme to a binary decision tree is presented. The optimizationof the binary tree in this context is carried out using ynamicprogramming. This technique is applied to the voiced-unvoiced-silence classification in speech processing.
Resumo:
A galactose-specific protein (RC1) isolated from Ricinus communis beans was found to give a precipitin reaction with concanavalin A. Its carbohydrate content amounted to 8–9% of the total protein and was found to be rich in mannose. The interaction of RC1 with galactose and lactose was measured in 0.05 M phosphate buffer containing 0.2 M NaCl (pH 6.8) by the method of conventional equilibrium dialysis. From the analysis of the binding data according to Scatchard method the association constant (Ka) at 5°C was calculated as 3.8 mM−1 and 1.2 mM−1 for lactose and galactose, respectively. In both cases the number of binding sites per molecule of RC1 with molecular weight of 120000 was found to be 2. From the temperature-dependent Ka values for the binding of lactose, the values of –5.7 kcal/mol and –4.3 cal × mol−1× K−1 were calculated for ΔH and ΔS, respectively. The addition of concanavalin A to RC1 or vice versa led to the formation of the insoluble complex RC1· ConA4 containing one molecule of RC1 and one molecule of tetrameric concanavalin A (ConA4) which could be dissociated upon addition of concanavalin A-specific sugars. The complex formation results in a time-dependent appearance of turbidity in the time range from 10s to 10 min. From the measurement of the time-dependent appearance and disappearance of the turbidity the formation (kf) and dissociation (kd) rate constants were calculated as 3 mM−1× s−1 and 0.07 ks−1 respectively. The ratio kf/kd (43μM −1), that corresponds to the association constant of complex RC1· ConA4, is higher than that of mannoside · ConA4 and thereby suggests that protein-protein interaction contributes significantly in stabilising glycoprotein · lectin complexes. The relevance of this finding to the understanding of the chemical specificities that are involved in a model cell-lectin interaction is discussed.
Resumo:
Detect and Avoid (DAA) technology is widely acknowledged as a critical enabler for unsegregated Remote Piloted Aircraft (RPA) operations, particularly Beyond Visual Line of Sight (BVLOS). Image-based DAA, in the visible spectrum, is a promising technological option for addressing the challenges DAA presents. Two impediments to progress for this approach are the scarcity of available video footage to train and test algorithms, in conjunction with testing regimes and specifications which facilitate repeatable, statistically valid, performance assessment. This paper includes three key contributions undertaken to address these impediments. In the first instance, we detail our progress towards the creation of a large hybrid collision and near-collision encounter database. Second, we explore the suitability of techniques employed by the biometric research community (Speaker Verification and Language Identification), for DAA performance optimisation and assessment. These techniques include Detection Error Trade-off (DET) curves, Equal Error Rates (EER), and the Detection Cost Function (DCF). Finally, the hybrid database and the speech-based techniques are combined and employed in the assessment of a contemporary, image based DAA system. This system includes stabilisation, morphological filtering and a Hidden Markov Model (HMM) temporal filter.
Resumo:
Parallel sub-word recognition (PSWR) is a new model that has been proposed for language identification (LID) which does not need elaborate phonetic labeling of the speech data in a foreign language. The new approach performs a front-end tokenization in terms of sub-word units which are designed by automatic segmentation, segment clustering and segment HMM modeling. We develop PSWR based LID in a framework similar to the parallel phone recognition (PPR) approach in the literature. This includes a front-end tokenizer and a back-end language model, for each language to be identified. Considering various combinations of the statistical evaluation scores, it is found that PSWR can perform as well as PPR, even with broad acoustic sub-word tokenization, thus making it an efficient alternative to the PPR system.
Resumo:
Bhutani N, Ray S, Murthy A. Is saccade averaging determined by visual processing or movement planning? J Neurophysiol 108: 3161-3171, 2012. First published September 26, 2012; doi:10.1152/jn.00344.2012.-Saccadic averaging that causes subjects' gaze to land between the location of two targets when faced with simultaneously or sequentially presented stimuli has been often used as a probe to investigate the nature of computations that transform sensory representations into an oculomotor plan. Since saccadic movements involve at least two processing stages-a visual stage that selects a target and a movement stage that prepares the response-saccade averaging can either occur due to interference in visual processing or movement planning. By having human subjects perform two versions of a saccadic double-step task, in which the stimuli remained the same, but different instructions were provided (REDIRECT gaze to the later-appearing target vs. FOLLOW the sequence of targets in their order of appearance), we tested two alternative hypotheses. If saccade averaging were due to visual processing alone, the pattern of saccade averaging is expected to remain the same across task conditions. However, whereas subjects produced averaged saccades between two targets in the FOLLOW condition, they produced hypometric saccades in the direction of the initial target in the REDIRECT condition, suggesting that the interaction between competing movement plans produces saccade averaging.
Resumo:
We address the problem of multi-instrument recognition in polyphonic music signals. Individual instruments are modeled within a stochastic framework using Student's-t Mixture Models (tMMs). We impose a mixture of these instrument models on the polyphonic signal model. No a priori knowledge is assumed about the number of instruments in the polyphony. The mixture weights are estimated in a latent variable framework from the polyphonic data using an Expectation Maximization (EM) algorithm, derived for the proposed approach. The weights are shown to indicate instrument activity. The output of the algorithm is an Instrument Activity Graph (IAG), using which, it is possible to find out the instruments that are active at a given time. An average F-ratio of 0 : 7 5 is obtained for polyphonies containing 2-5 instruments, on a experimental test set of 8 instruments: clarinet, flute, guitar, harp, mandolin, piano, trombone and violin.
Resumo:
This paper discusses a novel high-speed approach for human action recognition in H. 264/AVC compressed domain. The proposed algorithm utilizes cues from quantization parameters and motion vectors extracted from the compressed video sequence for feature extraction and further classification using Support Vector Machines (SVM). The ultimate goal of our work is to portray a much faster algorithm than pixel domain counterparts, with comparable accuracy, utilizing only the sparse information from compressed video. Partial decoding rules out the complexity of full decoding, and minimizes computational load and memory usage, which can effect in reduced hardware utilization and fast recognition results. The proposed approach can handle illumination changes, scale, and appearance variations, and is robust in outdoor as well as indoor testing scenarios. We have tested our method on two benchmark action datasets and achieved more than 85% accuracy. The proposed algorithm classifies actions with speed (>2000 fps) approximately 100 times more than existing state-of-the-art pixel-domain algorithms.
Resumo:
Designing a robust algorithm for visual object tracking has been a challenging task since many years. There are trackers in the literature that are reasonably accurate for many tracking scenarios but most of them are computationally expensive. This narrows down their applicability as many tracking applications demand real time response. In this paper, we present a tracker based on random ferns. Tracking is posed as a classification problem and classification is done using ferns. We used ferns as they rely on binary features and are extremely fast at both training and classification as compared to other classification algorithms. Our experiments show that the proposed tracker performs well on some of the most challenging tracking datasets and executes much faster than one of the state-of-the-art trackers, without much difference in tracking accuracy.
Resumo:
This paper discusses a novel high-speed approach for human action recognition in H.264/AVC compressed domain. The proposed algorithm utilizes cues from quantization parameters and motion vectors extracted from the compressed video sequence for feature extraction and further classification using Support Vector Machines (SVM). The ultimate goal of the proposed work is to portray a much faster algorithm than pixel domain counterparts, with comparable accuracy, utilizing only the sparse information from compressed video. Partial decoding rules out the complexity of full decoding, and minimizes computational load and memory usage, which can result in reduced hardware utilization and faster recognition results. The proposed approach can handle illumination changes, scale, and appearance variations, and is robust to outdoor as well as indoor testing scenarios. We have evaluated the performance of the proposed method on two benchmark action datasets and achieved more than 85 % accuracy. The proposed algorithm classifies actions with speed (> 2,000 fps) approximately 100 times faster than existing state-of-the-art pixel-domain algorithms.
Resumo:
This paper describes the development of the 2003 CU-HTK large vocabulary speech recognition system for Conversational Telephone Speech (CTS). The system was designed based on a multi-pass, multi-branch structure where the output of all branches is combined using system combination. A number of advanced modelling techniques such as Speaker Adaptive Training, Heteroscedastic Linear Discriminant Analysis, Minimum Phone Error estimation and specially constructed Single Pronunciation dictionaries were employed. The effectiveness of each of these techniques and their potential contribution to the result of system combination was evaluated in the framework of a state-of-the-art LVCSR system with sophisticated adaptation. The final 2003 CU-HTK CTS system constructed from some of these models is described and its performance on the DARPA/NIST 2003 Rich Transcription (RT-03) evaluation test set is discussed.