932 results for temporal lip information
Abstract:
Investigates the use of temporal lip information, in conjunction with speech information, for robust, text-dependent speaker identification. We propose that significant speaker-dependent information can be obtained from moving lips, enabling speaker recognition systems to be highly robust in the presence of noise. The fusion structure for the audio and visual information is based around the use of multi-stream hidden Markov models (MSHMMs), with audio and visual features forming two independent data streams. Recent work with multi-modal MSHMMs has been performed successfully for the task of speech recognition. The use of temporal lip information for speaker identification has been explored previously (T.J. Wark et al., 1998); however, that work was restricted to output fusion via single-stream HMMs. We present an extension to this previous work and show that the MSHMM is a valid structure for multi-modal speaker identification.
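The stream-weighted fusion this abstract describes can be sketched as follows: in an MSHMM, each state's emission log-likelihood is a weighted sum of the per-stream log-likelihoods. The weight value and scores below are illustrative assumptions, not figures from the paper.

```python
def fused_log_likelihood(log_b_audio, log_b_visual, gamma=0.7):
    """Combine the per-stream emission log-likelihoods of one MSHMM state
    using exponent weights: gamma for audio, (1 - gamma) for visual.
    A larger gamma trusts the audio stream more (useful in low noise)."""
    return gamma * log_b_audio + (1.0 - gamma) * log_b_visual

# Two hypothetical speaker models scored against one observation:
score_a = fused_log_likelihood(-12.0, -15.0)   # speaker A
score_b = fused_log_likelihood(-20.0, -16.0)   # speaker B
# The fused score still prefers speaker A even though the visual
# stream alone is nearly indifferent between the two models.
```

In a noisy environment the weight would be shifted toward the visual stream, which is the robustness argument the abstract makes.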
Abstract:
The performance of automatic speech recognition systems deteriorates in the presence of noise. One known solution is to incorporate video information with an existing acoustic speech recognition system. We investigate the performance of the individual acoustic and visual sub-systems and then examine different ways in which the integration of the two systems may be performed. The system is to be implemented in real time on a Texas Instruments' TMS320C80 DSP.
Abstract:
Cognitive functioning is based on binding processes, by which different features and elements of neurocognition are integrated and coordinated. Binding is an essential ingredient of, for instance, Gestalt perception. We have implemented a paradigm of causality perception based on the work of Albert Michotte, in which 2 identical discs move from opposite sides of a monitor, steadily toward, and then past one another. Their coincidence generates an ambiguous percept of either "streaming" or "bouncing," which the subjects (34 schizophrenia spectrum patients and 34 controls with mean age 27.9 y) were instructed to report. The latter perception is a marker of the binding processes underlying perceived causality (type I binding). In addition to this visual task, acoustic stimuli were presented at different times during the task (150 ms before and after visual coincidence), which can modulate perceived causality. This modulation by intersensory and temporally delayed stimuli is viewed as a different type of binding (type II). We show here, using a mixed-effects hierarchical analysis, that type II binding distinguishes schizophrenia spectrum patients from healthy controls, whereas type I binding does not. Type I binding may even be excessive in some patients, especially those with positive symptoms; Type II binding, however, was generally attenuated in patients. The present findings point to ways in which the disconnection (or Gestalt) hypothesis of schizophrenia can be refined, suggesting more specific markers of neurocognitive functioning and potential targets of treatment.
Abstract:
Three experiments examined children’s and adults’ abilities to use statistical and temporal information to distinguish between common cause and causal chain structures. In Experiment 1, participants were provided with conditional probability information and/or temporal information and asked to infer the causal structure of a three-variable mechanical system that operated probabilistically. Participants of all ages preferentially relied on the temporal pattern of events in their inferences, even if this conflicted with statistical information. In Experiments 2 and 3, participants observed a series of interventions on the system, which in these experiments operated deterministically. In Experiment 2, participants found it easier to use temporal pattern information than statistical information provided as a result of interventions. In Experiment 3, in which no temporal pattern information was provided, children from 6-7 years, but not younger children, were able to use intervention information to make causal chain judgments, although they had difficulty when the structure was a common cause. The findings suggest that participants, and children in particular, may find it more difficult to use statistical information than temporal pattern information because of its demands on information processing resources. However, there may also be an inherent preference for temporal information.
Abstract:
Acoustically, car cabins are extremely noisy and as a consequence audio-only, in-car voice recognition systems perform poorly. As the visual modality is immune to acoustic noise, using the visual lip information from the driver is seen as a viable strategy for circumventing this problem through audio-visual automatic speech recognition (AVASR). However, implementing AVASR requires a system able to accurately locate and track the driver's face and lip area in real time. In this paper we present such an approach using the Viola-Jones algorithm. Using the AVICAR [1] in-car database, we show that the Viola-Jones approach is a suitable method of locating and tracking the driver's lips, despite the visual variability of illumination and head pose, for an audio-visual speech recognition system.
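As a schematic of why the Viola-Jones detector is fast enough for real-time lip tracking, the sketch below shows the attentional-cascade idea: boosted stages reject most candidate windows early, so only promising regions pay the full cost. All features, weights, and thresholds here are invented for illustration; a real system would use trained Haar-feature stages (e.g. via OpenCV's cascade classifiers).

```python
def cascade_detect(window_features, stages):
    """Schematic Viola-Jones cascade: each stage sums weak-classifier
    votes for a candidate window and rejects it as soon as one stage
    falls below its threshold. Stage contents are illustrative."""
    for weak_classifiers, stage_threshold in stages:
        score = sum(weight
                    for feature_idx, cutoff, weight in weak_classifiers
                    if window_features[feature_idx] > cutoff)
        if score < stage_threshold:
            return False  # early rejection keeps the detector fast
    return True  # window passed every stage: report a face/lip candidate

# Toy example: two stages over three Haar-like feature responses.
stages = [
    ([(0, 0.2, 1.0), (1, 0.5, 1.0)], 1.5),  # stage 1 needs both features
    ([(2, 0.1, 2.0)], 1.0),                 # stage 2: one strong feature
]
hit = cascade_detect([0.9, 0.8, 0.4], stages)   # passes both stages
miss = cascade_detect([0.9, 0.1, 0.4], stages)  # rejected at stage 1
```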
Abstract:
This paper investigates the use of lip information, in conjunction with speech information, for robust speaker verification in the presence of background noise. It has been previously shown, in our own work and in the work of others, that features extracted from a speaker's moving lips hold speaker dependencies which are complementary to speech features. We demonstrate that the fusion of lip and speech information allows for a highly robust speaker verification system which outperforms either sub-system alone. We present a new technique for determining the weighting to be applied to each modality so as to optimize the performance of the fused system. Given a correct weighting, lip information is shown to be highly effective for reducing the false acceptance and false rejection error rates in the presence of background noise.
Abstract:
Investigates the use of lip information, in conjunction with speech information, for robust speaker verification in the presence of background noise. We have previously shown (Int. Conf. on Acoustics, Speech and Signal Proc., vol. 6, pp. 3693-3696, May 1998) that features extracted from a speaker's moving lips hold speaker dependencies which are complementary to speech features. We demonstrate that the fusion of lip and speech information allows for a highly robust speaker verification system which outperforms either subsystem individually. We present a new technique for determining the weighting to be applied to each modality so as to optimize the performance of the fused system. Given a correct weighting, lip information is shown to be highly effective for reducing the false acceptance and false rejection error rates in the presence of background noise.
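The abstract does not specify the weighting technique itself; the sketch below shows one generic way such a modality weight could be chosen, by grid-searching the audio/lip weight that minimizes false accepts plus false rejects on held-out score pairs. All scores and the threshold are made-up illustrations, not the paper's method.

```python
def fuse(audio_score, lip_score, alpha):
    """Weighted-sum fusion of per-modality verification scores."""
    return alpha * audio_score + (1.0 - alpha) * lip_score

def best_weight(genuine, impostor, threshold=0.5, steps=100):
    """Grid-search the audio weight alpha that minimizes false rejects
    (genuine pairs scoring below threshold) plus false accepts
    (impostor pairs scoring at or above it). Purely illustrative."""
    best_alpha, best_errors = None, float("inf")
    for i in range(steps + 1):
        alpha = i / steps
        fr = sum(fuse(a, l, alpha) < threshold for a, l in genuine)
        fa = sum(fuse(a, l, alpha) >= threshold for a, l in impostor)
        if fa + fr < best_errors:
            best_alpha, best_errors = alpha, fa + fr
    return best_alpha, best_errors

# Hypothetical noisy-audio condition: genuine audio scores drop while
# lip scores stay high, so the search shifts weight toward the lips.
genuine = [(0.3, 0.9), (0.4, 0.8)]   # (audio, lip) score pairs
impostor = [(0.6, 0.2), (0.5, 0.1)]
alpha, errors = best_weight(genuine, impostor)
```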
Abstract:
In this paper we tackle the problem of efficient video event detection. We argue that linear detection functions should be preferred in this regard due to their scalability and efficiency during estimation and evaluation. A popular approach is to represent a sequence using a bag-of-words (BOW) representation due to (i) its fixed dimensionality irrespective of the sequence length, and (ii) its ability to compactly model the statistics of the sequence. A drawback of the BOW representation, however, is the intrinsic destruction of temporal ordering information. In this paper we propose a new representation that leverages the uncertainty in relative temporal alignments between pairs of sequences while not destroying temporal ordering. Our representation, like BOW, is of a fixed dimensionality, making it easily integrated with a linear detection function. Extensive experiments on the CK+, 6DMG, and UvA-NEMO databases show significant performance improvements across both isolated and continuous event detection tasks.
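A minimal version of the BOW encoding the abstract argues against (vector-quantize each frame to its nearest codeword, then histogram the counts) makes the loss of temporal ordering concrete. The codebook and frame vectors below are toy values.

```python
def bag_of_words(frames, codebook):
    """Quantize each per-frame feature vector to its nearest codeword
    (squared Euclidean distance) and return a normalized histogram.
    The output length is fixed at len(codebook) regardless of the
    sequence length, but the ordering of the frames is discarded."""
    hist = [0.0] * len(codebook)
    for frame in frames:
        nearest = min(range(len(codebook)),
                      key=lambda k: sum((f - c) ** 2
                                        for f, c in zip(frame, codebook[k])))
        hist[nearest] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

codebook = [(0.0, 0.0), (1.0, 1.0)]
h1 = bag_of_words([(0.1, 0.0), (0.9, 1.0)], codebook)
h2 = bag_of_words([(0.9, 1.0), (0.1, 0.0)], codebook)
# h1 == h2: reversing the sequence leaves the histogram unchanged,
# which is exactly the temporal ordering information BOW destroys.
```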
Abstract:
First-order temporal logic is a concise and powerful notation, with many potential applications in both Computer Science and Artificial Intelligence. While the full logic is highly complex, recent work on monodic first-order temporal logics has identified important enumerable and even decidable fragments. Although a complete and correct resolution-style calculus has already been suggested for this specific fragment, this calculus involves constructions too complex to be of practical value. In this paper, we develop a machine-oriented clausal resolution method which features radically simplified proof search. We first define a normal form for monodic formulae and then introduce a novel resolution calculus that can be applied to formulae in this normal form. By careful encoding, parts of the calculus can be implemented using classical first-order resolution and can, thus, be efficiently implemented. We prove correctness and completeness results for the calculus and illustrate it on a comprehensive example. An implementation of the method is briefly discussed.
Abstract:
Clusters of temporal optical solitons—stable self-localized light pulses preserving their form during propagation—exhibit properties characteristic of those encountered in crystals. Here, we introduce the concept of temporal solitonic information crystals formed by lattices of optical pulses with variable phases. The proposed general idea offers new approaches to optical coherent transmission technology and can be generalized to dispersion-managed and dissipative solitons, as well as scaled to a variety of physical platforms from fiber optics to silicon chips. We discuss the key properties of such dynamic temporal crystals, which mathematically correspond to non-Hermitian lattices, and examine the types of collective mode instabilities determining the lifetime of the soliton train. This transfer of techniques and concepts from solid state physics to information theory promises a new outlook on information storage and transmission.
Abstract:
Acoustically, vehicles are extremely noisy environments and as a consequence audio-only in-car voice recognition systems perform very poorly. As the visual modality is immune to acoustic noise, using the visual lip information from the driver is seen as a viable strategy for circumventing this problem. However, implementing such an approach requires a system able to accurately locate and track the driver's face and facial features in real time. In this paper we present such an approach using the Viola-Jones algorithm. Using this system, we present results which show that the Viola-Jones approach is a suitable method of locating and tracking the driver's lips despite the visual variability of illumination and head pose.
Abstract:
In recent years, some models have been proposed for the fault section estimation and state identification of unobserved protective relays (FSE-SIUPR) under the condition of incomplete state information of protective relays. In these models, the temporal alarm information from a faulted power system is not well explored, although it is very helpful for compensating for the incomplete state information of protective relays, quickly achieving definite fault diagnosis results, and evaluating the operating status of protective relays and circuit breakers in complicated fault scenarios. In order to solve this problem, an integrated mathematical optimization model for FSE-SIUPR, which takes full advantage of the temporal characteristics of alarm messages, is developed in the framework of the well-established temporal constraint network. With this model, the fault evolution procedure can be explained and some states of unobserved protective relays can be identified. The model is then solved by means of Tabu search (TS) and finally verified by test results of fault scenarios in a practical power system.
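The temporal-constraint-network idea can be illustrated with a minimal consistency check over alarm timestamps: each pair of causally related events must be separated by a delay inside a known interval. The events, delays, and bounds below are hypothetical, not taken from the paper.

```python
def consistent(events, constraints):
    """Check whether observed alarm timestamps satisfy the pairwise
    interval constraints of a simple temporal constraint network.
    `events` maps event name -> timestamp (ms); `constraints` maps
    (earlier_event, later_event) -> (min_delay, max_delay) in ms.
    Constraints whose events were not observed are skipped."""
    for (a, b), (lo, hi) in constraints.items():
        if a in events and b in events:
            delay = events[b] - events[a]
            if not (lo <= delay <= hi):
                return False
    return True

# Hypothetical scenario: the relay should operate 10-40 ms after the
# fault, and the breaker should trip 20-60 ms after the relay operates.
constraints = {("fault", "relay_op"): (10, 40),
               ("relay_op", "breaker_trip"): (20, 60)}
ok = consistent({"fault": 0, "relay_op": 25, "breaker_trip": 70}, constraints)
bad = consistent({"fault": 0, "relay_op": 5, "breaker_trip": 70}, constraints)
```

A diagnosis engine would test candidate fault hypotheses against the observed alarm stream in this manner, which is how temporal information compensates for missing relay states.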
Abstract:
Management of groundwater systems requires realistic conceptual hydrogeological models as a framework for numerical simulation modelling, but also for system understanding and communicating this to stakeholders and the broader community. To help overcome these challenges we developed GVS (Groundwater Visualisation System), a stand-alone desktop software package that uses interactive 3D visualisation and animation techniques. The goal was a user-friendly groundwater management tool that could support a range of existing real-world and pre-processed data, both surface and subsurface, including geology and various types of temporal hydrological information. GVS allows these data to be integrated into a single conceptual hydrogeological model. In addition, 3D geological models produced externally using other software packages, can readily be imported into GVS models, as can outputs of simulations (e.g. piezometric surfaces) produced by software such as MODFLOW or FEFLOW. Boreholes can be integrated, showing any down-hole data and properties, including screen information, intersected geology, water level data and water chemistry. Animation is used to display spatial and temporal changes, with time-series data such as rainfall, standing water levels and electrical conductivity, displaying dynamic processes. Time and space variations can be presented using a range of contouring and colour mapping techniques, in addition to interactive plots of time-series parameters. Other types of data, for example, demographics and cultural information, can also be readily incorporated. The GVS software can execute on a standard Windows or Linux-based PC with a minimum of 2 GB RAM, and the model output is easy and inexpensive to distribute, by download or via USB/DVD/CD. 
Example models are described here for three groundwater systems in Queensland, northeastern Australia: two unconfined alluvial groundwater systems with intensive irrigation, the Lockyer Valley and the upper Condamine Valley, and the Surat Basin, a large sedimentary basin of confined artesian aquifers. This latter example required more detail in the hydrostratigraphy, correlation of formations with drillholes and visualisation of simulation piezometric surfaces. Both alluvial system GVS models were developed during drought conditions to support government strategies to implement groundwater management. The Surat Basin model was industry-sponsored research, for coal seam gas groundwater management and community information and consultation. The "virtual" groundwater systems in these 3D GVS models can be interactively interrogated by standard functions, plus production of 2D cross-sections, data selection from the 3D scene, back-end database and plot displays. A unique feature is that GVS allows investigation of time-series data across different display modes, both 2D and 3D. GVS has been used successfully as a tool to enhance community/stakeholder understanding and knowledge of groundwater systems and is of value for training and educational purposes. Projects completed confirm that GVS provides a powerful support to management and decision making, and is a tool for interpretation of groundwater system hydrological processes. A highly effective visualisation output is the production of short videos (e.g. 2–5 min) based on sequences of camera 'fly-throughs' and screen images. Further work involves developing support for multi-screen displays and touch-screen technologies, distributed rendering, and gestural interaction systems. To highlight the visualisation and animation capability of the GVS software, links to related multimedia hosted online are included in the references.
Abstract:
In this paper we present a novel place recognition algorithm inspired by recent discoveries in human visual neuroscience. The algorithm combines intolerant but fast low-resolution whole-image matching with highly tolerant sub-image patch matching processes. The approach does not require prior training and works on single images (although we use a cohort normalization score to exploit temporal frame information), alleviating the need for either a velocity signal or an image sequence, differentiating it from current state-of-the-art methods. We demonstrate the algorithm on the challenging Alderley sunny day – rainy night dataset, which has only been previously solved by integrating over 320-frame-long image sequences. The system is able to achieve 21.24% recall at 100% precision, matching drastically different day and night-time images of places while successfully rejecting match hypotheses between highly aliased images of different places. The results provide a new benchmark for single-image, condition-invariant place recognition.
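The cohort normalization mentioned in passing can be sketched as a z-score: a raw match score is rescaled by the mean and spread of the scores the same query produces against a cohort of other candidate places, so a match counts only when it clearly stands out. All numbers below are illustrative assumptions, not the paper's values.

```python
def cohort_normalized_score(raw, cohort):
    """Z-score a raw image-match score against the distribution of
    scores for a cohort of competing candidate places."""
    mean = sum(cohort) / len(cohort)
    std = (sum((s - mean) ** 2 for s in cohort) / len(cohort)) ** 0.5
    return (raw - mean) / (std if std > 0 else 1.0)

cohort = [0.20, 0.30, 0.25, 0.20]               # scores vs. other places
strong = cohort_normalized_score(0.90, cohort)  # distinctive match
weak = cohort_normalized_score(0.35, cohort)    # barely above the cohort
# `strong` is many standard deviations above the cohort mean, while
# `weak` is not, so only the first would be accepted as a place match.
```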