918 results for Visual Word-recognition


Relevance: 90.00%

Abstract:

This thesis addresses the problem of detecting and describing the same scene points in different wide-angle images taken by the same camera at different viewpoints. This is a core competency of many vision-based localisation tasks including visual odometry and visual place recognition. Wide-angle cameras have a large field of view that can exceed a full hemisphere, and the images they produce contain severe radial distortion. When compared to traditional narrow field of view perspective cameras, more accurate estimates of camera egomotion can be found using the images obtained with wide-angle cameras. The ability to accurately estimate camera egomotion is a fundamental primitive of visual odometry, and this is one of the reasons for the increased popularity in the use of wide-angle cameras for this task. Their large field of view also enables them to capture images of the same regions in a scene taken at very different viewpoints, and this makes them suited for visual place recognition. However, the ability to estimate the camera egomotion and recognise the same scene in two different images is dependent on the ability to reliably detect and describe the same scene points, or ‘keypoints’, in the images. Most algorithms used for this purpose are designed almost exclusively for perspective images. Applying algorithms designed for perspective images directly to wide-angle images is problematic as no account is made for the image distortion. The primary contribution of this thesis is the development of two novel keypoint detectors, and a method of keypoint description, designed for wide-angle images. Both reformulate the Scale-Invariant Feature Transform (SIFT) as an image processing operation on the sphere. As the image captured by any central projection wide-angle camera can be mapped to the sphere, applying these variants to an image on the sphere enables keypoints to be detected in a manner that is invariant to image distortion. 
Each of the variants is required to find the scale-space representation of an image on the sphere, and they differ in the approaches they use to do this. Extensive experiments using real and synthetically generated wide-angle images are used to validate the two new keypoint detectors and the method of keypoint description. The better of these two new keypoint detectors is applied to vision-based localisation tasks including visual odometry and visual place recognition using outdoor wide-angle image sequences. As part of this work, the effect of keypoint coordinate selection on the accuracy of egomotion estimates using the Direct Linear Transform (DLT) is investigated, and a simple weighting scheme is proposed which attempts to account for the uncertainty of keypoint positions during detection. A word reliability metric is also developed for use within a visual ‘bag of words’ approach to place recognition.

Relevance: 90.00%

Abstract:

The cascading appearance-based (CAB) feature extraction technique has established itself as the state of the art in extracting dynamic visual speech features for speech recognition. In this paper, we focus on investigating the effectiveness of this technique for the related speaker verification application. By investigating the speaker verification ability of each stage of the cascade, we demonstrate that the same steps taken to reduce static speaker and environmental information for the visual speech recognition application also provide similar improvements for visual speaker recognition. A further study compares synchronous HMM (SHMM) based fusion of CAB visual features and traditional perceptual linear predictive (PLP) acoustic features, and shows that the higher complexity inherent in the SHMM approach does not appear to provide any improvement in the final audio-visual speaker verification system over simpler utterance-level score fusion.

Relevance: 90.00%

Abstract:

Voice recognition is one of the key enablers to reduce driver distraction as in-vehicle systems become more and more complex. With the integration of voice recognition in vehicles, safety and usability are improved as the driver’s eyes and hands are not required to operate system controls. Whilst speaker-independent voice recognition is well developed, performance in high-noise environments (e.g. vehicles) is still limited. La Trobe University and Queensland University of Technology have developed a low-cost hardware-based speech enhancement system for automotive environments based on spectral subtraction and delay–sum beamforming techniques. The enhancement algorithms have been optimised using authentic Australian English collected under typical driving conditions. Performance tests conducted using speech data collected under a variety of vehicle noise conditions demonstrate a word recognition rate improvement in the order of 10% or more under the noisiest conditions. The system is currently developed to a proof-of-concept stage, and there is potential for even greater performance improvement.
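
For readers unfamiliar with delay–sum beamforming, the following is a minimal sketch of the general idea only, not the hardware system described in the abstract. It assumes that the per-microphone delays are already known and are whole numbers of samples; the function name `delay_and_sum` and the toy signals are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its known integer delay and average.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   list of non-negative integer delays (in samples), one per channel.
    Assumption: delays are known in advance; a real system would estimate them.
    """
    # Only keep the samples for which every delayed channel has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    n = len(channels)
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / n
        for t in range(length)
    ]


# Toy usage: the second microphone hears the same signal one sample later.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
mic0 = source
mic1 = [0.0] + source
enhanced = delay_and_sum([mic0, mic1], delays=[0, 1])
```

Averaging the aligned channels keeps the target signal at full amplitude, while noise that is uncorrelated between microphones is attenuated.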

Relevance: 90.00%

Abstract:

Micro aerial vehicles (MAVs) are a rapidly growing area of research and development in robotics. For autonomous robot operations, localization has typically been calculated using GPS, external camera arrays, or onboard range or vision sensing. In cluttered indoor or outdoor environments, onboard sensing is the only viable option. In this paper we present an appearance-based approach to visual SLAM on a flying MAV using only low-quality vision. Our approach consists of a visual place recognition algorithm that operates on 1000-pixel images, a lightweight visual odometry algorithm, and a visual expectation algorithm that improves the recall of place sequences and the precision with which they are recalled as the robot flies along a similar path. Using data gathered from outdoor datasets, we show that the system is able to perform visual recognition with low-quality, intermittent visual sensory data. By combining the visual algorithms with the RatSLAM system, we also demonstrate how the algorithms enable successful SLAM.

Relevance: 90.00%

Abstract:

The performance of visual speech recognition (VSR) systems is significantly influenced by the accuracy of the visual front-end. The current state-of-the-art VSR systems use off-the-shelf face detectors such as Viola-Jones (VJ), which have limited reliability under changes in illumination and head pose. For a VSR system to perform well under these conditions, an accurate visual front-end is required. This is an important problem to be solved in many practical implementations of audio-visual speech recognition systems, for example in automotive environments for an efficient human-vehicle computer interface. In this paper, we re-examine the current state-of-the-art VSR by comparing off-the-shelf face detectors with the recently developed Fourier Lucas-Kanade (FLK) image alignment technique. A variety of image alignment and visual speech recognition experiments are performed on a clean dataset as well as on a challenging automotive audio-visual speech dataset. Our results indicate that the FLK image alignment technique can significantly outperform off-the-shelf face detectors, but requires frequent fine-tuning.

Relevance: 90.00%

Abstract:

Spoken term detection (STD) is the task of looking up a spoken term in a large volume of speech segments. In order to provide fast search, speech segments are first indexed into an intermediate representation using speech recognition engines which provide multiple hypotheses for each speech segment. Approximate matching techniques are usually applied at the search stage to compensate for the poor performance of automatic speech recognition engines during indexing. Recently, using visual information in addition to audio information has been shown to improve phone recognition performance, particularly in noisy environments. In this paper, we make use of visual information in the form of the speaker's lip movements in the indexing stage and investigate its effect on STD performance. In particular, we investigate whether gains in phone recognition accuracy carry through the approximate matching stage to provide similar gains in the final audio-visual STD system over a traditional audio-only approach. We also investigate the effect of using visual information on STD performance in different noise environments.

Relevance: 90.00%

Abstract:

We are addressing the problem of jointly using multiple noisy speech patterns for automatic speech recognition (ASR), given that they come from the same class. If the user utters a word K times, the ASR system should try to use the information content in all K patterns of the word simultaneously and improve its speech recognition accuracy compared to that of single-pattern speech recognition. To address this problem, we recently proposed a Multi Pattern Dynamic Time Warping (MPDTW) algorithm to align the K patterns by finding the least distortion path between them. A Constrained Multi Pattern Viterbi algorithm was used on this aligned path for isolated word recognition (IWR). In this paper, we explore the possibility of using only the MPDTW algorithm for IWR. We also study the properties of the MPDTW algorithm. We show that using only 2 noisy test patterns (10 percent burst noise at -5 dB SNR) reduces the noisy speech recognition error rate by 37.66 percent when compared to single-pattern recognition using the Dynamic Time Warping algorithm.
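
As background for the single-pattern baseline mentioned above, here is a minimal sketch of classic two-sequence Dynamic Time Warping used as a nearest-template isolated word recognizer. It is not the MPDTW algorithm from the abstract, which aligns K patterns jointly. The scalar features, the absolute-difference local cost, and the names `dtw_distance` and `recognize` are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Least accumulated distortion between two scalar sequences (classic DTW)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distortion aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(test, templates):
    """Return the label of the reference template closest to `test` under DTW."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))


# Toy usage with made-up one-dimensional "feature" sequences.
templates = {"yes": [1.0, 2.0, 3.0], "no": [3.0, 1.0, 0.0]}
label = recognize([1.0, 2.0, 2.0, 3.0], templates)
```

In a real recognizer each frame would be a feature vector (for example cepstral coefficients) and the local cost a vector distance; the recursion itself is unchanged.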

Relevance: 90.00%

Abstract:

We are addressing the novel problem of jointly evaluating multiple speech patterns for automatic speech recognition and training. We propose solutions based on both the non-parametric dynamic time warping (DTW) algorithm and the parametric hidden Markov model (HMM). We show that a hybrid approach is quite effective for the application of noisy speech recognition. We extend the concept to HMM training wherein some patterns may be noisy or distorted. Utilizing the concept of a “virtual pattern” developed for joint evaluation, we propose selective iterative training of HMMs. When these algorithms are evaluated on burst/transient noisy speech and isolated word recognition, a significant improvement in recognition accuracy is obtained over algorithms that do not utilize the joint evaluation strategy.

Relevance: 90.00%

Abstract:

We are addressing a new problem of improving automatic speech recognition performance, given multiple utterances of patterns from the same class. We have formulated the problem of jointly decoding K multiple patterns given a single Hidden Markov Model. It is shown that such a solution is possible by aligning the K patterns using the proposed Multi Pattern Dynamic Time Warping algorithm followed by the Constrained Multi Pattern Viterbi Algorithm. The new formulation is tested in the context of speaker-independent isolated word recognition for both clean and noisy patterns. When 10 percent of speech is affected by a burst noise at -5 dB Signal to Noise Ratio (local), it is shown that joint decoding using only two noisy patterns reduces the noisy speech recognition error rate to about 51 percent, when compared to the single pattern decoding using the Viterbi Algorithm. In contrast, a simple maximization of individual pattern likelihoods provides only about 7 percent reduction in error rate.

Relevance: 90.00%

Abstract:

In this article, we aim to reduce the error rate of an online Tamil symbol recognition system by employing multiple experts to reevaluate certain decisions of the primary support vector machine classifier. First, motivated by the relatively high percentage of occurrence of base consonants in the script, a reevaluation technique is proposed to correct any ambiguities arising in the base consonants. Secondly, a dynamic time-warping method is proposed to automatically extract the discriminative regions for each set of confused characters. Class-specific features derived from these regions aid in reducing the degree of confusion. Thirdly, statistics of specific features are proposed for resolving any confusions in vowel modifiers. The reevaluation approaches are tested on two databases: (a) the isolated Tamil symbols in the IWFHR test set, and (b) the symbols segmented from a set of 10,000 Tamil words. The recognition rate of the isolated test symbols of the IWFHR database improves by 1.9 %. For the word database, the incorporation of the reevaluation step improves the symbol recognition rate by 3.5 % (from 88.4 to 91.9 %). This, in turn, boosts the word recognition rate by 11.9 % (from 65.0 to 76.9 %). The reduction in the word error rate has been achieved using a generic approach, without the incorporation of language models.

Relevance: 90.00%

Abstract:

First responders are in danger when they perform tasks in damaged buildings after earthquakes. Structural collapse due to the failure of critical load-bearing structural members (e.g. columns) during a post-earthquake event such as an aftershock can make first responders victims, since they are unable to assess the impact of the damage inflicted on load-bearing members. The writers propose a method that can provide first responders with a crude but quick estimate of the damage inflicted on load-bearing members. Under the proposed method, critical structural members (reinforced concrete columns in this study) are identified from digital visual data, and the damage superimposed on these structural members is detected with the help of Visual Pattern Recognition techniques. The correlation of the two (e.g. the position, orientation and size of a crack on the surface of a column) is used to query a case-based reasoning knowledge base, which contains a priori classified states of columns according to the damage inflicted on them. When query results indicate that the column's damage state is severe, the method assumes that a structural collapse is likely and first responders are warned to evacuate.

Relevance: 90.00%

Abstract:

Reading is an important human-specific skill obtained through extensive learning experience and relies on the ability to rapidly recognize single words. According to behavioral studies, the most important stage of reading is the representation of the “visual word form”, which is independent of the surface visual features of the reading materials. The prelexical visual word form representation is characterized by abstract, highly effective and precise processing. Neuroimaging and neuropsychological studies have investigated the neural basis underlying visual word form processing. On the basis of a summary of the existing literature, the current thesis aimed to address three fundamental questions involving the neural basis of word recognition. First, is there a dedicated neural network that is specialized for word recognition? Second, is orthographic information represented in the putative word/character selective region (VWFA)? Third, what is the role of reading experience in the genesis of the VWFA: is experience, rather than evolutionary selectivity, the main driver shaping the VWFA? Nineteen Chinese literate volunteers, 5 Chinese illiterates and 4 native English speakers participated in this study and performed perceptual tasks during fMRI scanning. To address the first question, we compared the differential responses to three categories of visual objects, i.e., faces, line drawings of objects and Chinese characters, and defined the region of interest (ROI) for the next experiment. To address the second question, Chinese character orthography was manipulated to reveal possible differential responses to real characters, false characters, radical combinations, and stroke combinations in the regions defined by the first experiment. 
To examine the role of reading experience in the genesis of specialization for characters, the responses to unfamiliar Chinese characters in Chinese illiterates and native English speakers were compared with those in the Chinese literates, and the change in cortical activation after short-term reading training in the illiterates was tracked. Data were analyzed in two dimensions. Both BOLD signal amplitude and the spatial distribution pattern across multiple voxels were used to systematically investigate the responsiveness of the left fusiform gyrus to Chinese characters. Our results provide strong and clear evidence for the existence of functionally specialized regions in the human ventral occipital-temporal cortex. In the skilled readers a region specialized for written words could be consistently found in the lateral part of the left fusiform gyrus, line drawings in the median part and faces in the middle. Our results further show that spatial distribution analysis, a method that was not commonly used in neuroimaging of reading, appears to be a more effective measurement of category specialization for visual object processing. Although we failed to provide evidence that the VWFA processes orthographic information in terms of signal intensity, we do show that the response pattern of real characters and radical collections in this area is different from that of false characters and random stroke combinations. Our last set of experiments suggests that the selective bias to reading material is clearly experience dependent. The response to unknown characters in both English speakers/readers and Chinese illiterates is fundamentally different from that of the skilled Chinese readers. The response pattern for unknown characters is more similar to that for line drawings than to a weak version of the response to characters in skilled Chinese readers. 
Short-term training is not sufficient to produce VWFA bias even when tested with learned characters; rather, the learned characters generated an overall upward shift of the activation of the left fusiform region. Formation of a dedicated region specialized for visual words/characters might depend on long-term extensive reading experience, or there might be a critical period for reading acquisition.

Relevance: 90.00%

Abstract:

A number of functional neuroimaging studies with skilled readers have consistently shown activation to visual words in the left mid-fusiform cortex in the occipitotemporal sulcus (LMFC-OTS). Neuropsychological studies have also shown that lesions in left ventral occipitotemporal areas result in impairment of visual word processing. Based on these empirical observations and some theoretical speculations, a few researchers postulated that the LMFC-OTS is responsible for instant, parallel and holistic extraction of the abstract representation of letter strings, and labeled this piece of cortex the “visual word form area” (VWFA). Nonetheless, functional neuroimaging alone is basically a correlative rather than a causal approach, and lesions in the previous studies were typically not confined to the LMFC-OTS but also involved other brain regions beyond this area. Given these limitations, three fundamental questions remain unanswered: Is the LMFC-OTS necessary for visual word processing? Is it functionally selective for visual word processing while unnecessary for processing of non-visual word stimuli? What are its functional properties in visual word processing? This thesis aimed to address these questions through a series of neuropsychological, anatomical and functional MRI experiments in four patients with different degrees of impairment in the left fusiform gyrus. Necessity: Detailed analysis of anatomical brain images revealed that the four patients had differential foci of brain infarction. Specifically, the LMFC-OTS was damaged in one patient, while it remained intact in the other three. Neuropsychological experiments showed that the patient with lesions in the LMFC-OTS had severe impairments in reading aloud and recognizing Chinese characters, i.e., pure alexia. 
The patient with an intact LMFC-OTS, in whom information from the left visual field (LVF) was blocked due to lesions in the splenium of the corpus callosum, showed impairment in Chinese character recognition when the stimuli were presented in the LVF but not in the right visual field (RVF), i.e. left hemialexia. In contrast, the other two patients with an intact LMFC-OTS had normal function in processing Chinese characters. The fMRI experiments demonstrated that there was no significant activation to Chinese characters in the LMFC-OTS of the pure alexic patient, nor in that of the patient with left hemialexia when the stimuli were presented in the LVF. On the other hand, the patient with left hemialexia, when Chinese characters were presented in the RVF, and the other two patients with an intact LMFC-OTS showed activation in the LMFC-OTS. These results together point to the necessity of the LMFC-OTS for Chinese character processing. Selectivity: We tested the selectivity of the LMFC-OTS for visual word processing by systematically examining the patients’ ability to process visual vs. auditory words, and word vs. non-word visual stimuli, such as faces, objects and colors. Results showed that the pure alexic patient could normally process auditory words (expression, understanding and repetition of orally presented words) and non-word visual stimuli (faces, objects, colors and numbers). Although the patient showed some impairments in naming faces, objects and colors, his performance scores were only slightly lower than, or not significantly different from, those of the patients with an intact LMFC-OTS. These data provide compelling evidence that the LMFC-OTS is not requisite for processing non-visual word stimuli, and thus is selective for visual word processing. 
Functional properties: With tasks involving multiple levels and aspects of word processing, including Chinese character reading, phonological judgment, semantic judgment, identity judgment of abstract visual word representation, lexical decision, perceptual judgment of visual word appearance, and dictation, copying, voluntary writing, etc., we attempted to reveal the most critical dysfunction caused by damage to the LMFC-OTS, and thus to clarify the most essential function of this region. Results showed that in addition to dysfunctions in Chinese character reading and in phonological and semantic judgment, the patient with lesions in the LMFC-OTS failed to judge correctly whether two characters (including compound and simple characters) with different surface features (e.g., different fonts, printed vs. handwritten vs. calligraphy styles, simplified characters vs. traditional characters, different orientations of strokes or whole characters) had the same abstract representation. The patient initially showed severe impairments in processing both simple characters and compound characters. He could only copy a compound character in a stroke-by-stroke manner, not in a character-by-character or even radical-by-radical manner. During the recovery process, namely five months later, the patient could complete the abstract representation tasks for simple characters, but showed no improvement for compound characters. However, he could then copy compound characters in a radical-by-radical manner. Furthermore, it seems that the recovery of copying paralleled that of judgment of abstract representation. 
These observations indicate that lesions of the LMFC-OTS in the pure alexic patient caused severe damage to the ability to extract the abstract representation from lower-level units to higher-level units, that the patient had particular difficulty extracting the abstract representation of a whole character from its secondary units (e.g., radicals or single characters), and that this ability was resistant to recovery. Therefore, the LMFC-OTS appears to be responsible for the multilevel (particularly higher-level) abstract representations of visual word form. Successful extraction seems independent of access to phonological and semantic information, given that the alexic patient showed severe impairments in reading aloud and semantic processing of simple characters while maintaining intact judgment of their abstract representation. However, it is also possible that the interaction between the abstract representation and its related information, e.g. phonological and semantic information, was damaged as well in this patient. Taken together, we conclude that: 1) the LMFC-OTS is necessary for Chinese character processing, 2) it is selective for Chinese character processing, and 3) its critical function is to extract multiple levels of abstract representation of visual words and possibly to transmit it to phonological and semantic systems.

Relevance: 90.00%

Abstract:

Visual object recognition requires the matching of an image with a set of models stored in memory. In this paper we propose an approach to recognition in which a 3-D object is represented by the linear combination of 2-D images of the object. If $M = \{M_1, \ldots, M_k\}$ is the set of pictures representing a given object, and $P$ is the 2-D image of an object to be recognized, then $P$ is considered an instance of $M$ if $P = \sum_{i=1}^{k} a_i M_i$ for some constants $a_i$. We show that this approach correctly handles rigid 3-D transformations of objects with sharp as well as smooth boundaries, and can also handle non-rigid transformations. The paper is divided into two parts. In the first part we show that the variety of views depicting the same object under different transformations can often be expressed as the linear combinations of a small number of views. In the second part we suggest how this linear combination property may be used in the recognition process.
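
The linear-combination test described above can be illustrated with a small least-squares sketch. This is an illustrative simplification rather than the paper's procedure: it assumes each view is flattened into a plain list of numbers (for example, feature-point coordinates in a fixed order), fits the coefficients by solving the normal equations, and treats a near-zero residual as "P is an instance of M". The names `solve_linear` and `fit_combination` are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("model views are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_combination(views, p):
    """Least-squares coefficients a_i and residual norm for p ~ sum_i a_i * views[i]."""
    k = len(views)
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    gram = [[dot(views[i], views[j]) for j in range(k)] for i in range(k)]
    rhs = [dot(views[i], p) for i in range(k)]
    coeffs = solve_linear(gram, rhs)
    recon = [sum(coeffs[i] * views[i][t] for i in range(k)) for t in range(len(p))]
    residual = sum((x - y) ** 2 for x, y in zip(recon, p)) ** 0.5
    return coeffs, residual


# Toy usage: two model views and two candidate images.
views = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
coeffs_in, res_in = fit_combination(views, [2.0, 3.0, 3.0, 2.0])    # inside the span
coeffs_out, res_out = fit_combination(views, [1.0, 0.0, 0.0, 0.0])  # outside the span
```

A recognition decision would then compare the residual with a tolerance chosen for the expected image noise; that threshold is an assumption and is not specified in the abstract.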

Relevance: 90.00%

Abstract:

Different approaches to visual object recognition can be divided into two general classes: model-based vs. non-model-based schemes. In this paper we establish some limitations on the class of non-model-based recognition schemes. We show that every function that is invariant to the viewing position of all objects is the trivial (constant) function. It follows that every consistent recognition scheme for recognizing all 3-D objects must in general be model based. The result is extended to recognition schemes that are imperfect (allowed to make mistakes) or restricted to certain classes of objects.