887 resultados para Visual Object Recognition


Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper, we present a new approach to visual speech recognition which improves contextual modelling by combining Inter-Frame Dependent and Hidden Markov Models. This approach captures contextual information in visual speech that may be lost using a Hidden Markov Model alone. We apply contextual modelling to a large speaker independent isolated digit recognition task, and compare our approach to two commonly adopted feature based techniques for incorporating speech dynamics. Results are presented from baseline feature based systems and the combined modelling technique. We illustrate that both of these techniques achieve similar levels of performance when used independently. However significant improvements in performance can be achieved through a combination of the two. In particular we report an improvement in excess of 17% relative Word Error Rate in comparison to our best baseline system.

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments, where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances with corruption added in either/both the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach and also compared to any fixed-weighted integration approach in both clean conditions or when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this thesis, three main questions were addressed using event-related potentials (ERPs): (1) the timing of lexical semantic access, (2) the influence of "top-down" processes on visual word processing, and (3) the influence of "bottom-up" factors on visual word processing. The timing of lexical semantic access was investigated in two studies using different designs. In Study 1,14 participants completed two tasks: a standard lexical decision (LD) task which required a word/nonword decision to each target stimulus, and a semantically primed version (LS) of it using the same category of words (e.g., animal) within each block following which participants made a category judgment. In Study 2, another 12 participants performed a standard semantic priming task, where target stimulus words (e.g., nurse) could be either semantically related or unrelated to their primes (e.g., doctor, tree) but the order of presentation was randomized. We found evidence in both ERP studies that lexical semantic access might occur early within the first 200 ms (at about 170 ms for Study 1 and at about 160 ms for Study 2). Our results were consistent with more recent ERP and eye-tracking studies and are in contrast with the traditional research focus on the N400 component. "Top-down" processes, such as a person's expectation and strategic decisions, were possible in Study 1 because of the blocked design, but they were not for Study 2 with a randomized design. Comparing results from two studies, we found that visual word processing could be affected by a person's expectation and the effect occurred early at a sensory/perceptual stage: a semantic task effect in the PI component at about 100 ms in the ERP was found in Study 1 , but not in Study 2. Furthermore, we found that such "top-down" influence on visual word processing might be mediated through separate mechanisms depending on whether the stimulus was a word or a nonword. "Bottom-up" factors involve inherent characteristics of particular words, such as bigram frequency (the total frequency of two-letter combinations of a word), word frequency (the frequency of the written form of a word), and neighborhood density (the number of words that can be generated by changing one letter of an original word or nonword). A bigram frequency effect was found when comparing the results from Studies 1 and 2, but it was examined more closely in Study 3. Fourteen participants performed a similar standard lexical decision task but the words and nonwords were selected systematically to provide a greater range in the aforementioned factors. As a result, a total of 18 word conditions were created with 18 nonword conditions matched on neighborhood density and neighborhood frequency. Using multiple regression analyses, we foimd that the PI amplitude was significantly related to bigram frequency for both words and nonwords, consistent with results from Studies 1 and 2. In addition, word frequency and neighborhood frequency were also able to influence the PI amplitude separately for words and for nonwords and there appeared to be a spatial dissociation between the two effects: for words, the word frequency effect in PI was found at the left electrode site; for nonwords, the neighborhood frequency effect in PI was fovind at the right elecfrode site. The implications of otir findings are discussed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This thesis describes the development of a model-based vision system that exploits hierarchies of both object structure and object scale. The focus of the research is to use these hierarchies to achieve robust recognition based on effective organization and indexing schemes for model libraries. The goal of the system is to recognize parameterized instances of non-rigid model objects contained in a large knowledge base despite the presence of noise and occlusion. Robustness is achieved by developing a system that can recognize viewed objects that are scaled or mirror-image instances of the known models or that contain components sub-parts with different relative scaling, rotation, or translation than in models. The approach taken in this thesis is to develop an object shape representation that incorporates a component sub-part hierarchy- to allow for efficient and correct indexing into an automatically generated model library as well as for relative parameterization among sub-parts, and a scale hierarchy- to allow for a general to specific recognition procedure. After analysis of the issues and inherent tradeoffs in the recognition process, a system is implemented using a representation based on significant contour curvature changes and a recognition engine based on geometric constraints of feature properties. Examples of the system's performance are given, followed by an analysis of the results. In conclusion, the system's benefits and limitations are presented.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Two formulations of model-based object recognition are described. MAP Model Matching evaluates joint hypotheses of match and pose, while Posterior Marginal Pose Estimation evaluates the pose only. Local search in pose space is carried out with the Expectation--Maximization (EM) algorithm. Recognition experiments are described where the EM algorithm is used to refine and evaluate pose hypotheses in 2D and 3D. Initial hypotheses for the 2D experiments were generated by a simple indexing method: Angle Pair Indexing. The Linear Combination of Views method of Ullman and Basri is employed as the projection model in the 3D experiments.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Object recognition is complicated by clutter, occlusion, and sensor error. Since pose hypotheses are based on image feature locations, these effects can lead to false negatives and positives. In a typical recognition algorithm, pose hypotheses are tested against the image, and a score is assigned to each hypothesis. We use a statistical model to determine the score distribution associated with correct and incorrect pose hypotheses, and use binary hypothesis testing techniques to distinguish between them. Using this approach we can compare algorithms and noise models, and automatically choose values for internal system thresholds to minimize the probability of making a mistake.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A fast simulated annealing algorithm is developed for automatic object recognition. The normalized correlation coefficient is used as a measure of the match between a hypothesized object and an image. Templates are generated on-line during the search by transforming model images. Simulated annealing reduces the search time by orders of magnitude with respect to an exhaustive search. The algorithm is applied to the problem of how landmarks, for example, traffic signs, can be recognized by an autonomous vehicle or a navigating robot. The algorithm works well in noisy, real-world images of complicated scenes for model images with high information content.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Many 3D objects in the world around us are strongly constrained. For instance, not only cultural artifacts but also many natural objects are bilaterally symmetric. Thoretical arguments suggest and psychophysical experiments confirm that humans may be better in the recognition of symmetric objects. The hypothesis of symmetry-induced virtual views together with a network model that successfully accounts for human recognition of generic 3D objects leads to predictions that we have verified with psychophysical experiments.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Local descriptors are increasingly used for the task of object recognition because of their perceived robustness with respect to occlusions and to global geometrical deformations. Such a descriptor--based on a set of oriented Gaussian derivative filters-- is used in our recognition system. We report here an evaluation of several techniques for orientation estimation to achieve rotation invariance of the descriptor. We also describe feature selection based on a single training image. Virtual images are generated by rotating and rescaling the image and robust features are selected. The results confirm robust performance in cluttered scenes, in the presence of partial occlusions, and when the object is embedded in different backgrounds.