977 resultados para visual words
Resumo:
Previous studies have demonstrated that a region in the left ventral occipito-temporal (LvOT) cortex is highly selective to the visual forms of written words and objects relative to closely matched visual stimuli. Here, we investigated why LvOT activation is not higher for reading than picture naming even though written words and pictures of objects have grossly different visual forms. To compare neuronal responses for words and pictures within the same LvOT area, we used functional magnetic resonance imaging adaptation and instructed participants to name target stimuli that followed briefly presented masked primes that were either presented in the same stimulus type as the target (word-word, picture-picture) or a different stimulus type (picture-word, word-picture). We found that activation throughout posterior and anterior parts of LvOT was reduced when the prime had the same name/response as the target irrespective of whether the prime-target relationship was within or between stimulus type. As posterior LvOT is a visual form processing area, and there was no visual form similarity between different stimulus types, we suggest that our results indicate automatic top-down influences from pictures to words and words to pictures. This novel perspective motivates further investigation of the functional properties of this intriguing region.
Resumo:
Different from the first attempts to solve the image categorization problem (often based on global features), recently, several researchers have been tackling this research branch through a new vantage point - using features around locally invariant interest points and visual dictionaries. Although several advances have been done in the visual dictionaries literature in the past few years, a problem we still need to cope with is calculation of the number of representative words in the dictionary. Therefore, in this paper we introduce a new solution for automatically finding the number of visual words in an N-Way image categorization problem by means of supervised pattern classification based on optimum-path forest. © 2011 IEEE.
Resumo:
Generic object recognition is an important function of the human visual system and everybody finds it highly useful in their everyday life. For an artificial vision system it is a really hard, complex and challenging task because instances of the same object category can generate very different images, depending of different variables such as illumination conditions, the pose of an object, the viewpoint of the camera, partial occlusions, and unrelated background clutter. The purpose of this thesis is to develop a system that is able to classify objects in 2D images based on the context, and identify to which category the object belongs to. Given an image, the system can classify it and decide the correct categorie of the object. Furthermore the objective of this thesis is also to test the performance and the precision of different supervised Machine Learning algorithms in this specific task of object image categorization. Through different experiments the implemented application reveals good categorization performances despite the difficulty of the problem. However this project is open to future improvement; it is possible to implement new algorithms that has not been invented yet or using other techniques to extract features to make the system more reliable. This application can be installed inside an embedded system and after trained (performed outside the system), so it can become able to classify objects in a real-time. The information given from a 3D stereocamera, developed inside the department of Computer Engineering of the University of Bologna, can be used to improve the accuracy of the classification task. The idea is to segment a single object in a scene using the depth given from a stereocamera and in this way make the classification more accurate.
Contribuições para a localização e mapeamento em robótica através da identificação visual de lugares
Resumo:
Tese de doutoramento, Informática (Engenharia Informática), Universidade de Lisboa, Faculdade de Ciências, 2015
Resumo:
Scene classification based on latent Dirichlet allocation (LDA) is a more general modeling method known as a bag of visual words, in which the construction of a visual vocabulary is a crucial quantization process to ensure success of the classification. A framework is developed using the following new aspects: Gaussian mixture clustering for the quantization process, the use of an integrated visual vocabulary (IVV), which is built as the union of all centroids obtained from the separate quantization process of each class, and the usage of some features, including edge orientation histogram, CIELab color moments, and gray-level co-occurrence matrix (GLCM). The experiments are conducted on IKONOS images with six semantic classes (tree, grassland, residential, commercial/industrial, road, and water). The results show that the use of an IVV increases the overall accuracy (OA) by 11 to 12% and 6% when it is implemented on the selected and all features, respectively. The selected features of CIELab color moments and GLCM provide a better OA than the implementation over CIELab color moment or GLCM as individuals. The latter increases the OA by only ∼2 to 3%. Moreover, the results show that the OA of LDA outperforms the OA of C4.5 and naive Bayes tree by ∼20%. © 2014 Society of Photo-Optical Instrumentation Engineers (SPIE) [DOI: 10.1117/1.JRS.8.083690]
Resumo:
Image categorization by means of bag of visual words has received increasing attention by the image processing and vision communities in the last years. In these approaches, each image is represented by invariant points of interest which are mapped to a Hilbert Space representing a visual dictionary which aims at comprising the most discriminative features in a set of images. Notwithstanding, the main problem of such approaches is to find a compact and representative dictionary. Finding such representative dictionary automatically with no user intervention is an even more difficult task. In this paper, we propose a method to automatically find such dictionary by employing a recent developed graph-based clustering algorithm called Optimum-Path Forest, which does not make any assumption about the visual dictionary's size and is more efficient and effective than the state-of-the-art techniques used for dictionary generation. © 2012 IEEE.
Resumo:
Image categorization by means of bag of visual words has received increasing attention by the image processing and vision communities in the last years. In these approaches, each image is represented by invariant points of interest which are mapped to a Hilbert Space representing a visual dictionary which aims at comprising the most discriminative features in a set of images. Notwithstanding, the main problem of such approaches is to find a compact and representative dictionary. Finding such representative dictionary automatically with no user intervention is an even more difficult task. In this paper, we propose a method to automatically find such dictionary by employing a recent developed graph-based clustering algorithm called Optimum-Path Forest, which does not make any assumption about the visual dictionary's size and is more efficient and effective than the state-of-the-art techniques used for dictionary generation.
Resumo:
In two experiments, electric brain waves of 14 subjects were recorded under several different conditions to study the invariance of brain-wave representations of simple patches of colors and simple visual shapes and their names, the words blue, circle, etc. As in our earlier work, the analysis consisted of averaging over trials to create prototypes and test samples, to both of which Fourier transforms were applied, followed by filtering and an inverse transformation to the time domain. A least-squares criterion of fit between prototypes and test samples was used for classification. The most significant results were these. By averaging over different subjects, as well as trials, we created prototypes from brain waves evoked by simple visual images and test samples from brain waves evoked by auditory or visual words naming the visual images. We correctly recognized from 60% to 75% of the test-sample brain waves. The general conclusion is that simple shapes such as circles and single-color displays generate brain waves surprisingly similar to those generated by their verbal names. These results, taken together with extensive psychological studies of auditory and visual memory, strongly support the solution proposed for visual shapes, by Bishop Berkeley and David Hume in the 18th century, to the long-standing problem of how the mind represents simple abstract ideas.
Resumo:
Probabilistic topic models have recently been used for activity analysis in video processing, due to their strong capacity to model both local activities and interactions in crowded scenes. In those applications, a video sequence is divided into a collection of uniform non-overlaping video clips, and the high dimensional continuous inputs are quantized into a bag of discrete visual words. The hard division of video clips, and hard assignment of visual words leads to problems when an activity is split over multiple clips, or the most appropriate visual word for quantization is unclear. In this paper, we propose a novel algorithm, which makes use of a soft histogram technique to compensate for the loss of information in the quantization process; and a soft cut technique in the temporal domain to overcome problems caused by separating an activity into two video clips. In the detection process, we also apply a soft decision strategy to detect unusual events.We show that the proposed soft decision approach outperforms its hard decision counterpart in both local and global activity modelling.
Resumo:
This paper presents an investigation into event detection in crowded scenes, where the event of interest co-occurs with other activities and only binary labels at the clip level are available. The proposed approach incorporates a fast feature descriptor from the MPEG domain, and a novel multiple instance learning (MIL) algorithm using sparse approximation and random sensing. MPEG motion vectors are used to build particle trajectories that represent the motion of objects in uniform video clips, and the MPEG DCT coefficients are used to compute a foreground map to remove background particles. Trajectories are transformed into the Fourier domain, and the Fourier representations are quantized into visual words using the K-Means algorithm. The proposed MIL algorithm models the scene as a linear combination of independent events, where each event is a distribution of visual words. Experimental results show that the proposed approaches achieve promising results for event detection compared to the state-of-the-art.
Resumo:
This paper describes a novel system for automatic classification of images obtained from Anti-Nuclear Antibody (ANA) pathology tests on Human Epithelial type 2 (HEp-2) cells using the Indirect Immunofluorescence (IIF) protocol. The IIF protocol on HEp-2 cells has been the hallmark method to identify the presence of ANAs, due to its high sensitivity and the large range of antigens that can be detected. However, it suffers from numerous shortcomings, such as being subjective as well as time and labour intensive. Computer Aided Diagnostic (CAD) systems have been developed to address these problems, which automatically classify a HEp-2 cell image into one of its known patterns (eg. speckled, homogeneous). Most of the existing CAD systems use handpicked features to represent a HEp-2 cell image, which may only work in limited scenarios. We propose a novel automatic cell image classification method termed Cell Pyramid Matching (CPM), which is comprised of regional histograms of visual words coupled with the Multiple Kernel Learning framework. We present a study of several variations of generating histograms and show the efficacy of the system on two publicly available datasets: the ICPR HEp-2 cell classification contest dataset and the SNPHEp-2 dataset.
Resumo:
In this study we investigate previous claims that a region in the left posterior superior temporal sulcus (pSTS) is more activated by audiovisual than unimodal processing. First, we compare audiovisual to visual-visual and auditory-auditory conceptual matching using auditory or visual object names that are paired with pictures of objects or their environmental sounds. Second, we compare congruent and incongruent audiovisual trials when presentation is simultaneous or sequential. Third, we compare audiovisual stimuli that are either verbal (auditory and visual words) or nonverbal (pictures of objects and their associated sounds). The results demonstrate that, when task, attention, and stimuli are controlled, pSTS activation for audiovisual conceptual matching is 1) identical to that observed for intramodal conceptual matching, 2) greater for incongruent than congruent trials when auditory and visual stimuli are simultaneously presented, and 3) identical for verbal and nonverbal stimuli. These results are not consistent with previous claims that pSTS activation reflects the active formation of an integrated audiovisual representation. After a discussion of the stimulus and task factors that modulate activation, we conclude that, when stimulus input, task, and attention are controlled, pSTS is part of a distributed set of regions involved in conceptual matching, irrespective of whether the stimuli are audiovisual, auditory-auditory or visual-visual.
Resumo:
This paper outlines the approach taken by the Speech, Audio, Image and Video Technologies laboratory, and the Applied Data Mining Research Group (SAIVT-ADMRG) in the 2014 MediaEval Social Event Detection (SED) task. We participated in the event based clustering subtask (subtask 1), and focused on investigating the incorporation of image features as another source of data to aid clustering. In particular, we developed a descriptor based around the use of super-pixel segmentation, that allows a low dimensional feature that incorporates both colour and texture information to be extracted and used within the popular bag-of-visual-words (BoVW) approach.
Resumo:
Local spatio-temporal features with a Bag-of-visual words model is a popular approach used in human action recognition. Bag-of-features methods suffer from several challenges such as extracting appropriate appearance and motion features from videos, converting extracted features appropriate for classification and designing a suitable classification framework. In this paper we address the problem of efficiently representing the extracted features for classification to improve the overall performance. We introduce two generative supervised topic models, maximum entropy discrimination LDA (MedLDA) and class- specific simplex LDA (css-LDA), to encode the raw features suitable for discriminative SVM based classification. Unsupervised LDA models disconnect topic discovery from the classification task, hence yield poor results compared to the baseline Bag-of-words framework. On the other hand supervised LDA techniques learn the topic structure by considering the class labels and improve the recognition accuracy significantly. MedLDA maximizes likelihood and within class margins using max-margin techniques and yields a sparse highly discriminative topic structure; while in css-LDA separate class specific topics are learned instead of common set of topics across the entire dataset. In our representation first topics are learned and then each video is represented as a topic proportion vector, i.e. it can be comparable to a histogram of topics. Finally SVM classification is done on the learned topic proportion vector. We demonstrate the efficiency of the above two representation techniques through the experiments carried out in two popular datasets. Experimental results demonstrate significantly improved performance compared to the baseline Bag-of-features framework which uses kmeans to construct histogram of words from the feature vectors.
Resumo:
According to the influential dual-route model of reading (Coltheart, Rastle et al. 2001), there are two routes to access the meaning of visual words: one directly by orthography (orthography-semantic) and the other indirectly via the phonology (phonology-semantic). Because of the dramatic difference between written Chinese and alphabetical languages, it is still on debate whether Chinese readers have the same semantic activation processes as readers of alphabetical languages. In this study, the semantic activation processes in alphabetical German and logographic Chinese were compared. Since the N450 for incongruent color words in the Stroop tasks was induced by the semantic conflict between the meaning of the incongruent color words and color naming, this component could be taken as an index for semantic activation of incongruent color words in Stroop tasks. Two cross-script Stroop experiments were adopted to investigate the semantic activation processes in Chinese and German. The first experiment focused on the the role of phonology, while the second one focused on the realative importance of orthography. Cultural differences in cognitive processing between individuals in western and eastern countries have been found (Nisbett & Miyamoto, 2005). In order to exclude potential differences in basic cognitive processes like visual discrimination capabilities during reading, a visual Oddball experiment with non-lexical materials was conducted with all participants. However, as indicated by the P300 elicited by deviant stimuli in both groups, no group difference was observed. In the first Stroop experiments, color words (e.g., “green”), color-word associates (e.g., “grass”), and homophones of color words were used. These words were embedded into color patches with either congruent color (e.g. word “green” in green color patch) or incongruent colors (e.g. word “green” in either red or yellow or blue color patch). The key point is to observe whether homophones in both languages could induce similar behavioral and ERP Stroop effects to that induced by color words. It was also interesting to observe to which extent the N450 was related to the semantic conflicts. Nineteen Chinese adult readers and twenty German adult readers were asked to respond to the back color of these words in the Stroop experiment in their native languages by pressing the corresponding keys. In the behavioral data, incongruent conditions (incongruent color words, incongruent color-word associates, incongruent homophones) had significantly longer reaction times as compared to corresponding congruent conditions. All incongruent conditions in the Geman group elicited an N450 in the 400 to 500 ms time window. In the Chinese group, the N450 in the same time window was also observed for the incongruent color words and incongruent color-word associates. These results indicated that the N450 was very sensitive to semantic conflict-even words with semantic association to colors (e.g. “grass”) could elicite similar N450. However, the N450 was absent for incongruent homophones of color words in the Chinese group. Instead, in a later time window (600-800 ms), incongruent homophones elicited a positivity over left posterior regions as compared to congruent homophones. Similar positivity was also observed for color words in the 700 to 1000 ms time window in the Chinese group and 600 to 1000 ms time window for incongruent color words and homophones in the Geman group. These results indicate that phonology plays an important role in Geman semantic activation processes, but not in Chinese. In the second Stroop experiment, color words and pseudowords which had similiar visual shape to color words in both languages were used as materials. Another group of eighteen Chinese and twenty Germans were involved in the Stroop experiment in their native languages.The ERPs were recorded during their performance. In the behavioral data, strong and comparable Stroop effects (as counted by substract the reaction times in the congruent conditions from reaction times in the incongruent conditions) were observed. In the ERP data, both incongruent color words and incongruent pseudowords elicited an N450 over the whole brain scalp in both groups. These results indicated that orthography played an equally important role in semantic activation processes in both languages. The results of the two Stroop experiments support the view that the semantic activation process in Chiense readers differs significantly from that in German readers. The former rely mainly on the direct route (orthography-semantic), while the latter use both direct route and incirect route (phonology-semantic). These findings also indicate that the characteritics of different languages shape the semantic activation processes.