915 resultados para Computer vision syndrome
Resumo:
Increasing the size of training data in many computer vision tasks has shown to be very effective. Using large scale image datasets (e.g. ImageNet) with simple learning techniques (e.g. linear classifiers) one can achieve state-of-the-art performance in object recognition compared to sophisticated learning techniques on smaller image sets. Semantic search on visual data has become very popular. There are billions of images on the internet and the number is increasing every day. Dealing with large scale image sets is intense per se. They take a significant amount of memory that makes it impossible to process the images with complex algorithms on single CPU machines. Finding an efficient image representation can be a key to attack this problem. A representation being efficient is not enough for image understanding. It should be comprehensive and rich in carrying semantic information. In this proposal we develop an approach to computing binary codes that provide a rich and efficient image representation. We demonstrate several tasks in which binary features can be very effective. We show how binary features can speed up large scale image classification. We present learning techniques to learn the binary features from supervised image set (With different types of semantic supervision; class labels, textual descriptions). We propose several problems that are very important in finding and using efficient image representation.
Resumo:
A cor da superfície dos alimentos é o primeiro parâmetro de qualidade avaliado pelos consumidores, e é critico para a aceitação do produto, então a medição adequada da cor é uma importante ferramenta. Nesta pesquisa avaliou-se a variação da cor em corvina (Micropogonias furnieri) armazenada em gelo durante 16 dias; os parâmetros de luminosidade (L*), valor cromático a*, valor cromático b*, variação total da cor (ΔE) e croma (C*) foram obtidos por sistema de visão computacional, e por colorímetro Konica Minolta CR-400. O frescor da corvina baseada nas mudanças da cor das brânquias foi avaliado utilizando um sistema de visão computacional. Também se modelou a oxidação da mioglobina em files de burriquete (Pogonias cromis), utilizando os parâmetros de vermelho (valor a* e R). Para registrar as mudanças da cor durante 57,6 h utilizou-se um sistema de visão computacional, a análise química realizou-se determinando a concentração de metamioglobina (%). Na avaliação da cor de corvina armazenada em gelo, o sistema de visão computacional mostrou diferenças significativas para L*, a*, ΔE e C*, enquanto que o colorímetro mostrou diferenças significativas para L* e ΔE, o único parâmetro que não apresentou diferenças entre instrumentos foi ΔE durante a avaliação da corvina armazenada em gelo. O coeficiente de correlação entre os parâmetros da cor (L*, a* e b*) das brânquias da corvina armazenada em gelo pelo tempo de armazenamento foi de 0,9747. O sistema de visão computacional registrou as mudanças da cor em filés de burriquete e se modelaram as mudanças utilizando um modelo exponencial. O sistema de visão computacional mostrou ser mais sensível às mudanças da cor durante a avaliação da cor na corvina armazenada em gelo. É possível prognosticar o tempo de armazenamento da corvina em gelo em função da mudança da cor das brânquias. Assim, foi possível modelar a variação da mioglobina em filés de burriquete utilizando sistemas de visão computacional para registrar ditas mudanças. Os sistemas de visão computacional têm grande capacidade para registrar as mudanças da cor e é possível utiliza-los para avaliar os alimentos em função da cor.
Resumo:
Visual recognition is a fundamental research topic in computer vision. This dissertation explores datasets, features, learning, and models used for visual recognition. In order to train visual models and evaluate different recognition algorithms, this dissertation develops an approach to collect object image datasets on web pages using an analysis of text around the image and of image appearance. This method exploits established online knowledge resources (Wikipedia pages for text; Flickr and Caltech data sets for images). The resources provide rich text and object appearance information. This dissertation describes results on two datasets. The first is Berg’s collection of 10 animal categories; on this dataset, we significantly outperform previous approaches. On an additional set of 5 categories, experimental results show the effectiveness of the method. Images are represented as features for visual recognition. This dissertation introduces a text-based image feature and demonstrates that it consistently improves performance on hard object classification problems. The feature is built using an auxiliary dataset of images annotated with tags, downloaded from the Internet. Image tags are noisy. The method obtains the text features of an unannotated image from the tags of its k-nearest neighbors in this auxiliary collection. A visual classifier presented with an object viewed under novel circumstances (say, a new viewing direction) must rely on its visual examples. This text feature may not change, because the auxiliary dataset likely contains a similar picture. While the tags associated with images are noisy, they are more stable when appearance changes. The performance of this feature is tested using PASCAL VOC 2006 and 2007 datasets. This feature performs well; it consistently improves the performance of visual object classifiers, and is particularly effective when the training dataset is small. With more and more collected training data, computational cost becomes a bottleneck, especially when training sophisticated classifiers such as kernelized SVM. This dissertation proposes a fast training algorithm called Stochastic Intersection Kernel Machine (SIKMA). This proposed training method will be useful for many vision problems, as it can produce a kernel classifier that is more accurate than a linear classifier, and can be trained on tens of thousands of examples in two minutes. It processes training examples one by one in a sequence, so memory cost is no longer the bottleneck to process large scale datasets. This dissertation applies this approach to train classifiers of Flickr groups with many group training examples. The resulting Flickr group prediction scores can be used to measure image similarity between two images. Experimental results on the Corel dataset and a PASCAL VOC dataset show the learned Flickr features perform better on image matching, retrieval, and classification than conventional visual features. Visual models are usually trained to best separate positive and negative training examples. However, when recognizing a large number of object categories, there may not be enough training examples for most objects, due to the intrinsic long-tailed distribution of objects in the real world. This dissertation proposes an approach to use comparative object similarity. The key insight is that, given a set of object categories which are similar and a set of categories which are dissimilar, a good object model should respond more strongly to examples from similar categories than to examples from dissimilar categories. This dissertation develops a regularized kernel machine algorithm to use this category dependent similarity regularization. Experiments on hundreds of categories show that our method can make significant improvement for categories with few or even no positive examples.
Resumo:
One of the most significant research topics in computer vision is object detection. Most of the reported object detection results localise the detected object within a bounding box, but do not explicitly label the edge contours of the object. Since object contours provide a fundamental diagnostic of object shape, some researchers have initiated work on linear contour feature representations for object detection and localisation. However, linear contour feature-based localisation is highly dependent on the performance of linear contour detection within natural images, and this can be perturbed significantly by a cluttered background. In addition, the conventional approach to achieving rotation-invariant features is to rotate the feature receptive field to align with the local dominant orientation before computing the feature representation. Grid resampling after rotation adds extra computational cost and increases the total time consumption for computing the feature descriptor. Though it is not an expensive process if using current computers, it is appreciated that if each step of the implementation is faster to compute especially when the number of local features is increasing and the application is implemented on resource limited ”smart devices”, such as mobile phones, in real-time. Motivated by the above issues, a 2D object localisation system is proposed in this thesis that matches features of edge contour points, which is an alternative method that takes advantage of the shape information for object localisation. This is inspired by edge contour points comprising the basic components of shape contours. In addition, edge point detection is usually simpler to achieve than linear edge contour detection. Therefore, the proposed localization system could avoid the need for linear contour detection and reduce the pathological disruption from the image background. Moreover, since natural images usually comprise many more edge contour points than interest points (i.e. corner points), we also propose new methods to generate rotation-invariant local feature descriptors without pre-rotating the feature receptive field to improve the computational efficiency of the whole system. In detail, the 2D object localisation system is achieved by matching edge contour points features in a constrained search area based on the initial pose-estimate produced by a prior object detection process. The local feature descriptor obtains rotation invariance by making use of rotational symmetry of the hexagonal structure. Therefore, a set of local feature descriptors is proposed based on the hierarchically hexagonal grouping structure. Ultimately, the 2D object localisation system achieves a very promising performance based on matching the proposed features of edge contour points with the mean correct labelling rate of the edge contour points 0.8654 and the mean false labelling rate 0.0314 applied on the data from Amsterdam Library of Object Images (ALOI). Furthermore, the proposed descriptors are evaluated by comparing to the state-of-the-art descriptors and achieve competitive performances in terms of pose estimate with around half-pixel pose error.
Resumo:
International audience
Resumo:
Hand detection on images has important applications on person activities recognition. This thesis focuses on PASCAL Visual Object Classes (VOC) system for hand detection. VOC has become a popular system for object detection, based on twenty common objects, and has been released with a successful deformable parts model in VOC2007. A hand detection on an image is made when the system gets a bounding box which overlaps with at least 50% of any ground truth bounding box for a hand on the image. The initial average precision of this detector is around 0.215 compared with a state-of-art of 0.104; however, color and frequency features for detected bounding boxes contain important information for re-scoring, and the average precision can be improved to 0.218 with these features. Results show that these features help on getting higher precision for low recall, even though the average precision is similar.
Resumo:
Humans have a high ability to extract visual data information acquired by sight. Trought a learning process, which starts at birth and continues throughout life, image interpretation becomes almost instinctively. At a glance, one can easily describe a scene with reasonable precision, naming its main components. Usually, this is done by extracting low-level features such as edges, shapes and textures, and associanting them to high level meanings. In this way, a semantic description of the scene is done. An example of this, is the human capacity to recognize and describe other people physical and behavioral characteristics, or biometrics. Soft-biometrics also represents inherent characteristics of human body and behaviour, but do not allow unique person identification. Computer vision area aims to develop methods capable of performing visual interpretation with performance similar to humans. This thesis aims to propose computer vison methods which allows high level information extraction from images in the form of soft biometrics. This problem is approached in two ways, unsupervised and supervised learning methods. The first seeks to group images via an automatic feature extraction learning , using both convolution techniques, evolutionary computing and clustering. In this approach employed images contains faces and people. Second approach employs convolutional neural networks, which have the ability to operate on raw images, learning both feature extraction and classification processes. Here, images are classified according to gender and clothes, divided into upper and lower parts of human body. First approach, when tested with different image datasets obtained an accuracy of approximately 80% for faces and non-faces and 70% for people and non-person. The second tested using images and videos, obtained an accuracy of about 70% for gender, 80% to the upper clothes and 90% to lower clothes. The results of these case studies, show that proposed methods are promising, allowing the realization of automatic high level information image annotation. This opens possibilities for development of applications in diverse areas such as content-based image and video search and automatica video survaillance, reducing human effort in the task of manual annotation and monitoring.
Resumo:
The purpose of this work is to demonstrate and to assess a simple algorithm for automatic estimation of the most salient region in an image, that have possible application in computer vision. The algorithm uses the connection between color dissimilarities in the image and the image’s most salient region. The algorithm also avoids using image priors. Pixel dissimilarity is an informal function of the distance of a specific pixel’s color to other pixels’ colors in an image. We examine the relation between pixel color dissimilarity and salient region detection on the MSRA1K image dataset. We propose a simple algorithm for salient region detection through random pixel color dissimilarity. We define dissimilarity by accumulating the distance between each pixel and a sample of n other random pixels, in the CIELAB color space. An important result is that random dissimilarity between each pixel and just another pixel (n = 1) is enough to create adequate saliency maps when combined with median filter, with competitive average performance if compared with other related methods in the saliency detection research field. The assessment was performed by means of precision-recall curves. This idea is inspired on the human attention mechanism that is able to choose few specific regions to focus on, a biological system that the computer vision community aims to emulate. We also review some of the history on this topic of selective attention.
Resumo:
The research described in this thesis was motivated by the need of a robust model capable of representing 3D data obtained with 3D sensors, which are inherently noisy. In addition, time constraints have to be considered as these sensors are capable of providing a 3D data stream in real time. This thesis proposed the use of Self-Organizing Maps (SOMs) as a 3D representation model. In particular, we proposed the use of the Growing Neural Gas (GNG) network, which has been successfully used for clustering, pattern recognition and topology representation of multi-dimensional data. Until now, Self-Organizing Maps have been primarily computed offline and their application in 3D data has mainly focused on free noise models, without considering time constraints. It is proposed a hardware implementation leveraging the computing power of modern GPUs, which takes advantage of a new paradigm coined as General-Purpose Computing on Graphics Processing Units (GPGPU). The proposed methods were applied to different problem and applications in the area of computer vision such as the recognition and localization of objects, visual surveillance or 3D reconstruction.
Resumo:
Image (Video) retrieval is an interesting problem of retrieving images (videos) similar to the query. Images (Videos) are represented in an input (feature) space and similar images (videos) are obtained by finding nearest neighbors in the input representation space. Numerous input representations both in real valued and binary space have been proposed for conducting faster retrieval. In this thesis, we present techniques that obtain improved input representations for retrieval in both supervised and unsupervised settings for images and videos. Supervised retrieval is a well known problem of retrieving same class images of the query. We address the practical aspects of achieving faster retrieval with binary codes as input representations for the supervised setting in the first part, where binary codes are used as addresses into hash tables. In practice, using binary codes as addresses does not guarantee fast retrieval, as similar images are not mapped to the same binary code (address). We address this problem by presenting an efficient supervised hashing (binary encoding) method that aims to explicitly map all the images of the same class ideally to a unique binary code. We refer to the binary codes of the images as `Semantic Binary Codes' and the unique code for all same class images as `Class Binary Code'. We also propose a new class based Hamming metric that dramatically reduces the retrieval times for larger databases, where only hamming distance is computed to the class binary codes. We also propose a Deep semantic binary code model, by replacing the output layer of a popular convolutional Neural Network (AlexNet) with the class binary codes and show that the hashing functions learned in this way outperforms the state of the art, and at the same time provide fast retrieval times. In the second part, we also address the problem of supervised retrieval by taking into account the relationship between classes. For a given query image, we want to retrieve images that preserve the relative order i.e. we want to retrieve all same class images first and then, the related classes images before different class images. We learn such relationship aware binary codes by minimizing the similarity between inner product of the binary codes and the similarity between the classes. We calculate the similarity between classes using output embedding vectors, which are vector representations of classes. Our method deviates from the other supervised binary encoding schemes as it is the first to use output embeddings for learning hashing functions. We also introduce new performance metrics that take into account the related class retrieval results and show significant gains over the state of the art. High Dimensional descriptors like Fisher Vectors or Vector of Locally Aggregated Descriptors have shown to improve the performance of many computer vision applications including retrieval. In the third part, we will discuss an unsupervised technique for compressing high dimensional vectors into high dimensional binary codes, to reduce storage complexity. In this approach, we deviate from adopting traditional hyperplane hashing functions and instead learn hyperspherical hashing functions. The proposed method overcomes the computational challenges of directly applying the spherical hashing algorithm that is intractable for compressing high dimensional vectors. A practical hierarchical model that utilizes divide and conquer techniques using the Random Select and Adjust (RSA) procedure to compress such high dimensional vectors is presented. We show that our proposed high dimensional binary codes outperform the binary codes obtained using traditional hyperplane methods for higher compression ratios. In the last part of the thesis, we propose a retrieval based solution to the Zero shot event classification problem - a setting where no training videos are available for the event. To do this, we learn a generic set of concept detectors and represent both videos and query events in the concept space. We then compute similarity between the query event and the video in the concept space and videos similar to the query event are classified as the videos belonging to the event. We show that we significantly boost the performance using concept features from other modalities.
Resumo:
Presentaciones de la asignatura Interfaces para Entornos Inteligentes del Máster en Tecnologías de la Informática/Machine Learning and Data Mining.
Resumo:
International audience
Resumo:
A computer vision system that has to interact in natural language needs to understand the visual appearance of interactions between objects along with the appearance of objects themselves. Relationships between objects are frequently mentioned in queries of tasks like semantic image retrieval, image captioning, visual question answering and natural language object detection. Hence, it is essential to model context between objects for solving these tasks. In the first part of this thesis, we present a technique for detecting an object mentioned in a natural language query. Specifically, we work with referring expressions which are sentences that identify a particular object instance in an image. In many referring expressions, an object is described in relation to another object using prepositions, comparative adjectives, action verbs etc. Our proposed technique can identify both the referred object and the context object mentioned in such expressions. Context is also useful for incrementally understanding scenes and videos. In the second part of this thesis, we propose techniques for searching for objects in an image and events in a video. Our proposed incremental algorithms use the context from previously explored regions to prioritize the regions to explore next. The advantage of incremental understanding is restricting the amount of computation time and/or resources spent for various detection tasks. Our first proposed technique shows how to learn context in indoor scenes in an implicit manner and use it for searching for objects. The second technique shows how explicitly written context rules of one-on-one basketball can be used to sequentially detect events in a game.
Resumo:
International audience