Deep neural networks have recently gained popularity for improving state-of-the-art machine learning algorithms in diverse areas such as speech recognition, computer vision and bioinformatics. Convolutional networks especially have shown prowess in visual recognition tasks such as object recognition and detection, on which this work focuses. Modern award-winning architectures have systematically surpassed previous attempts at tackling computer vision problems and keep winning most current competitions. After a brief study of deep learning architectures and readily available frameworks and libraries, the LeNet handwritten digit recognition network is developed as a case study, and lastly a deep learning network for playing simple video games is reviewed.
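
As a rough illustration of the LeNet case study mentioned above, the sketch below computes the spatial size of each feature map in the classic LeNet-5 layout (32x32 input, 5x5 convolutions without padding, 2x2 subsampling with stride 2). The layer list and the function names `conv_out` and `lenet5_feature_sizes` are assumptions made for this example; the abstract does not say which LeNet variant the work uses.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet5_feature_sizes(input_size=32):
    """Spatial sizes after each LeNet-5 stage (assumed classic layout)."""
    # (layer name, kernel size, stride); hypothetical table for illustration
    layers = [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1), ("S4", 2, 2), ("C5", 5, 1)]
    sizes = []
    size = input_size
    for name, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
        sizes.append((name, size))
    return sizes


for name, size in lenet5_feature_sizes():
    print(name, size)
```

Under these assumptions the 32x32 input shrinks to a 1x1 map at C5, which is why that layer behaves like a fully connected layer.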


This paper presents an incremental learning solution for Linear Discriminant Analysis (LDA) and its applications to object recognition problems. We apply the sufficient spanning set approximation in three steps, i.e., updates of the total scatter matrix, the between-class scatter matrix and the projected data matrix, which leads to an online solution that closely agrees with the batch solution in accuracy while significantly reducing the computational complexity. The algorithm yields an efficient solution to incremental LDA even when the number of classes as well as the set size is large. The incremental LDA method has also been shown to be useful for semi-supervised online learning. Label propagation is done by integrating the incremental LDA into an EM framework. The method has been demonstrated in the task of merging large datasets which were collected during MPEG standardization for face image retrieval, face authentication using the BANCA dataset, and object categorisation using the Caltech101 dataset. © 2010 Springer Science+Business Media, LLC.
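
To make the idea of updating a scatter matrix without revisiting old data concrete, here is a minimal sketch of the exact merge of two total scatter matrices, using the standard identity S = S1 + S2 + (n1 n2 / (n1 + n2)) (m1 - m2)(m1 - m2)^T. It does not implement the sufficient spanning set approximation described in the abstract, and the helper names `mean_and_scatter` and `merge_scatter` are hypothetical.

```python
def mean_and_scatter(rows):
    """Return (count, mean vector, total scatter matrix) for a list of samples."""
    n = len(rows)
    d = len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    scatter = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
                for j in range(d)] for i in range(d)]
    return n, mean, scatter


def merge_scatter(stats_a, stats_b):
    """Combine two (count, mean, scatter) summaries without the raw data."""
    n1, m1, s1 = stats_a
    n2, m2, s2 = stats_b
    n = n1 + n2
    d = len(m1)
    mean = [(n1 * m1[j] + n2 * m2[j]) / n for j in range(d)]
    diff = [m1[j] - m2[j] for j in range(d)]
    weight = n1 * n2 / n  # weight of the mean-shift correction term
    scatter = [[s1[i][j] + s2[i][j] + weight * diff[i] * diff[j]
                for j in range(d)] for i in range(d)]
    return n, mean, scatter


old_batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
new_batch = [[2.0, 2.0], [0.0, 1.0]]
merged = merge_scatter(mean_and_scatter(old_batch), mean_and_scatter(new_batch))
print(merged)
```

The merged summary matches what a batch computation over all five samples would give, which is the property an incremental method needs before any low-rank approximation is applied.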


Global information is considered the primitive of visual perception in Gestalt psychology. Further, L. Chen (2005) proposed a new theory of topological visual perception. According to this theory, the perception of topological difference is faster than o


A visual target is more difficult to recognize when it is surrounded by other, similar objects. This breakdown in object recognition is known as crowding. Despite a long history of experimental work, computational models of crowding are still sparse. Specifically, few studies have examined crowding using an ideal-observer approach. Here, we compare crowding in ideal observers with crowding in humans. We derived an ideal-observer model for target identification under conditions of position and identity uncertainty. Simulations showed that this model reproduces the hallmark of crowding, namely a critical spacing that scales with viewing eccentricity. To examine how well the model fits quantitatively to human data, we performed three experiments. In Experiments 1 and 2, we measured observers' perceptual uncertainty about stimulus positions and identities, respectively, for a target in isolation. In Experiment 3, observers identified a target that was flanked by two distractors. We found that about half of the errors in Experiment 3 could be accounted for by the perceptual uncertainty measured in Experiments 1 and 2. The remainder of the errors could be accounted for by assuming that uncertainty (i.e., the width of internal noise distribution) about stimulus positions and identities depends on flanker proximity. Our results provide a mathematical restatement of the crowding problem and support the hypothesis that crowding behavior is a sign of optimality rather than a perceptual defect.


An object in the peripheral visual field is more difficult to recognize when surrounded by other objects. This phenomenon is called "crowding". Crowding places a fundamental constraint on human vision that limits performance on numerous tasks. It has been suggested that crowding results from spatial feature integration necessary for object recognition. However, in the absence of convincing models, this theory has remained controversial. Here, we present a quantitative and physiologically plausible model for spatial integration of orientation signals, based on the principles of population coding. Using simulations, we demonstrate that this model coherently accounts for fundamental properties of crowding, including critical spacing, "compulsory averaging", and a foveal-peripheral anisotropy. Moreover, we show that the model predicts increased responses to correlated visual stimuli. Altogether, these results suggest that crowding has little immediate bearing on object recognition but is a by-product of a general, elementary integration mechanism in early vision aimed at improving signal quality.


The visual system must learn to infer the presence of objects and features in the world from the images it encounters, and as such it must, either implicitly or explicitly, model the way these elements interact to create the image. Do the response properties of cells in the mammalian visual system reflect this constraint? To address this question, we constructed a probabilistic model in which the identity and attributes of simple visual elements were represented explicitly and learnt the parameters of this model from unparsed, natural video sequences. After learning, the behaviour and grouping of variables in the probabilistic model corresponded closely to functional and anatomical properties of simple and complex cells in the primary visual cortex (V1). In particular, feature identity variables were activated in a way that resembled the activity of complex cells, while feature attribute variables responded much like simple cells. Furthermore, the grouping of the attributes within the model closely parallelled the reported anatomical grouping of simple cells in cat V1. Thus, this generative model makes explicit an interpretation of complex and simple cells as elements in the segmentation of a visual scene into basic independent features, along with a parametrisation of their moment-by-moment appearances. We speculate that such a segmentation may form the initial stage of a hierarchical system that progressively separates the identity and appearance of more articulated visual elements, culminating in view-invariant object recognition.


Traditional approaches to upper body pose estimation using monocular vision rely on complex body models and a large variety of geometric constraints. We argue that this is not ideal and somewhat inelegant as it results in large processing burdens, and instead attempt to incorporate these constraints through priors obtained directly from training data. A prior distribution covering the probability of a human pose occurring is used to incorporate likely human poses. This distribution is obtained offline, by fitting a Gaussian mixture model to a large dataset of recorded human body poses, tracked using a Kinect sensor. We combine this prior information with a random walk transition model to obtain an upper body model, suitable for use within a recursive Bayesian filtering framework. Our model can be viewed as a mixture of discrete Ornstein-Uhlenbeck processes, in that states behave as random walks, but drift towards a set of typically observed poses. This model is combined with measurements of the human head and hand positions, using recursive Bayesian estimation to incorporate temporal information. Measurements are obtained using face detection and a simple skin colour hand detector, trained using the detected face. The suggested model is designed with analytical tractability in mind and we show that the pose tracking can be Rao-Blackwellised using the mixture Kalman filter, allowing for computational efficiency while still incorporating bio-mechanical properties of the upper body. In addition, the use of the proposed upper body model allows reliable three-dimensional pose estimates to be obtained indirectly for a number of joints that are often difficult to detect using traditional object recognition strategies. Comparisons with Kinect sensor results and the state of the art in 2D pose estimation highlight the efficacy of the proposed approach.
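
The transition model described above, a random walk that drifts towards typically observed poses, can be illustrated with a single-component discrete Ornstein-Uhlenbeck step. This is only a sketch under stated assumptions: the abstract's model is a mixture fitted to Kinect data and is used inside a mixture Kalman filter, whereas the code below uses one hand-picked target pose, and the names `ou_step`, `simulate`, `pull` and `noise_std` are invented for the example.

```python
import random


def ou_step(pose, target, pull, noise_std, rng):
    """One discrete OU step: drift towards target plus Gaussian noise."""
    return [x + pull * (t - x) + rng.gauss(0.0, noise_std)
            for x, t in zip(pose, target)]


def simulate(pose, target, pull, noise_std, steps, seed=0):
    """Run several OU steps from an initial pose with a seeded generator."""
    rng = random.Random(seed)
    for _ in range(steps):
        pose = ou_step(pose, target, pull, noise_std, rng)
    return pose


# With zero noise the state decays geometrically towards the target pose.
print(simulate([0.0, 4.0], [1.0, 2.0], pull=0.5, noise_std=0.0, steps=10))
```

With `pull` between 0 and 1 the distance to the target shrinks by a factor of `1 - pull` per step in the noise-free case, which is the drift behaviour the abstract attributes to each mixture component.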


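Object recognition is widely used in many real-world fields, but occlusion, viewpoint change and other factors still pose great challenges. Local features have attracted attention because of their inherent locality. Combined with spatial distribution constraints, local features can carry high-level semantic information and improve the robustness of object recognition algorithms to occlusion and viewpoint change. This work analyses and compares currently popular local feature detection methods, description methods and spatial distribution constraint methods, and proposes a "center-feature" structural model together with a corresponding object recognition method. First, local feature detection methods are introduced and local feature description methods are studied in depth; they are compared in terms of principle, invariance, matching speed and applicable scenarios. Drawing on the strengths and weaknesses of explicit and implicit models, a "center-feature" structural model is proposed. The model uses the object center as the reference point for measuring the positional relationships among all local features. It retains the accuracy of models such as the star model while removing the special node, which avoids the adverse effects of a missing special node and improves the stability of the algorithm. A corresponding object recognition algorithm is proposed based on this spatial distribution constraint model. The algorithm considers the degree of match in both appearance features and spatial position. The spatial distribution constraint model is built from the appearance features and shape of the object in the template; hypotheses are formed from the appearance feature information of the object to be detected; hypothesis testing quantifies the location and likelihood of the object's presence; and an accelerated algorithm for searching for the object center is proposed. Experiments verify the effectiveness of the algorithm under similarity and affine transformations and show that it has some tolerance to missing features.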
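
One simple way to picture the object center serving as the common reference point is a voting scheme in which every matched local feature proposes a center position from an offset stored with the template. The sketch below is an assumption made for illustration and is not the hypothesis-testing procedure or the accelerated search from the abstract; it handles translation only, and the name `vote_for_center` and the `cell` parameter are hypothetical.

```python
from collections import Counter


def vote_for_center(matches, cell=4):
    """Return (estimated center, vote count) from feature matches.

    matches: list of ((x, y), (dx, dy)), where (x, y) is a feature position
    in the scene and (dx, dy) is the template offset from feature to center.
    """
    votes = Counter()
    for (x, y), (dx, dy) in matches:
        # Quantize each proposed center into a coarse grid cell.
        votes[(int((x + dx) // cell), int((y + dy) // cell))] += 1
    (cx, cy), count = votes.most_common(1)[0]
    # Report the middle of the winning cell as the center estimate.
    return ((cx + 0.5) * cell, (cy + 0.5) * cell), count


matches = [((10, 10), (5, 5)), ((20, 12), (-6, 2)),
           ((14, 20), (1, -6)), ((40, 40), (3, 3))]
print(vote_for_center(matches))
```

Because no single feature plays a privileged role, losing any one match only removes one vote, which mirrors the robustness argument made for dropping the special node.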


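Illumination is one of the key factors affecting imaging. When illumination conditions change, the differences between images of the same object can be very large, sometimes even larger than the differences between images of different objects. In many object recognition scenarios, illumination is not under human control, which makes object recognition under varying illumination a common and challenging problem. This work analyses in depth how changes in illumination properties such as intensity, direction and color affect object imaging; it surveys currently popular illumination-robust object recognition methods, describing their principles and analysing the reasons for their illumination robustness and their applicable conditions. An object recognition method based on image frequency-domain features under low illumination is proposed. By analysing the relationship between affine transformations in the spatial and frequency domains, the method extracts features by pseudo-logarithmic sampling of the Fourier spectrum of the gradient image, which captures low- and mid-frequency features well, suppresses high-frequency noise and avoids the adverse effects of illumination change; a neural network is used for recognition, which effectively extracts affine-invariant features of the object and gives fast recognition. An illumination-robust nonlinear correlation object recognition method is also proposed. The method adopts an information decomposition strategy that decomposes grey-level information into two descriptive components, the regions where change exists and the degree of change within those regions, and selects the more discriminative pixels for matching. Using the angle between vectors as the similarity measure, it works directly on image grey levels and considers image similarity in a high-dimensional vector space, overcoming the difficulty of extracting edges, corners, shapes and other features from low-illumination, low-SNR images. The similarity measure is unaffected by vector magnitude (multiplicative illumination change) and vector translation (additive illumination change), and is therefore invariant to linear illumination change.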
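
The invariance claimed for the angle-based similarity can be checked with a small sketch. It assumes that the additive invariance is obtained by subtracting each vector's mean before measuring the angle, which is one common way to achieve it; the abstract does not spell out the construction, and it also selects discriminative pixels first, which is omitted here. The function name `centered_cosine` is hypothetical.

```python
import math


def centered_cosine(a, b):
    """Cosine of the angle between two mean-centered grey-level vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ca = [x - mean_a for x in a]
    cb = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(ca, cb))
    norm = math.sqrt(sum(x * x for x in ca)) * math.sqrt(sum(y * y for y in cb))
    if norm == 0.0:
        return 0.0  # flat patch: the angle is undefined, report no similarity
    return dot / norm


patch = [10.0, 20.0, 30.0, 25.0]
relit = [3.0 * v + 7.0 for v in patch]  # gain 3 (multiplicative), offset 7 (additive)
print(centered_cosine(patch, relit))
```

For a positive gain, centering cancels the offset and normalization cancels the gain, so the relit patch stays at an angle of essentially zero from the original.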


According to research results reported over the past decades, it is well acknowledged that face recognition is not a trivial task. With the development of electronic devices, we are gradually revealing the secrets of object recognition in the primate visual cortex. Therefore, it is time to reconsider face recognition by using biologically inspired features. In this paper, we represent face images by utilizing C1 units, which correspond to complex cells in the visual cortex and pool over S1 units by using a maximum operation that retains only the maximum response of each local area of S1 units. The new representation is termed C1Face. Because C1Face is naturally a third-order tensor (or a three-dimensional array), we propose three-way discriminative locality alignment (TWDLA), an extension of discriminative locality alignment, which is a top-level discriminative manifold learning-based subspace learning algorithm. TWDLA has the following advantages: (1) it takes third-order tensors as input directly, so the structure information can be well preserved; (2) it models the local geometry over every modality of the input tensors, so the spatial relations of input tensors within a class can be preserved; (3) it maximizes the margin between a tensor and tensors from other classes over each modality, so it performs well for recognition tasks; and (4) it has no undersampling problem. Extensive experiments on the YALE and FERET datasets show that (1) the proposed C1Face representation can better represent face images than raw pixels and (2) TWDLA can duly preserve both the local geometry and the discriminative information over every modality for recognition.
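
The maximum operation that pools S1 responses into C1 responses can be sketched as plain max pooling over local areas of a 2D response map. The non-overlapping windows and the single-scale, single-orientation input are simplifying assumptions for this example, and `max_pool` is a hypothetical helper name.

```python
def max_pool(s1, size):
    """Non-overlapping size x size max pooling over a 2D response map."""
    rows, cols = len(s1), len(s1[0])
    return [[max(s1[r][c]
                 for r in range(r0, min(r0 + size, rows))
                 for c in range(c0, min(c0 + size, cols)))
             for c0 in range(0, cols, size)]
            for r0 in range(0, rows, size)]


s1_response = [[1, 2, 0, 1],
               [3, 4, 1, 0],
               [0, 0, 5, 6],
               [1, 0, 7, 8]]
print(max_pool(s1_response, 2))
```

Stacking such pooled maps over orientations or scales yields a three-dimensional array, which is the kind of third-order tensor that the abstract feeds directly into TWDLA.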


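As the range of applications for mobile robots expands, improving their autonomous navigation capability in dynamic, unstructured environments has become an urgent problem in mobile robotics research. Among the key technologies for autonomous robot navigation, recognition is the hardest to solve and also the most urgently needed. As an important navigation sensor, vision offers advantages over other sensors such as rich information, light weight and low power consumption, so vision-based recognition is widely regarded as the most promising research direction. Supported by a national defense basic research project and a Chinese Academy of Sciences open laboratory fund project, and using the "wheel-leg hybrid robot" and the "unmanned aerial vehicle (UAV)" developed in-house at the Shenyang Institute of Automation as experimental platforms, this thesis conducts targeted research on application problems that urgently need solving in the autonomous navigation of ground robots and UAVs, with the aim of improving the adaptability of mobile robots in dynamic, unstructured environments. The main contents of the thesis are as follows. First, to improve the autonomy of ground mobile robots in complex environments, a stereo-vision-based obstacle detection algorithm for outdoor unstructured environments is proposed. A method is first given for effectively estimating the Main Ground Disparity (MGD) from the V-disparity image. Suspected obstacles and then final obstacles are identified and localized through coarse-to-fine stepwise judgment. The method has been deployed on a ground autonomous mobile platform, and experiments in a variety of scenes verify its accuracy and speed. Second, in the context of skyline recognition for UAVs, an accurate, real-time skyline recognition algorithm is proposed, from which attitude angles are estimated. An energy functional model is established for the skyline, and the corresponding partial differential equation is derived using the variational principle. For real-time performance in practice, a piecewise-linear constraint is introduced to simplify the model, and the skyline is then recognized in a coarse-to-fine manner. Specifically, the image is preprocessed and partitioned vertically; a simplified horizontal line model is used for coarse recognition of the skyline, and the coarse result is obtained by fitting; finally, the skyline is recognized precisely under the constraint of a hybrid gradient- and region-based open-curve model, from which the UAV's roll and pitch angles are estimated. Third, based on an analysis of the target characteristics of airport runways in infrared imagery, a new parallel infrared image segmentation algorithm based on the 1D Haar wavelet is designed; features are then extracted from the segmented regions; finally, two common recognition methods, the support vector machine (SVM) and voting, are used to classify and recognize candidate target regions. Tests on real video and simulated infrared images verify the speed, reliability and real-time performance of the algorithm, whose average processing time is 30 ms per frame. Finally, to address the problems encountered in automatic crowd monitoring during UAV aerial patrol, the problem is simplified to pedestrian-flow density monitoring from a fixed viewpoint, and a new algorithm based on velocity field estimation is proposed for counting people crossing a line and estimating crowd density within a region. The algorithm first treats the flow of people crossing the line as a moving flow field and gives a motion estimation model that effectively estimates the 1D velocity field; then, by estimating and integrating the velocity of the dynamic flow, it stitches the line-crossing flow into dynamic regions; finally, area and edge information is extracted from each dynamic region, and regression analysis is used to estimate crowd density. Unlike previous methods based on scene learning, this method uses angle-based learning, which makes it convenient for practical use.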
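
For readers unfamiliar with the V-disparity image used in the obstacle detection step, the sketch below builds one from an integer disparity map: each image row gets a histogram of the disparities found in that row. Picking the most frequent disparity per row is only a stand-in assumption for the Main Ground Disparity estimation, whose actual method is not given in the abstract; `v_disparity` and `dominant_disparity_per_row` are hypothetical names.

```python
def v_disparity(disparity_map, max_disparity):
    """Row-wise disparity histogram: hist[v][d] counts pixels in row v with disparity d."""
    hist = [[0] * (max_disparity + 1) for _ in disparity_map]
    for v, row in enumerate(disparity_map):
        for d in row:
            if 0 <= d <= max_disparity:  # ignore invalid or out-of-range values
                hist[v][d] += 1
    return hist


def dominant_disparity_per_row(hist):
    """Most frequent disparity in each row (first one wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in hist]


disparity_map = [[1, 1, 1, 3],
                 [2, 2, 2, 2],
                 [3, 3, 5, 3]]
hist = v_disparity(disparity_map, max_disparity=5)
print(dominant_disparity_per_row(hist))
```

On flat ground the dominant disparity grows steadily towards the bottom of the image, so the ground shows up as a slanted line in the V-disparity image and pixels that depart from it become obstacle candidates.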


Crowding, generally defined as the deleterious influence of nearby contours on visual discrimination, is ubiquitous in spatial vision. Specifically, long-range effects of non-overlapping distractors can alter the appearance of an object, making it unrecognizable. Theories in many domains, including vision computation and high-level attention, have been proposed to account for crowding. However, neither the compulsory averaging model nor the insufficient spatial resolution of attention provides an adequate explanation for crowding. The present study examined the effects of perceptual organization on crowding. We hypothesize that target-distractor segmentation in crowding is analogous to figure-ground segregation in Gestalt psychology. When distractors can be grouped as a whole or when they are similar to each other but different from the target, the target can be distinguished from distractors. However, grouping target and distractors together by Gestalt principles may interfere with target-distractor separation. Six experiments were carried out to assess our theory. In experiments 1, 2, and 3, we manipulated the similarity between target and distractor as well as the configuration of distractors to investigate the effects of stimulus-driven grouping on target-distractor segmentation. In experiments 4, 5, and 6, we focused on the interaction between bottom-up and top-down processes of grouping, and their influences on target-distractor segmentation. Our results demonstrated that: (a) when distractors were similar to each other but different from the target, crowding was eased; (b) when distractors formed a subjective contour or were placed regularly, crowding was also reduced; (c) both bottom-up and top-down processes could influence target-distractor grouping, mediating the effects of crowding. These results support our hypothesis that figure-ground segregation and target-distractor segmentation in crowding may share similar processes.
The present study not only provides a novel explanation for crowding, but also examines the processing bottleneck in object recognition. These findings have significant implications for computer vision and interface design as well as for clinical practice in amblyopia and dyslexia.


A common design of an object recognition system has two steps, a detection step followed by a foreground within-class classification step. For example, consider face detection by a boosted cascade of detectors followed by face ID recognition via one-vs-all (OVA) classifiers. Another example is human detection followed by pose recognition. Although the detection step can be quite fast, the foreground within-class classification process can be slow and becomes a bottleneck. In this work, we formulate a filter-and-refine scheme, where the binary outputs of the weak classifiers in a boosted detector are used to identify a small number of candidate foreground state hypotheses quickly via Hamming distance or weighted Hamming distance. The approach is evaluated in three applications: face recognition on the FRGC V2 data set, hand shape detection and parameter estimation on a hand data set and vehicle detection and view angle estimation on a multi-view vehicle data set. On all data sets, our approach has comparable accuracy and is at least five times faster than the brute force approach.
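
The filter-and-refine scheme can be sketched in a few lines: binary codes (standing in for the weak-classifier outputs) are compared by Hamming distance to shortlist a few candidates, and only the shortlist is scored by a slower, more accurate function. The gallery layout, the integer encoding of the binary outputs and the names `hamming`, `filter_and_refine` and `refine_score` are assumptions made for this example.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")


def filter_and_refine(query_code, query_feat, gallery, k, refine_score):
    """Shortlist the k closest codes, then return the best refined label.

    gallery: list of (label, code, feature) tuples.
    refine_score: slower scoring function; higher means a better match.
    """
    shortlist = sorted(gallery, key=lambda item: hamming(query_code, item[1]))[:k]
    best = max(shortlist, key=lambda item: refine_score(query_feat, item[2]))
    return best[0]


def neg_sq_dist(u, v):
    """Example refine score: negative squared Euclidean distance."""
    return -sum((x - y) ** 2 for x, y in zip(u, v))


gallery = [("a", 0b1100, [0.0, 0.0]), ("b", 0b1010, [1.0, 1.0]),
           ("c", 0b0011, [0.9, 1.0]), ("d", 0b1011, [5.0, 5.0])]
print(filter_and_refine(0b1011, [1.0, 1.1], gallery, k=2, refine_score=neg_sq_dist))
```

The speed-up comes from calling `refine_score` on only `k` candidates instead of the whole gallery; the choice of `k` trades accuracy for speed.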


A method for deformable shape detection and recognition is described. Deformable shape templates are used to partition the image into a globally consistent interpretation, determined in part by the minimum description length principle. Statistical shape models enforce the prior probabilities on global, parametric deformations for each object class. Once trained, the system autonomously segments deformed shapes from the background, while not merging them with adjacent objects or shadows. The formulation can be used to group image regions based on any image homogeneity predicate; e.g., texture, color, or motion. The recovered shape models can be used directly in object recognition. Experiments with color imagery are reported.