31 resultados para Visual Speaker Recognition, Visual Speech Recognition, Cascading Appearance-Based Features
Resumo:
La última década ha sido testigo de importantes avances en el campo de la tecnología de reconocimiento de voz. Los sistemas comerciales existentes actualmente poseen la capacidad de reconocer habla continua de múltiples locutores, consiguiendo valores aceptables de error, y sin la necesidad de realizar procedimientos explícitos de adaptación. A pesar del buen momento que vive esta tecnología, el reconocimiento de voz dista de ser un problema resuelto. La mayoría de estos sistemas de reconocimiento se ajustan a dominios particulares y su eficacia depende de manera significativa, entre otros muchos aspectos, de la similitud que exista entre el modelo de lenguaje utilizado y la tarea específica para la cual se está empleando. Esta dependencia cobra aún más importancia en aquellos escenarios en los cuales las propiedades estadísticas del lenguaje varían a lo largo del tiempo, como por ejemplo, en dominios de aplicación que involucren habla espontánea y múltiples temáticas. En los últimos años se ha evidenciado un constante esfuerzo por mejorar los sistemas de reconocimiento para tales dominios. Esto se ha hecho, entre otros muchos enfoques, a través de técnicas automáticas de adaptación. Estas técnicas son aplicadas a sistemas ya existentes, dado que exportar el sistema a una nueva tarea o dominio puede requerir tiempo a la vez que resultar costoso. Las técnicas de adaptación requieren fuentes adicionales de información, y en este sentido, el lenguaje hablado puede aportar algunas de ellas. El habla no sólo transmite un mensaje, también transmite información acerca del contexto en el cual se desarrolla la comunicación hablada (e.g. acerca del tema sobre el cual se está hablando). Por tanto, cuando nos comunicamos a través del habla, es posible identificar los elementos del lenguaje que caracterizan el contexto, y al mismo tiempo, rastrear los cambios que ocurren en estos elementos a lo largo del tiempo. Esta información podría ser capturada y aprovechada por medio de técnicas de recuperación de información (information retrieval) y de aprendizaje de máquina (machine learning). Esto podría permitirnos, dentro del desarrollo de mejores sistemas automáticos de reconocimiento de voz, mejorar la adaptación de modelos del lenguaje a las condiciones del contexto, y por tanto, robustecer al sistema de reconocimiento en dominios con condiciones variables (tales como variaciones potenciales en el vocabulario, el estilo y la temática). En este sentido, la principal contribución de esta Tesis es la propuesta y evaluación de un marco de contextualización motivado por el análisis temático y basado en la adaptación dinámica y no supervisada de modelos de lenguaje para el robustecimiento de un sistema automático de reconocimiento de voz. Esta adaptación toma como base distintos enfoque de los sistemas mencionados (de recuperación de información y aprendizaje de máquina) mediante los cuales buscamos identificar las temáticas sobre las cuales se está hablando en una grabación de audio. Dicha identificación, por lo tanto, permite realizar una adaptación del modelo de lenguaje de acuerdo a las condiciones del contexto. El marco de contextualización propuesto se puede dividir en dos sistemas principales: un sistema de identificación de temática y un sistema de adaptación dinámica de modelos de lenguaje. Esta Tesis puede describirse en detalle desde la perspectiva de las contribuciones particulares realizadas en cada uno de los campos que componen el marco propuesto: _ En lo referente al sistema de identificación de temática, nos hemos enfocado en aportar mejoras a las técnicas de pre-procesamiento de documentos, asimismo en contribuir a la definición de criterios más robustos para la selección de index-terms. – La eficiencia de los sistemas basados tanto en técnicas de recuperación de información como en técnicas de aprendizaje de máquina, y específicamente de aquellos sistemas que particularizan en la tarea de identificación de temática, depende, en gran medida, de los mecanismos de preprocesamiento que se aplican a los documentos. Entre las múltiples operaciones que hacen parte de un esquema de preprocesamiento, la selección adecuada de los términos de indexado (index-terms) es crucial para establecer relaciones semánticas y conceptuales entre los términos y los documentos. Este proceso también puede verse afectado, o bien por una mala elección de stopwords, o bien por la falta de precisión en la definición de reglas de lematización. En este sentido, en este trabajo comparamos y evaluamos diferentes criterios para el preprocesamiento de los documentos, así como también distintas estrategias para la selección de los index-terms. Esto nos permite no sólo reducir el tamaño de la estructura de indexación, sino también mejorar el proceso de identificación de temática. – Uno de los aspectos más importantes en cuanto al rendimiento de los sistemas de identificación de temática es la asignación de diferentes pesos a los términos de acuerdo a su contribución al contenido del documento. En este trabajo evaluamos y proponemos enfoques alternativos a los esquemas tradicionales de ponderado de términos (tales como tf-idf ) que nos permitan mejorar la especificidad de los términos, así como también discriminar mejor las temáticas de los documentos. _ Respecto a la adaptación dinámica de modelos de lenguaje, hemos dividimos el proceso de contextualización en varios pasos. – Para la generación de modelos de lenguaje basados en temática, proponemos dos tipos de enfoques: un enfoque supervisado y un enfoque no supervisado. En el primero de ellos nos basamos en las etiquetas de temática que originalmente acompañan a los documentos del corpus que empleamos. A partir de estas, agrupamos los documentos que forman parte de la misma temática y generamos modelos de lenguaje a partir de dichos grupos. Sin embargo, uno de los objetivos que se persigue en esta Tesis es evaluar si el uso de estas etiquetas para la generación de modelos es óptimo en términos del rendimiento del reconocedor. Por esta razón, nosotros proponemos un segundo enfoque, un enfoque no supervisado, en el cual el objetivo es agrupar, automáticamente, los documentos en clusters temáticos, basándonos en la similaridad semántica existente entre los documentos. Por medio de enfoques de agrupamiento conseguimos mejorar la cohesión conceptual y semántica en cada uno de los clusters, lo que a su vez nos permitió refinar los modelos de lenguaje basados en temática y mejorar el rendimiento del sistema de reconocimiento. – Desarrollamos diversas estrategias para generar un modelo de lenguaje dependiente del contexto. Nuestro objetivo es que este modelo refleje el contexto semántico del habla, i.e. las temáticas más relevantes que se están discutiendo. Este modelo es generado por medio de la interpolación lineal entre aquellos modelos de lenguaje basados en temática que estén relacionados con las temáticas más relevantes. La estimación de los pesos de interpolación está basada principalmente en el resultado del proceso de identificación de temática. – Finalmente, proponemos una metodología para la adaptación dinámica de un modelo de lenguaje general. El proceso de adaptación tiene en cuenta no sólo al modelo dependiente del contexto sino también a la información entregada por el proceso de identificación de temática. El esquema usado para la adaptación es una interpolación lineal entre el modelo general y el modelo dependiente de contexto. Estudiamos también diferentes enfoques para determinar los pesos de interpolación entre ambos modelos. Una vez definida la base teórica de nuestro marco de contextualización, proponemos su aplicación dentro de un sistema automático de reconocimiento de voz. Para esto, nos enfocamos en dos aspectos: la contextualización de los modelos de lenguaje empleados por el sistema y la incorporación de información semántica en el proceso de adaptación basado en temática. En esta Tesis proponemos un marco experimental basado en una arquitectura de reconocimiento en ‘dos etapas’. En la primera etapa, empleamos sistemas basados en técnicas de recuperación de información y aprendizaje de máquina para identificar las temáticas sobre las cuales se habla en una transcripción de un segmento de audio. Esta transcripción es generada por el sistema de reconocimiento empleando un modelo de lenguaje general. De acuerdo con la relevancia de las temáticas que han sido identificadas, se lleva a cabo la adaptación dinámica del modelo de lenguaje. En la segunda etapa de la arquitectura de reconocimiento, usamos este modelo adaptado para realizar de nuevo el reconocimiento del segmento de audio. Para determinar los beneficios del marco de trabajo propuesto, llevamos a cabo la evaluación de cada uno de los sistemas principales previamente mencionados. Esta evaluación es realizada sobre discursos en el dominio de la política usando la base de datos EPPS (European Parliamentary Plenary Sessions - Sesiones Plenarias del Parlamento Europeo) del proyecto europeo TC-STAR. Analizamos distintas métricas acerca del rendimiento de los sistemas y evaluamos las mejoras propuestas con respecto a los sistemas de referencia. ABSTRACT The last decade has witnessed major advances in speech recognition technology. Today’s commercial systems are able to recognize continuous speech from numerous speakers, with acceptable levels of error and without the need for an explicit adaptation procedure. Despite this progress, speech recognition is far from being a solved problem. Most of these systems are adjusted to a particular domain and their efficacy depends significantly, among many other aspects, on the similarity between the language model used and the task that is being addressed. This dependence is even more important in scenarios where the statistical properties of the language fluctuates throughout the time, for example, in application domains involving spontaneous and multitopic speech. Over the last years there has been an increasing effort in enhancing the speech recognition systems for such domains. This has been done, among other approaches, by means of techniques of automatic adaptation. These techniques are applied to the existing systems, specially since exporting the system to a new task or domain may be both time-consuming and expensive. Adaptation techniques require additional sources of information, and the spoken language could provide some of them. It must be considered that speech not only conveys a message, it also provides information on the context in which the spoken communication takes place (e.g. on the subject on which it is being talked about). Therefore, when we communicate through speech, it could be feasible to identify the elements of the language that characterize the context, and at the same time, to track the changes that occur in those elements over time. This information can be extracted and exploited through techniques of information retrieval and machine learning. This allows us, within the development of more robust speech recognition systems, to enhance the adaptation of language models to the conditions of the context, thus strengthening the recognition system for domains under changing conditions (such as potential variations in vocabulary, style and topic). In this sense, the main contribution of this Thesis is the proposal and evaluation of a framework of topic-motivated contextualization based on the dynamic and non-supervised adaptation of language models for the enhancement of an automatic speech recognition system. This adaptation is based on an combined approach (from the perspective of both information retrieval and machine learning fields) whereby we identify the topics that are being discussed in an audio recording. The topic identification, therefore, enables the system to perform an adaptation of the language model according to the contextual conditions. The proposed framework can be divided in two major systems: a topic identification system and a dynamic language model adaptation system. This Thesis can be outlined from the perspective of the particular contributions made in each of the fields that composes the proposed framework: _ Regarding the topic identification system, we have focused on the enhancement of the document preprocessing techniques in addition to contributing in the definition of more robust criteria for the selection of index-terms. – Within both information retrieval and machine learning based approaches, the efficiency of topic identification systems, depends, to a large extent, on the mechanisms of preprocessing applied to the documents. Among the many operations that encloses the preprocessing procedures, an adequate selection of index-terms is critical to establish conceptual and semantic relationships between terms and documents. This process might also be weakened by a poor choice of stopwords or lack of precision in defining stemming rules. In this regard we compare and evaluate different criteria for preprocessing the documents, as well as for improving the selection of the index-terms. This allows us to not only reduce the size of the indexing structure but also to strengthen the topic identification process. – One of the most crucial aspects, in relation to the performance of topic identification systems, is to assign different weights to different terms depending on their contribution to the content of the document. In this sense we evaluate and propose alternative approaches to traditional weighting schemes (such as tf-idf ) that allow us to improve the specificity of terms, and to better identify the topics that are related to documents. _ Regarding the dynamic language model adaptation, we divide the contextualization process into different steps. – We propose supervised and unsupervised approaches for the generation of topic-based language models. The first of them is intended to generate topic-based language models by grouping the documents, in the training set, according to the original topic labels of the corpus. Nevertheless, a goal of this Thesis is to evaluate whether or not the use of these labels to generate language models is optimal in terms of recognition accuracy. For this reason, we propose a second approach, an unsupervised one, in which the objective is to group the data in the training set into automatic topic clusters based on the semantic similarity between the documents. By means of clustering approaches we expect to obtain a more cohesive association of the documents that are related by similar concepts, thus improving the coverage of the topic-based language models and enhancing the performance of the recognition system. – We develop various strategies in order to create a context-dependent language model. Our aim is that this model reflects the semantic context of the current utterance, i.e. the most relevant topics that are being discussed. This model is generated by means of a linear interpolation between the topic-based language models related to the most relevant topics. The estimation of the interpolation weights is based mainly on the outcome of the topic identification process. – Finally, we propose a methodology for the dynamic adaptation of a background language model. The adaptation process takes into account the context-dependent model as well as the information provided by the topic identification process. The scheme used for the adaptation is a linear interpolation between the background model and the context-dependent one. We also study different approaches to determine the interpolation weights used in this adaptation scheme. Once we defined the basis of our topic-motivated contextualization framework, we propose its application into an automatic speech recognition system. We focus on two aspects: the contextualization of the language models used by the system, and the incorporation of semantic-related information into a topic-based adaptation process. To achieve this, we propose an experimental framework based in ‘a two stages’ recognition architecture. In the first stage of the architecture, Information Retrieval and Machine Learning techniques are used to identify the topics in a transcription of an audio segment. This transcription is generated by the recognition system using a background language model. According to the confidence on the topics that have been identified, the dynamic language model adaptation is carried out. In the second stage of the recognition architecture, an adapted language model is used to re-decode the utterance. To test the benefits of the proposed framework, we carry out the evaluation of each of the major systems aforementioned. The evaluation is conducted on speeches of political domain using the EPPS (European Parliamentary Plenary Sessions) database from the European TC-STAR project. We analyse several performance metrics that allow us to compare the improvements of the proposed systems against the baseline ones.
Resumo:
The aim of this Master Thesis is the analysis, design and development of a robust and reliable Human-Computer Interaction interface, based on visual hand-gesture recognition. The implementation of the required functions is oriented to the simulation of a classical hardware interaction device: the mouse, by recognizing a specific hand-gesture vocabulary in color video sequences. For this purpose, a prototype of a hand-gesture recognition system has been designed and implemented, which is composed of three stages: detection, tracking and recognition. This system is based on machine learning methods and pattern recognition techniques, which have been integrated together with other image processing approaches to get a high recognition accuracy and a low computational cost. Regarding pattern recongition techniques, several algorithms and strategies have been designed and implemented, which are applicable to color images and video sequences. The design of these algorithms has the purpose of extracting spatial and spatio-temporal features from static and dynamic hand gestures, in order to identify them in a robust and reliable way. Finally, a visual database containing the necessary vocabulary of gestures for interacting with the computer has been created.
Resumo:
Geographic Information Systems are developed to handle enormous volumes of data and are equipped with numerous functionalities intended to capture, store, edit, organise, process and analyse or represent the geographically referenced information. On the other hand, industrial simulators for driver training are real-time applications that require a virtual environment, either geospecific, geogeneric or a combination of the two, over which the simulation programs will be run. In the final instance, this environment constitutes a geographic location with its specific characteristics of geometry, appearance, functionality, topography, etc. The set of elements that enables the virtual simulation environment to be created and in which the simulator user can move, is usually called the Visual Database (VDB). The main idea behind the work being developed approaches a topic that is of major interest in the field of industrial training simulators, which is the problem of analysing, structuring and describing the virtual environments to be used in large driving simulators. This paper sets out a methodology that uses the capabilities and benefits of Geographic Information Systems for organising, optimising and managing the visual Database of the simulator and for generally enhancing the quality and performance of the simulator.
Resumo:
Los Sistemas de Información Geográfica están desarrollados para gestionar grandes volúmenes de datos, y disponen de numerosas funcionalidades orientadas a la captura, almacenamiento, edición, organización, procesado, análisis, o a la representación de información geográficamente referenciada. Por otro lado, los simuladores industriales para entrenamiento en tareas de conducción son aplicaciones en tiempo real que necesitan de un entorno virtual, ya sea geoespecífico, geogenérico, o combinación de ambos tipos, sobre el cual se ejecutarán los programas propios de la simulación. Este entorno, en última instancia, constituye un lugar geográfico, con sus características específicas geométricas, de aspecto, funcionales, topológicas, etc. Al conjunto de elementos que permiten la creación del entorno virtual de simulación dentro del cual se puede mover el usuario del simulador se denomina habitualmente Base de Datos del Visual (BDV). La idea principal del trabajo que se desarrolla aborda un tema del máximo interés en el campo de los simuladores industriales de formación, como es el problema que presenta el análisis, la estructuración, y la descripción de los entornos virtuales a emplear en los grandes simuladores de conducción. En este artículo se propone una metodología de trabajo en la que se aprovechan las capacidades y ventajas de los Sistemas de Información Geográfica para organizar, optimizar y gestionar la base de datos visual del simulador, y para mejorar la calidad y el rendimiento del simulador en general. ABSTRACT Geographic Information Systems are developed to handle enormous volumes of data and are equipped with numerous functionalities intended to capture, store, edit, organise, process and analyse or represent the geographically referenced information. On the other hand, industrial simulators for driver training are real-time applications that require a virtual environment, either geospecific, geogeneric or a combination of the two, over which the simulation programs will be run. In the final instance, this environment constitutes a geographic location with its specific characteristics of geometry, appearance, functionality, topography, etc. The set of elements that enables the virtual simulation environment to be created and in which the simulator user can move, is usually called the Visual Database (VDB). The main idea behind the work being developed approaches a topic that is of major interest in the field of industrial training simulators, which is the problem of analysing, structuring and describing the virtual environments to be used in large driving simulators. This paper sets out a methodology that uses the capabilities and benefits of Geographic Information Systems for organising, optimising and managing the visual Database of the simulator and for generally enhancing the quality and performance of the simulator.
Resumo:
Detecting user affect automatically during real-time conversation is the main challenge towards our greater aim of infusing social intelligence into a natural-language mixed-initiative High-Fidelity (Hi-Fi) audio control spoken dialog agent. In recent years, studies on affect detection from voice have moved on to using realistic, non-acted data, which is subtler. However, it is more challenging to perceive subtler emotions and this is demonstrated in tasks such as labelling and machine prediction. This paper attempts to address part of this challenge by considering the role of user satisfaction ratings and also conversational/dialog features in discriminating contentment and frustration, two types of emotions that are known to be prevalent within spoken human-computer interaction. However, given the laboratory constraints, users might be positively biased when rating the system, indirectly making the reliability of the satisfaction data questionable. Machine learning experiments were conducted on two datasets, users and annotators, which were then compared in order to assess the reliability of these datasets. Our results indicated that standard classifiers were significantly more successful in discriminating the abovementioned emotions and their intensities (reflected by user satisfaction ratings) from annotator data than from user data. These results corroborated that: first, satisfaction data could be used directly as an alternative target variable to model affect, and that they could be predicted exclusively by dialog features. Second, these were only true when trying to predict the abovementioned emotions using annotator?s data, suggesting that user bias does exist in a laboratory-led evaluation.
Resumo:
This paper describes a novel approach to phonotactic LID, where instead of using soft-counts based on phoneme lattices, we use posteriogram to obtain n-gram counts. The high-dimensional vectors of counts are reduced to low-dimensional units for which we adapted the commonly used term i-vectors. The reduction is based on multinomial subspace modeling and is designed to work in the total-variability space. The proposed technique was tested on the NIST 2009 LRE set with better results to a system based on using soft-counts (Cavg on 30s: 3.15% vs 3.43%), and with very good results when fused with an acoustic i-vector LID system (Cavg on 30s acoustic 2.4% vs 1.25%). The proposed technique is also compared with another low dimensional projection system based on PCA. In comparison with the original soft-counts, the proposed technique provides better results, reduces the problems due to sparse counts, and avoids the process of using pruning techniques when creating the lattices.
Resumo:
In order to obtain more human like sounding humanmachine interfaces we must first be able to give them expressive capabilities in the way of emotional and stylistic features so as to closely adequate them to the intended task. If we want to replicate those features it is not enough to merely replicate the prosodic information of fundamental frequency and speaking rhythm. The proposed additional layer is the modification of the glottal model, for which we make use of the GlottHMM parameters. This paper analyzes the viability of such an approach by verifying that the expressive nuances are captured by the aforementioned features, obtaining 95% recognition rates on styled speaking and 82% on emotional speech. Then we evaluate the effect of speaker bias and recording environment on the source modeling in order to quantify possible problems when analyzing multi-speaker databases. Finally we propose a speaking styles separation for Spanish based on prosodic features and check its perceptual significance.
Resumo:
This paper presents a description of our system for the Albayzin 2012 LRE competition. One of the main characteristics of this evaluation was the reduced number of available files for training the system, especially for the empty condition where no training data set was provided but only a development set. In addition, the whole database was created from online videos and around one third of the training data was labeled as noisy files. Our primary system was the fusion of three different i-vector based systems: one acoustic system based on MFCCs, a phonotactic system using trigrams of phone-posteriorgram counts, and another acoustic system based on RPLPs that improved robustness against noise. A contrastive system that included new features based on the glottal source was also presented. Official and postevaluation results for all the conditions using the proposed metrics for the evaluation and the Cavg metric are presented in the paper.
Resumo:
When designing human-machine interfaces it is important to consider not only the bare bones functionality but also the ease of use and accessibility it provides. When talking about voice-based inter- faces, it has been proven that imbuing expressiveness into the synthetic voices increases signi?cantly its perceived naturalness, which in the end is very helpful when building user friendly interfaces. This paper proposes an adaptation based expressiveness transplantation system capable of copying the emotions of a source speaker into any desired target speaker with just a few minutes of read speech and without requiring the record- ing of additional expressive data. This system was evaluated through a perceptual test for 3 speakers showing up to an average of 52% emotion recognition rates relative to the natural voice recognition rates, while at the same time keeping good scores in similarity and naturality.
Resumo:
One of the main challenges for intelligent vehicles is the capability of detecting other vehicles in their environment, which constitute the main source of accidents. Specifically, many methods have been proposed in the literature for video-based vehicle detection. Most of them perform supervised classification using some appearance-related feature, in particular, symmetry has been extensively utilized. However, an in-depth analysis of the classification power of this feature is missing. As a first contribution of this paper, a thorough study of the classification performance of symmetry is presented within a Bayesian decision framework. This study reveals that the performance of symmetry-based classification is very limited. Therefore, as a second contribution, a new gradient-based descriptor is proposed for vehicle detection. This descriptor exploits the known rectangular structure of vehicle rears within a Histogram of Gradients (HOG)-based framework. Experiments show that the proposed descriptor outperforms largely symmetry as a feature for vehicle verification, achieving classification rates over 90%.
Resumo:
Durante el proceso de producción de voz, los factores anatómicos, fisiológicos o psicosociales del individuo modifican los órganos resonadores, imprimiendo en la voz características particulares. Los sistemas ASR tratan de encontrar los matices característicos de una voz y asociarlos a un individuo o grupo. La edad y sexo de un hablante son factores intrínsecos que están presentes en la voz. Este trabajo intenta diferenciar esas características, aislarlas y usarlas para detectar el género y la edad de un hablante. Para dicho fin, se ha realizado el estudio y análisis de las características basadas en el pulso glótico y el tracto vocal, evitando usar técnicas clásicas (como pitch y sus derivados) debido a las restricciones propias de dichas técnicas. Los resultados finales de nuestro estudio alcanzan casi un 100% en reconocimiento de género mientras en la tarea de reconocimiento de edad el reconocimiento se encuentra alrededor del 80%. Parece ser que la voz queda afectada por el género del hablante y las hormonas, aunque no se aprecie en la audición. ABSTRACT Particular elements of the voice are printed during the speech production process and are related to anatomical and physiological factors of the phonatory system or psychosocial factors acquired by the speaker. ASR systems attempt to find those peculiar nuances of a voice and associate them to an individual or a group. Age and gender are inherent factors to the speaker which may be represented in voice. This work attempts to differentiate those characteristics, isolate them and use them to detect speaker’s gender and age. Features based on glottal pulse and vocal tract are studied and analyzed in order to achieve good results in both tasks. Classical methodologies (such as pitch and derivates) are avoided since the requirements of those techniques may be too restrictive. The final scores achieve almost 100% in gender recognition whereas in age recognition those scores are around 80%. Factors related to the gender and hormones seem to affect the voice although they are not audible.
Resumo:
This paper presents new techniques with relevant improvements added to the primary system presented by our group to the Albayzin 2012 LRE competition, where the use of any additional corpora for training or optimizing the models was forbidden. In this work, we present the incorporation of an additional phonotactic subsystem based on the use of phone log-likelihood ratio features (PLLR) extracted from different phonotactic recognizers that contributes to improve the accuracy of the system in a 21.4% in terms of Cavg (we also present results for the official metric during the evaluation, Fact). We will present how using these features at the phone state level provides significant improvements, when used together with dimensionality reduction techniques, especially PCA. We have also experimented with applying alternative SDC-like configurations on these PLLR features with additional improvements. Also, we will describe some modifications to the MFCC-based acoustic i-vector system which have also contributed to additional improvements. The final fused system outperformed the baseline in 27.4% in Cavg.
Resumo:
Autonomous landing is a challenging and important technology for both military and civilian applications of Unmanned Aerial Vehicles (UAVs). In this paper, we present a novel online adaptive visual tracking algorithm for UAVs to land on an arbitrary field (that can be used as the helipad) autonomously at real-time frame rates of more than twenty frames per second. The integration of low-dimensional subspace representation method, online incremental learning approach and hierarchical tracking strategy allows the autolanding task to overcome the problems generated by the challenging situations such as significant appearance change, variant surrounding illumination, partial helipad occlusion, rapid pose variation, onboard mechanical vibration (no video stabilization), low computational capacity and delayed information communication between UAV and Ground Control Station (GCS). The tracking performance of this presented algorithm is evaluated with aerial images from real autolanding flights using manually- labelled ground truth database. The evaluation results show that this new algorithm is highly robust to track the helipad and accurate enough for closing the vision-based control loop.
Resumo:
Autonomous landing is a challenging and important technology for both military and civilian applications of Unmanned Aerial Vehicles (UAVs). In this paper, we present a novel online adaptive visual tracking algorithm for UAVs to land on an arbitrary field (that can be used as the helipad) autonomously at real-time frame rates of more than twenty frames per second. The integration of low-dimensional subspace representation method, online incremental learning approach and hierarchical tracking strategy allows the autolanding task to overcome the problems generated by the challenging situations such as significant appearance change, variant surrounding illumination, partial helipad occlusion, rapid pose variation, onboard mechanical vibration (no video stabilization), low computational capacity and delayed information communication between UAV and Ground Control Station (GCS). The tracking performance of this presented algorithm is evaluated with aerial images from real autolanding flights using manually- labelled ground truth database. The evaluation results show that this new algorithm is highly robust to track the helipad and accurate enough for closing the vision-based control loop.
Resumo:
This paper presents a novel robust visual tracking framework, based on discriminative method, for Unmanned Aerial Vehicles (UAVs) to track an arbitrary 2D/3D target at real-time frame rates, that is called the Adaptive Multi-Classifier Multi-Resolution (AMCMR) framework. In this framework, adaptive Multiple Classifiers (MC) are updated in the (k-1)th frame-based Multiple Resolutions (MR) structure with compressed positive and negative samples, and then applied them in the kth frame-based Multiple Resolutions (MR) structure to detect the current target. The sample importance has been integrated into this framework to improve the tracking stability and accuracy. The performance of this framework was evaluated with the Ground Truth (GT) in different types of public image databases and real flight-based aerial image datasets firstly, then the framework has been applied in the UAV to inspect the Offshore Floating Platform (OFP). The evaluation and application results show that this framework is more robust, efficient and accurate against the existing state-of-art trackers, overcoming the problems generated by the challenging situations such as obvious appearance change, variant illumination, partial/full target occlusion, blur motion, rapid pose variation and onboard mechanical vibration, among others. To our best knowledge, this is the first work to present this framework for solving the online learning and tracking freewill 2D/3D target problems, and applied it in the UAVs.