39 results for Automatic speech recognition (ASR)
Abstract:
This paper describes a novel approach to phonotactic LID where, instead of using soft counts based on phoneme lattices, we use posteriorgrams to obtain n-gram counts. The high-dimensional vectors of counts are reduced to low-dimensional units, for which we adopt the commonly used term i-vectors. The reduction is based on multinomial subspace modeling and is designed to work in the total-variability space. The proposed technique was tested on the NIST 2009 LRE set, with better results than a system based on soft counts (Cavg on 30s: 3.15% vs. 3.43%), and with very good results when fused with an acoustic i-vector LID system (Cavg on 30s: 2.4% for the acoustic system alone vs. 1.25% fused). The proposed technique is also compared with another low-dimensional projection system based on PCA. In comparison with the original soft counts, the proposed technique provides better results, reduces the problems due to sparse counts, and avoids the need for pruning techniques when creating the lattices.
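To make the counting step concrete, here is a minimal sketch of how expected (soft) bigram counts can be accumulated directly from a phone posteriorgram. It illustrates only the count extraction, not the multinomial subspace i-vector reduction; the data is synthetic and the function name is hypothetical:

```python
import numpy as np

def expected_bigram_counts(posteriorgram):
    """Accumulate expected bigram counts from a phone posteriorgram.

    posteriorgram: (T, P) array, each row a posterior distribution
    over P phone units at one frame.
    Returns the flattened (P*P,) vector of soft bigram counts.
    """
    T, P = posteriorgram.shape
    counts = np.zeros((P, P))
    for t in range(1, T):
        # Outer product: joint "soft" occurrence of unit i followed by unit j
        counts += np.outer(posteriorgram[t - 1], posteriorgram[t])
    return counts.ravel()

# Example: random posteriorgram for a 3-second utterance, 40 phone units
rng = np.random.default_rng(0)
post = rng.random((300, 40))
post /= post.sum(axis=1, keepdims=True)  # normalize each frame to sum to 1
print(expected_bigram_counts(post).shape)  # (1600,)
```

Each frame transition contributes the outer product of consecutive posterior vectors, which is why no lattice construction or pruning is needed.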
Abstract:
In order to obtain more human-sounding human-machine interfaces, we must first give them expressive capabilities in the form of emotional and stylistic features, so as to adapt them closely to the intended task. If we want to replicate those features, it is not enough to merely replicate the prosodic information of fundamental frequency and speaking rhythm. The additional layer we propose is the modification of the glottal model, for which we make use of the GlottHMM parameters. This paper analyzes the viability of such an approach by verifying that the expressive nuances are captured by the aforementioned features, obtaining 95% recognition rates on styled speech and 82% on emotional speech. We then evaluate the effect of speaker bias and recording environment on the source modeling, in order to quantify possible problems when analyzing multi-speaker databases. Finally, we propose a speaking-style separation for Spanish based on prosodic features and check its perceptual significance.
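As an illustration of the verification step described above (checking that expressive nuances are captured by the source features), a minimal sketch of a style classifier trained on per-utterance glottal feature vectors; the features and labels below are synthetic placeholders, not actual GlottHMM output:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: one glottal-source feature vector per utterance
# (e.g. statistics of glottal parameters) and one style label per utterance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))    # 200 utterances, 12 glottal features
y = rng.integers(0, 4, size=200)  # 4 speaking styles

# If style recognition on these features is well above chance, the
# expressive nuances are being captured by the parameterization.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)
print("style recognition accuracy: %.1f%%" % (100 * scores.mean()))
```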
Abstract:
This paper describes a low-complexity strategy for detecting and recognizing text signs automatically. Traditional approaches use heavy image-processing algorithms to detect the text sign, followed by the application of an Optical Character Recognition (OCR) algorithm in the previously identified areas. This paper proposes a new architecture that applies the OCR to the whole, lightly preprocessed image and then carries out the text detection process on the OCR output. The strategy presented in this paper significantly reduces the processing time required for text localization in an image, while guaranteeing a high recognition rate. This strategy will facilitate the incorporation of automatic text-sign detection into video processing-based applications on devices such as smartphones. These applications will increase the autonomy of visually impaired people in their daily lives.
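A minimal sketch of the OCR-first architecture, using pytesseract as a stand-in OCR engine; the confidence threshold and the grayscale preprocessing are illustrative assumptions, not the paper's exact pipeline:

```python
import pytesseract
from PIL import Image, ImageOps

def detect_text_signs(path, min_conf=60):
    """Run OCR on the whole image first, then detect text in the OCR output."""
    img = ImageOps.grayscale(Image.open(path))  # light preprocessing only
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    boxes = []
    for text, conf, x, y, w, h in zip(data["text"], data["conf"],
                                      data["left"], data["top"],
                                      data["width"], data["height"]):
        # Keeping only confident, non-empty words is the detection step here,
        # replacing a separate text-localization stage before the OCR.
        if text.strip() and float(conf) >= min_conf:
            boxes.append((text, (x, y, w, h)))
    return boxes
```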
Abstract:
This paper presents a description of our system for the Albayzin 2012 LRE competition. One of the main characteristics of this evaluation was the reduced number of files available for training the system, especially for the empty condition, where no training data set was provided, only a development set. In addition, the whole database was created from online videos, and around one third of the training data was labeled as noisy files. Our primary system was the fusion of three different i-vector based systems: an acoustic system based on MFCCs, a phonotactic system using trigrams of phone-posteriorgram counts, and another acoustic system based on RPLPs that improved robustness against noise. A contrastive system that included new features based on the glottal source was also presented. Official and post-evaluation results for all the conditions, using both the metrics proposed for the evaluation and the Cavg metric, are presented in the paper.
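A minimal sketch of score-level fusion of the three subsystems; logistic regression is a common fusion backend in LRE systems and is used here as an assumed stand-in, since the abstract does not specify the fusion method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder scores: rows are trials, columns are the three subsystems
# (MFCC i-vector, phone-posteriorgram trigram, RPLP i-vector).
rng = np.random.default_rng(0)
dev_scores = rng.normal(size=(500, 3))
dev_labels = rng.integers(0, 2, size=500)  # target / non-target per trial

fuser = LogisticRegression()
fuser.fit(dev_scores, dev_labels)            # learn fusion weights on dev data

test_scores = rng.normal(size=(10, 3))
fused = fuser.decision_function(test_scores)  # one fused score per trial
print(fused)
```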
Abstract:
When designing human-machine interfaces, it is important to consider not only the bare-bones functionality but also the ease of use and accessibility they provide. For voice-based interfaces, it has been proven that imbuing synthetic voices with expressiveness significantly increases their perceived naturalness, which in the end is very helpful when building user-friendly interfaces. This paper proposes an adaptation-based expressiveness transplantation system capable of copying the emotions of a source speaker onto any desired target speaker with just a few minutes of read speech and without requiring the recording of additional expressive data. This system was evaluated through a perceptual test for 3 speakers, showing up to an average of 52% emotion recognition rate relative to the natural-voice recognition rates, while at the same time keeping good scores in similarity and naturalness.
Abstract:
The aim of automatic pathological voice detection systems is to serve as tools for medical specialists, enabling a more objective, less invasive and improved diagnosis of diseases. In this respect, the gold standard for such systems includes the usage of an optimized representation of the spectral envelope, based on cepstral coefficients either from the mel-scaled Fourier spectral envelope (Mel-Frequency Cepstral Coefficients) or from an all-pole estimation (Linear Prediction Coding Cepstral Coefficients) for characterization, and Gaussian Mixture Models for posterior classification. However, recently proposed GMM-based classifiers, as well as nuisance mitigation techniques such as those employed in speaker recognition, have not been widely considered in pathology detection tasks. The present work aims at testing whether or not the employment of such speaker recognition tools might contribute to improving performance in pathology detection systems, specifically in the automatic detection of Obstructive Sleep Apnea. The testing procedure employs an Obstructive Sleep Apnea database in conjunction with GMM-based classifiers. The results show that an improved performance can be obtained by using such an approach.
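A minimal sketch of the GMM-based classification scheme described above: one GMM per class trained on frame-level cepstral features, with a log-likelihood-ratio decision. The features here are synthetic placeholders standing in for real MFCCs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder frame-level cepstral features for the two classes.
rng = np.random.default_rng(0)
apnea_frames   = rng.normal(0.3, 1.0, size=(5000, 13))
control_frames = rng.normal(0.0, 1.0, size=(5000, 13))

# Train one GMM per class on the pooled training frames.
gmm_apnea = GaussianMixture(n_components=8, random_state=0).fit(apnea_frames)
gmm_ctrl  = GaussianMixture(n_components=8, random_state=0).fit(control_frames)

# Score a test speaker by the mean log-likelihood ratio over their frames.
test = rng.normal(0.3, 1.0, size=(300, 13))
llr = gmm_apnea.score(test) - gmm_ctrl.score(test)
print("detected apnea" if llr > 0 else "detected control")
```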
Abstract:
MFCC coefficients extracted from the power spectral density of speech as a whole seem to have become the de facto standard in the area of speaker recognition, as demonstrated by their use in almost all systems submitted to the 2013 Speaker Recognition Evaluation (SRE) in Mobile Environment [1], thus relegating this component of the recognition systems to the background. However, in this article we show that selecting an adequate speaker characterization system is as important as the selection of the classifier. To accomplish this, we compare the recognition rates achieved by different recognition systems that rely on the same classifier (GMM-UBM) but are connected to different feature extraction systems (based on both classical and biometric parameters). As a result, we show that a gender-dependent biometric parameterization with a simple recognition system based on the GMM-UBM paradigm provides very competitive or even better recognition rates when compared to more complex classification systems based on classical features.
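To illustrate the idea of swapping the front-end while keeping the GMM-UBM backend fixed, a hypothetical sketch; the "pitch_energy" branch is a crude stand-in for biometric-style parameters, which the abstract does not specify:

```python
import numpy as np
import librosa

def extract_features(path, kind="mfcc"):
    """Swap the feature front-end while the GMM-UBM classifier stays fixed.

    'mfcc' is the classical parameterization; 'pitch_energy' is an
    illustrative placeholder for biometric parameters (assumption).
    """
    y, sr = librosa.load(path, sr=16000)
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    elif kind == "pitch_energy":
        f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)  # frame-level pitch
        rms = librosa.feature.rms(y=y)[0]              # frame-level energy
        n = min(len(f0), len(rms))
        feats = np.vstack([f0[:n], rms[:n]])
    else:
        raise ValueError(kind)
    return feats.T  # frames x features, ready for GMM-UBM training
```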
Abstract:
This project is based on the technologies used for object detection and recognition, especially of leaves and chromosomes. The document contains the typical parts of a scientific paper: an Abstract, an Introduction, sections related to the investigated area, future work, conclusions, and the references used in its elaboration. The Abstract describes what the paper covers, namely the technologies employed in pattern detection and recognition for leaves and chromosomes, and the existing work on cataloguing these objects.

In the Introduction, the meanings of detection and recognition are explained. This is necessary because many papers confuse these terms, especially those dealing with chromosomes. Detecting an object means gathering the parts of the image that are useful and eliminating the useless parts; in short, detection amounts to recognizing the object's borders. Recognition, in turn, is the process by which the computer or machine determines what kind of object it is handling.

Afterwards we present a compilation of the technologies most used in object detection in general. There are two main groups in this category: those based on image derivatives and those based on ASIFT points. The methods based on image derivatives have in common that the image is processed by convolution with a previously created matrix. This is done to detect borders in the images, which are changes in pixel intensity. Within these technologies there are two groups: gradient-based methods, which search for maxima and minima of pixel intensity, since they only use the first derivative; and Laplacian-based methods, which search for zeros of the second derivative. Depending on the level of detail wanted in the final result, one option or the other is chosen: gradient-based methods consume fewer resources and less time, as they involve fewer operations, but the quality is worse; Laplacian-based methods need more time and resources, as they require more operations, but give a much better quality result. After explaining the derivative-based methods, we review the different algorithms available for both groups.

The other big group of technologies for object recognition is based on ASIFT points, which rely on 6 image parameters and compare one image with another taking those parameters into account. The disadvantage of these methods, for our future purposes, is that they are only valid for one single object: if we are going to recognize two different leaves, even if they belong to the same species, we will not be able to recognize them with this method. It is nevertheless important to mention these technologies, as we are discussing recognition methods in general. At the end of the chapter there is a comparison of the pros and cons of all the technologies employed, first comparing them separately and then all together, based on our purposes.

The next chapter, on recognition techniques, is not very extensive because, even though there are general steps for object recognition, every object to be recognized requires its own method, as they are all different; hence no general method can be specified in that chapter.
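The gradient/Laplacian distinction above maps directly onto small convolution kernels; a minimal sketch on a synthetic grayscale image (the threshold value is illustrative):

```python
import numpy as np
from scipy.ndimage import convolve

# The two operator families as convolution matrices:
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])      # gradient-based (first derivative)
laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])    # Laplacian-based (second derivative)

img = np.random.rand(64, 64)          # placeholder grayscale image

# Gradient methods look for large magnitudes of intensity change.
gx = convolve(img, sobel_x)
gy = convolve(img, sobel_x.T)
grad_edges = np.hypot(gx, gy) > 0.5   # illustrative threshold

# Laplacian methods look for zero crossings of the second derivative.
lap = convolve(img, laplacian)
```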
We now move on to leaf detection techniques on computers, using the derivative-based technique explained above. The next step is to turn the leaf into several parameters; depending on the document consulted, there are more or fewer of them. Some papers recommend dividing the leaf into 3 main features (shape, dent and vein), from which mathematical operations yield up to 16 secondary features. Another proposal divides the leaf into 5 main features (diameter, physiological length, physiological width, area and perimeter), from which 12 secondary features are extracted. This second alternative is the most widely used, so it is taken as the reference here (a sketch of these measurements is given at the end of this summary). Moving on to leaf recognition, we rely on a paper that provides source code which, after clicking on both ends of the leaf, automatically reports the species of the leaf we are trying to recognize. To do so, it only requires a database. In the tests reported in that document, the authors claim 90.312% accuracy over 320 total tests (32 plants in the database and 10 tests per species).

The next chapter deals with chromosome detection, where we must pass from the metaphase plate, in which the chromosomes are disorganized, to the karyotype plate, the usual view of the 23 chromosomes ordered by number. There are two types of techniques for this step: the skeletonization process and sweeping angles (a skeletonization sketch is also given at the end of this summary). Skeletonization consists of suppressing the inside pixels of the chromosome so as to keep only its silhouette; this method is very similar to those based on image derivatives, but the difference is that it does not detect the borders but the interior of the chromosome. The second technique consists of sweeping angles from the beginning of the chromosome and, taking into account that a single chromosome cannot bend by more than a given angle X, detecting its various regions. Once the karyotype plate is defined, we continue with chromosome recognition. For this there is a technique based on the banding pattern (grey-scale bands) that makes each chromosome unique. The program detects the longitudinal axis of the chromosome and reconstructs the band profiles, after which the computer is able to recognize the chromosome.

Concerning future work, we generally have two independent techniques that do not combine detection and recognition, so our main focus would be to prepare a program that gathers both. On the leaf side we have seen that detection and recognition are linked, as both share the option of dividing the leaf into 5 main features. The work to be done is to create an algorithm linking both methods, since in the existing recognition program both leaf ends must be clicked, so it is not an automatic algorithm. On the chromosome side, we should create an algorithm that searches for the beginning of the chromosome and then starts sweeping angles, later passing the parameters to the program that searches for the band profiles.

Finally, the summary explains why this type of investigation is needed: with global warming, many species (both animals and plants) are beginning to go extinct, which is why a big database gathering all possible species is needed. To recognize an animal species, it suffices to have its 23 chromosomes.
While recognizing a plant, there are several ways of doing it, but the easiest way to input it into a computer is to scan a leaf of the plant.
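As an illustration of the 5 main leaf features discussed in this summary, a sketch computing approximations of them from a binary leaf mask with scikit-image; the region-property names are scikit-image's, the mapping of major/minor axis lengths onto physiological length/width is an approximation, and the ellipse mask is a placeholder:

```python
import numpy as np
from skimage import measure

def leaf_main_features(mask):
    """Approximate the 5 main leaf features from a binary mask (True = leaf)."""
    props = measure.regionprops(mask.astype(int))[0]
    return {
        "diameter": props.feret_diameter_max,  # longest point-to-point distance
        "physiological_length": props.major_axis_length,
        "physiological_width": props.minor_axis_length,
        "area": props.area,
        "perimeter": props.perimeter,
    }

# Placeholder "leaf": an ellipse-shaped binary mask
yy, xx = np.mgrid[:200, :300]
mask = ((xx - 150) / 120.0) ** 2 + ((yy - 100) / 60.0) ** 2 <= 1
print(leaf_main_features(mask))
```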
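And a sketch of the skeletonization step for chromosomes, read here as standard morphological skeletonization (medial-axis thinning), which is one common interpretation of the process described above; the binary blob is a placeholder for a segmented chromosome:

```python
import numpy as np
from skimage.morphology import skeletonize

# Suppress the interior pixels of a binary chromosome mask until only a
# one-pixel-wide skeleton remains.
mask = np.zeros((60, 60), dtype=bool)
mask[20:40, 10:50] = True        # placeholder blob standing in for a chromosome
skeleton = skeletonize(mask)     # True only along the medial axis
print(skeleton.sum(), "skeleton pixels")
```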
Abstract:
This paper presents new techniques, with relevant improvements, added to the primary system presented by our group at the Albayzin 2012 LRE competition, where the use of any additional corpora for training or optimizing the models was forbidden. In this work, we present the incorporation of an additional phonotactic subsystem based on phone log-likelihood ratio (PLLR) features extracted from different phonotactic recognizers, which improves the accuracy of the system by 21.4% in terms of Cavg (we also present results for Fact, the official metric during the evaluation). We show how using these features at the phone-state level provides significant improvements when used together with dimensionality reduction techniques, especially PCA. We have also experimented with applying alternative SDC-like configurations to these PLLR features, with additional improvements. We also describe some modifications to the MFCC-based acoustic i-vector system which have contributed further improvements. The final fused system outperformed the baseline by 27.4% in Cavg.
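The PLLR features mentioned above are commonly defined per frame and phone unit as log(p / (1 - p)) from the phone posteriors; a minimal sketch of their computation followed by PCA dimensionality reduction, with synthetic posteriors and illustrative dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA

def pllr(posteriors, eps=1e-10):
    """Phone log-likelihood ratio features from frame-level phone posteriors.

    posteriors: (T, P) array of phone (or phone-state) posteriors per frame.
    PLLR_i = log(p_i / (1 - p_i)), computed per frame and unit.
    """
    p = np.clip(posteriors, eps, 1 - eps)
    return np.log(p / (1 - p))

# Placeholder posteriors for one utterance: 300 frames, 120 phone states
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(120), size=300)

feats = pllr(post)
feats_reduced = PCA(n_components=40).fit_transform(feats)  # dimensionality reduction
print(feats_reduced.shape)  # (300, 40)
```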