939 results for Audio signal
Abstract:
Pitch Estimation, also known as Fundamental Frequency (F0) estimation, has been a popular research topic for many years and is still investigated today. The goal of Pitch Estimation is to find the pitch, or fundamental frequency, of a digital recording of speech or musical notes. It plays an important role because it is the key to identifying which notes are being played and at what time. Pitch Estimation of real instruments is a very hard task to address. Each instrument has its own physical characteristics, which are reflected in different spectral characteristics. Furthermore, recording conditions can vary from studio to studio and background noise must be considered.

This dissertation presents a novel approach to the problem of Pitch Estimation, using Cartesian Genetic Programming (CGP). We take advantage of evolutionary algorithms, in particular CGP, to explore and evolve complex mathematical functions that act as classifiers. These classifiers are used to identify the pitches of piano notes in an audio signal. To help us with the codification of the problem, we built a highly flexible CGP Toolbox, generic enough to encode different kinds of programs. The encoded evolutionary algorithm is the (1 + λ) evolutionary strategy, and the value of λ can be chosen. The toolbox is very simple to use. Settings such as the mutation probability and the number of runs and generations are configurable. The Cartesian representation of CGP can take multiple forms and is able to encode function parameters. It can handle different types of fitness functions, both minimization and maximization of f(x), and has a useful system of callbacks.

We trained 61 classifiers corresponding to 61 piano notes. A training set of audio signals was used for each classifier: half were signals with the same pitch as the classifier (true positive signals) and the other half were signals with different pitches (true negative signals). The F-measure was used as the fitness function. Signals with the same pitch as the classifier that are correctly identified count as true positives, while those that are not identified count as false negatives. Signals with a different pitch that are not identified count as true negatives, while those that are identified count as false positives.

Our first approach was to evolve classifiers for identifying artificial signals created by mathematical functions: sine, sawtooth and square waves. Our function set is composed essentially of filtering operations on vectors and arithmetic operations with constants and vectors. All the classifiers correctly identified the true positive signals and rejected the true negative signals. We then moved to real audio recordings. For testing the classifiers, we used audio signals different from the ones used during the training phase. As a first approach, the results obtained were very promising, but could be improved. We made slight changes to our approach and the number of false positives was reduced by 33% compared to the first approach. We then applied the evolved classifiers to polyphonic audio signals, and the results indicate that our approach is a good starting point for addressing the problem of Pitch Estimation.
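As an aside on the mechanics described above, a minimal Python sketch of a (1 + λ) loop driven by an F-measure fitness is given below. It is not the CGP Toolbox from the dissertation: the classifier representation and the mutation operator are placeholders passed in as callables, and only the selection loop and the true-positive/false-negative/true-negative/false-positive bookkeeping follow the description in the abstract.

```python
def f_measure(tp, fp, fn):
    """F-measure from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def fitness(classifier, positives, negatives):
    """Score a candidate classifier against signals of the target pitch
    (positives) and signals of other pitches (negatives)."""
    tp = sum(1 for s in positives if classifier(s))   # correctly accepted
    fn = len(positives) - tp                          # wrongly rejected
    fp = sum(1 for s in negatives if classifier(s))   # wrongly accepted
    return f_measure(tp, fp, fn)

def one_plus_lambda(parent, mutate, positives, negatives, lam=4, generations=100):
    """(1 + lambda) strategy: keep the best of the parent and its lam
    mutated offspring each generation (ties favour the offspring)."""
    best, best_fit = parent, fitness(parent, positives, negatives)
    for _ in range(generations):
        for child in (mutate(best) for _ in range(lam)):
            f = fitness(child, positives, negatives)
            if f >= best_fit:
                best, best_fit = child, f
    return best, best_fit
```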
Abstract:
Acoustically, car cabins are extremely noisy and, as a consequence, audio-only in-car voice recognition systems perform poorly. As the visual modality is immune to acoustic noise, using the visual lip information from the driver is seen as a viable strategy for circumventing this problem through audio-visual automatic speech recognition (AVASR). However, implementing AVASR requires a system able to accurately locate and track the driver's face and lip area in real time. In this paper we present such an approach using the Viola-Jones algorithm. Using the AVICAR [1] in-car database, we show that the Viola-Jones approach is a suitable method for locating and tracking the driver's lips for an audio-visual speech recognition system, despite the visual variability of illumination and head pose.
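As an illustration of the detection step named above (not the authors' implementation), OpenCV ships stock Haar-cascade models for the Viola-Jones detector; a minimal face-then-lip localisation sketch, approximating the lip area as the lower third of the detected face box, could look like this:

```python
import cv2

# Stock OpenCV Haar cascade for frontal faces (Viola-Jones detector).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def locate_lip_region(frame_bgr):
    """Return a rough lip bounding box (x, y, w, h) for the largest face
    found in the frame, or None if no face is detected."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Assume the largest detection is the driver's face.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    # Approximate the lip area as the lower third of the face box.
    return (x + w // 4, y + 2 * h // 3, w // 2, h // 3)
```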
Abstract:
Acoustically, car cabins are extremely noisy and, as a consequence, existing audio-only speech recognition systems for voice-based control of vehicle functions, such as the GPS-based navigator, perform poorly. Audio-only speech recognition systems fail to make use of the visual modality of speech (e.g., lip movements). As the visual modality is immune to acoustic noise, utilising this visual information in conjunction with an audio-only speech recognition system has the potential to improve the accuracy of the system. The field of recognising speech using both auditory and visual inputs is known as Audio-Visual Automatic Speech Recognition (AVASR). Continuous research in the AVASR field has been ongoing for the past twenty-five years, with notable progress being made. However, the practical deployment of AVASR systems for use in a variety of real-world applications has not yet emerged. The main reason is that most research to date has neglected to address variabilities in the visual domain, such as illumination and viewpoint, in the design of the visual front-end of the AVASR system. In this paper we present an AVASR system in a real-world car environment using the AVICAR database [1], a publicly available in-car database, and we show that using visual speech in conjunction with the audio modality is a better approach for improving the robustness and effectiveness of voice-only recognition systems in car cabin environments.
Abstract:
The cascading appearance-based (CAB) feature extraction technique has established itself as the state of the art in extracting dynamic visual speech features for speech recognition. In this paper, we will focus on investigating the effectiveness of this technique for the related speaker verification application. By investigating the speaker verification ability of each stage of the cascade, we will demonstrate that the same steps taken to reduce static speaker and environmental information for the visual speech recognition application also provide similar improvements for visual speaker recognition. A further study is conducted comparing synchronous HMM (SHMM) based fusion of CAB visual features and traditional perceptual linear predictive (PLP) acoustic features, showing that the higher complexity inherent in the SHMM approach does not appear to provide any improvement in the final audio-visual speaker verification system over simpler utterance-level score fusion.
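For context, the utterance-level score fusion mentioned as the simpler alternative is typically just a weighted combination of the per-modality verification scores. A hedged sketch follows; the weight and threshold values are illustrative, not taken from the paper.

```python
def fuse_scores(audio_score, visual_score, audio_weight=0.7):
    """Late (utterance-level) fusion: linear combination of the acoustic
    (e.g. PLP-based) and visual (e.g. CAB-based) verification scores.
    The weight is an illustrative value, tuned on development data in practice."""
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score

def verify(audio_score, visual_score, threshold=0.0):
    """Accept the claimed speaker identity if the fused score clears a
    decision threshold (illustrative value)."""
    return fuse_scores(audio_score, visual_score) >= threshold
```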
Abstract:
Signal Processing (SP) is a subject of central importance in engineering and the applied sciences. Signals are information-bearing functions, and SP deals with the analysis and processing of signals (by dedicated systems) to extract or modify information. Signal processing is necessary because signals normally contain information that is not readily usable or understandable, or which might be disturbed by unwanted sources such as noise. Although many signals are non-electrical, it is common to convert them into electrical signals for processing. Most natural signals (such as acoustic and biomedical signals) are continuous functions of time, and these signals are referred to as analog signals. Prior to the onset of digital computers, Analog Signal Processing (ASP) and analog systems were the only tools for dealing with analog signals. Although ASP and analog systems are still widely used, Digital Signal Processing (DSP) and digital systems are attracting more attention, due in large part to the significant advantages of digital systems over their analog counterparts. These advantages include superiority in performance, speed, reliability, efficiency of storage, size and cost. In addition, DSP can solve problems that cannot be solved using ASP, such as the spectral analysis of multicomponent signals, adaptive filtering, and operations at very low frequencies. Following the developments in engineering which occurred in the 1980s and 1990s, DSP became one of the world's fastest growing industries. Since that time DSP has not only impacted the traditional areas of electrical engineering, but has had far-reaching effects on other domains that deal with information, such as economics, meteorology, seismology, bioengineering, oceanology, communications, astronomy, radar engineering, control engineering and various other applications.

This book is based on the Lecture Notes of Associate Professor Zahir M. Hussain at RMIT University (Melbourne, 2001-2009), the research of Dr. Amin Z. Sadik (at QUT & RMIT, 2005-2008), and the notes of Professor Peter O'Shea at Queensland University of Technology.

Part I of the book addresses the representation of analog and digital signals and systems in the time domain and in the frequency domain. The core topics covered are convolution, transforms (Fourier, Laplace, Z, Discrete-Time Fourier, and Discrete Fourier), filters, and random signal analysis. There is also a treatment of some important applications of DSP, including signal detection in noise, radar range estimation, banking and financial applications, and audio effects production. The design and implementation of digital systems (such as integrators, differentiators, resonators and oscillators) are also considered, along with the design of conventional digital filters. Part I is suitable for an elementary course in DSP.

Part II (which is suitable for an advanced signal processing course) considers selected signal processing systems and techniques. Core topics covered are the Hilbert transformer, binary signal transmission, phase-locked loops, sigma-delta modulation, noise shaping, quantization, adaptive filters, and non-stationary signal analysis. Part III presents some selected advanced DSP topics.

We hope that this book will contribute to the advancement of engineering education and that it will serve as a general reference book on digital signal processing.
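As a small worked instance of one of the core Part I topics, the Discrete Fourier Transform used for spectral analysis of a multicomponent signal, the following NumPy sketch may help; the sampling rate and component frequencies are made up for illustration.

```python
import numpy as np

fs = 8000                                    # sampling rate in Hz (illustrative)
t = np.arange(0, 1.0, 1.0 / fs)              # one second of samples
# Two-component signal: 440 Hz and 1 kHz sinusoids.
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

X = np.fft.rfft(x)                           # DFT of a real-valued signal
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)  # frequency axis in Hz
magnitude = np.abs(X) / len(x)               # normalised magnitude spectrum

# The two largest spectral peaks recover the component frequencies.
peaks = freqs[np.argsort(magnitude)[-2:]]
print(sorted(peaks))                         # approximately [440.0, 1000.0]
```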
Abstract:
The use of visual features in the form of lip movements to improve the performance of acoustic speech recognition has been shown to work well, particularly in noisy acoustic conditions. However, it is not known whether this technique can outperform speech recognition incorporating well-known acoustic enhancement techniques, such as spectral subtraction or multi-channel beamforming. This is an important question to answer, especially in an automotive environment, for the design of an efficient human-vehicle computer interface. We perform a variety of speech recognition experiments on a challenging automotive speech dataset, and the results show that synchronous HMM-based audio-visual fusion can outperform traditional single-channel as well as multi-channel acoustic speech enhancement techniques. We also show that further improvement in recognition performance can be obtained by fusing speech-enhanced audio with the visual modality, demonstrating the complementary nature of the two robust speech recognition approaches.
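For reference, the spectral subtraction baseline mentioned above amounts to estimating a noise magnitude spectrum and subtracting it from each frame of the noisy speech spectrum. A minimal sketch follows; the frame length, hop size and noise-estimation heuristic are illustrative choices, not the settings used in the paper.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=10):
    """Very basic magnitude spectral subtraction.

    The noise spectrum is estimated from the first few frames (assumed to be
    speech-free), subtracted from each frame's magnitude spectrum, and the
    enhanced frames are overlap-added back into a time-domain signal.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)

    noise_mag = magnitude[:noise_frames].mean(axis=0)    # noise estimate
    clean_mag = np.maximum(magnitude - noise_mag, 0.0)   # subtract, floor at zero

    enhanced = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):                            # overlap-add resynthesis
        out[i * hop:i * hop + frame_len] += enhanced[i]
    return out
```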