919 results for "Regressão da audição"
Abstract:
This paper presents a low-bandwidth multi-robot communication system designed to serve as a backup communication channel in the event that a robot suffers a network device fault. While much research has been performed in the area of distributing network communication across multiple robots within a system, individual robots are still susceptible to hardware failure. In the past, such robots would simply be removed from service, and their tasks re-allocated to other members. However, there are times when a faulty robot might be crucial to a mission, or be able to contribute in a less communication-intensive area. By allowing robots to encode and decode messages into unique sequences of DTMF symbols, called words, our system is able to facilitate continued low-bandwidth communication between robots without access to network communication. Our results have shown that the system is capable of permitting robots to negotiate task initiation and termination, and is flexible enough to permit a pair of robots to perform a simple turn-taking task.
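The encoding idea described in this abstract can be illustrated with a minimal sketch. The DTMF keypad frequencies are the standard ones; the message names and the words assigned to them (`WORDS`) are hypothetical and chosen only for illustration, since the abstract does not give the actual vocabulary.

```python
# Standard DTMF keypad: each symbol is a (low, high) frequency pair in Hz.
ROWS = (697, 770, 852, 941)
COLS = (1209, 1336, 1477, 1633)
KEYS = ("123A", "456B", "789C", "*0#D")

DTMF_FREQS = {
    key: (ROWS[r], COLS[c])
    for r, row in enumerate(KEYS)
    for c, key in enumerate(row)
}
SYMBOL_FOR = {freqs: key for key, freqs in DTMF_FREQS.items()}

# Hypothetical vocabulary (an assumption, not taken from the paper):
# each message maps to a unique sequence of DTMF symbols, i.e. a "word".
WORDS = {
    "TASK_START": "A1#",
    "TASK_STOP": "A2#",
    "YOUR_TURN": "B1#",
    "ACK": "D0#",
}
MESSAGE_FOR = {word: message for message, word in WORDS.items()}


def encode(message):
    """Return the (low, high) tone pairs that spell the word for a message."""
    return [DTMF_FREQS[symbol] for symbol in WORDS[message]]


def decode(tone_pairs):
    """Map detected tone pairs back to a message, or None if unrecognised."""
    try:
        word = "".join(SYMBOL_FOR[pair] for pair in tone_pairs)
    except KeyError:
        return None  # at least one pair is not a valid DTMF symbol
    return MESSAGE_FOR.get(word)
```

A receiving robot would call `decode` on the tone pairs reported by its tone detector; generating and detecting the audio itself is outside the scope of this sketch.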
Abstract:
It is impracticable to upgrade Australia's 18,900 passive crossings, as such crossings are often located in remote areas where power is lacking and road and rail traffic is low. The rail industry is interested in developing innovative in-vehicle technology interventions to warn motorists of approaching trains directly in their vehicles. The objective of this study was therefore to evaluate the benefits of introducing such technology. We evaluated the changes in driver performance once the technology is enabled and functioning correctly, as well as the effects of an unsafe failure of the technology. We conducted a driving simulator study in which participants (N=15) were familiarised with an in-vehicle audio warning for an extended period. After participants had been familiarised with the system, the technology started failing, and we tested the reaction of drivers with a train approaching. This study has shown that at traditional passive crossings with RX2 signage, the majority of drivers (70%) complied and looked for trains on both sides of the rail track. With the introduction of the in-vehicle audio message, drivers did not approach crossings faster, did not reduce their safety margins and did not reduce their gaze towards the rail tracks. However, participants’ compliance at the stop sign decreased by 16.5% with the technology installed in the vehicle. The effect of the failure of the in-vehicle audio warning technology showed that most participants did not experience difficulties in detecting the approaching train even though they did not receive any warning message. This showed that participants were still actively looking for trains with the system in their vehicle. However, two participants did not stop and one decided to beat the train when they did not receive the audio message, suggesting potential human factors issues to be considered with such technology.
Abstract:
We propose a novel technique for conducting robust voice activity detection (VAD) in high-noise recordings. We use Gaussian mixture modeling (GMM) to train two generic models: speech and non-speech. We then score smaller segments of a given (unseen) recording against each of these GMMs to obtain two respective likelihood scores for each segment. These scores are used to compute a dissimilarity measure between pairs of segments and to carry out complete-linkage clustering of the segments into speech and non-speech clusters. We compare the accuracy of our method against state-of-the-art and standardised VAD techniques to demonstrate an absolute improvement of 15% in half-total error rate (HTER) over the best performing baseline system and across the QUT-NOISE-TIMIT database. We then apply our approach to the Audio-Visual Database of American English (AVDBAE) to demonstrate the performance of our algorithm in using visual, audio-visual or a proposed fusion of these features.
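The clustering stage described in this abstract can be sketched as follows. The abstract does not define the dissimilarity measure, so this sketch assumes the absolute difference between the segments' log-likelihood ratios; the GMM scoring itself is not shown, and the per-segment log-likelihoods are taken as given inputs. HTER is computed as the mean of the miss rate and the false-alarm rate. All function names are hypothetical.

```python
def complete_linkage_two_clusters(scores):
    """Agglomerative complete-linkage clustering of 1-D scores into two clusters.

    Returns a list of (at most) two lists of segment indices.
    """
    clusters = [[i] for i in range(len(scores))]

    def distance(a, b):
        # Complete linkage: cluster distance is the LARGEST pairwise dissimilarity.
        return max(abs(scores[i] - scores[j]) for i in a for j in b)

    while len(clusters) > 2:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return clusters


def label_segments(speech_ll, nonspeech_ll):
    """Return True (speech) or False (non-speech) for each segment."""
    # Assumed dissimilarity basis: log-likelihood ratio per segment.
    llr = [s - n for s, n in zip(speech_ll, nonspeech_ll)]
    clusters = complete_linkage_two_clusters(llr)
    # The cluster with the higher mean ratio is taken to be speech.
    speech_cluster = max(clusters, key=lambda c: sum(llr[i] for i in c) / len(c))
    return [i in speech_cluster for i in range(len(llr))]


def hter(predicted, reference):
    """Half-total error rate; assumes the reference contains both classes."""
    on_speech = [p for p, r in zip(predicted, reference) if r]
    on_nonspeech = [p for p, r in zip(predicted, reference) if not r]
    miss_rate = on_speech.count(False) / len(on_speech)
    false_alarm_rate = on_nonspeech.count(True) / len(on_nonspeech)
    return 0.5 * (miss_rate + false_alarm_rate)
```

The pairwise search is quadratic per merge, which is acceptable for a toy example; a real system would likely use an optimised clustering library.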
Abstract:
Visual information in the form of lip movements of the speaker has been shown to improve the performance of speech recognition and search applications. In our previous work, we proposed cross database training of synchronous hidden Markov models (SHMMs) to make use of external large and publicly available audio databases in addition to the relatively small given audio visual database. In this work, the cross database training approach is improved by performing an additional audio adaptation step, which enables audio visual SHMMs to benefit from audio observations of the external audio models before adding visual modality to them. The proposed approach outperforms the baseline cross database training approach in clean and noisy environments in terms of phone recognition accuracy as well as spoken term detection (STD) accuracy.
Abstract:
Speech recognition can be improved by using visual information in the form of lip movements of the speaker in addition to audio information. To date, state-of-the-art techniques for audio-visual speech recognition continue to use audio and visual data of the same database for training their models. In this paper, we present a new approach to make use of one modality of an external dataset in addition to a given audio-visual dataset. By so doing, it is possible to create more powerful models from other extensive audio-only databases and adapt them on our comparatively smaller multi-stream databases. Results show that the presented approach outperforms the widely adopted synchronous hidden Markov models (HMM) trained jointly on audio and visual data of a given audio-visual database for phone recognition by 29% relative. It also outperforms the external audio models trained on extensive external audio datasets and also internal audio models by 5.5% and 46% relative respectively. We also show that the proposed approach is beneficial in noisy environments where the audio source is affected by the environmental noise.
Abstract:
Automated digital recordings are useful for large-scale temporal and spatial environmental monitoring. An important research effort has been the automated classification of calling bird species. In this paper we examine a related task: retrieving, from a database of audio recordings, birdcalls that are similar to a user-supplied query call. Such a retrieval task can sometimes be more useful than an automated classifier. We compare three approaches to similarity-based birdcall retrieval using spectral ridge features and two kinds of gradient features, structure tensor and the histogram of oriented gradients. The retrieval accuracy of our spectral ridge method is 94% compared to 82% for the structure tensor method and 90% for the histogram of gradients method. Additionally, this approach potentially offers a more compact representation and is more computationally efficient.
Abstract:
Acoustic recordings play an increasingly important role in monitoring terrestrial and aquatic environments. However, rapid advances in technology make it possible to accumulate thousands of hours of recordings, more than ecologists can ever listen to. Our approach to this big-data challenge is to visualize the content of long-duration audio recordings on multiple scales, from minutes, hours, days to years. The visualization should facilitate navigation and yield ecologically meaningful information prior to listening to the audio. To construct images, we calculate acoustic indices, statistics that describe the distribution of acoustic energy and reflect content of ecological interest. We combine various indices to produce false-color spectrogram images that reveal acoustic content and facilitate navigation. The technical challenge we investigate in this work is how to navigate recordings that are days or even months in duration. We introduce a method of zooming through multiple temporal scales, analogous to Google Maps. However, the “landscape” to be navigated is not geographical and not therefore intrinsically visual, but rather a graphical representation of the underlying audio. We describe solutions to navigating spectrograms that range over three orders of magnitude of temporal scale. We make three sets of observations: 1. We determine that at least ten intermediate scale steps are required to zoom over three orders of magnitude of temporal scale; 2. We determine that three different visual representations are required to cover the range of temporal scales; 3. We present a solution to the problem of maintaining visual continuity when stepping between different visual representations. Finally, we demonstrate the utility of the approach with four case studies.
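To make the "ten intermediate steps over three orders of magnitude" observation concrete, here is a small sketch that generates a ladder of temporal scales. It assumes geometric (constant-ratio) spacing between scales, and the endpoint values of 60 and 0.06 seconds per pixel are hypothetical; neither detail is stated in the abstract.

```python
def zoom_scales(coarsest, finest, intermediate_steps):
    """Return temporal scales from coarsest to finest with equal zoom ratios.

    `coarsest` and `finest` are in seconds per pixel column (assumed unit).
    The result has `intermediate_steps + 2` entries, endpoints included.
    """
    transitions = intermediate_steps + 1
    ratio = (finest / coarsest) ** (1.0 / transitions)
    return [coarsest * ratio ** k for k in range(transitions + 1)]


# Hypothetical endpoints spanning three orders of magnitude.
scales = zoom_scales(coarsest=60.0, finest=0.06, intermediate_steps=10)
zoom_factor = scales[0] / scales[1]  # magnification applied at every step
```

With ten intermediate steps, each transition magnifies by a little under 2x, so adjacent views still overlap enough for a viewer to keep their place.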
Abstract:
The aim of this paper is to present results of research investigating the effectiveness of audio feedback in a third year undergraduate unit. While there is a large and growing body of literature about providing assessment feedback, there is little focussing on the use of audio media. This study employs a mixed method approach, involving semi-structured interviews with academic staff and a survey of students. Analysis of the interview data suggests that there are a number of issues surrounding acceptance of using audio feedback by lecturers. The next stage of the study is to examine the extent to which lecturers change their perceptions as they use audio feedback and to analyse the perceptions of the students (n=120), including the perceived importance of feedback, the ways in which they used the audio feedback and the extent to which they believe they control events that affect them. Ultimately, this study seeks to provide recommendations appropriate to the implementation of audio feedback in higher education.
Abstract:
Providing audio feedback on assessment is relatively uncommon in higher education. However, published research suggests that students prefer it over written feedback, while lecturers are less convinced. The aim of this paper is to examine these findings further in the context of a third-year business ethics unit. Data was collected from two sources. The first is a series of in-depth, semi-structured interviews conducted with three lecturers providing audio feedback for the first time in Semester One 2011. The second source of data was drawn from the university student evaluation system. A total of 363 responses were used, providing 'before' and 'after' perspectives about the effectiveness of audio feedback versus written feedback. Between 2005 and 2009 the survey data provided information about student attitudes to written assessment feedback (n=261). From 2010 onwards the data relates to audio (mp3) feedback (n=102). The analysis of the interview data indicated that introducing audio feedback should be done with care. The perception of the participating lecturers was mixed, ranging from scepticism to outright enthusiasm, but over time the overall approach became positive. It was found that particular attention needs to be paid to small (but important) technical details, and lecturers need to be convinced of its effectiveness, especially that it is not necessarily more time consuming than providing written feedback. For students, the analysis revealed a clear preference for audio feedback. It is concluded that there is cause for concern and reason for optimism. It is a cause for concern because scepticism on the part of academic staff seems to be based on assumptions about what students prefer and a concern about using the technology. There is reason for optimism because the evidence points towards students preferring audio feedback and, as academic staff become more familiar with the technology, the scepticism tends to evaporate.
While this study is limited in scope, questions are raised about tackling negative staff perceptions of audio feedback that are worthy of further research.
Abstract:
This research investigates techniques to analyse long duration acoustic recordings to help ecologists monitor birdcall activities. It designs a generalized algorithm to identify a broad range of bird species. It allows ecologists to search for arbitrary birdcalls of interest, rather than restricting them to just a very limited number of species on which the recogniser is trained. The algorithm can help ecologists find sounds of interest more efficiently by filtering out large volumes of unwanted sounds and only focusing on birdcalls.
Abstract:
In recent years, many of the world’s leading media producers, screenwriters, technicians and investors, particularly those in the Asia-Pacific region, have been drawn to work in the People's Republic of China (hereafter China or Mainland China). Media projects with a lighter commercial entertainment feel – compared with the heavy propaganda-oriented content of the past – have multiplied, thanks to the Chinese state’s newfound willingness to consider collaboration with foreign partners. This is nowhere more evident than in film. Despite their long-standing reputation for rigorous censorship, state policymakers are now encouraging Chinese media entrepreneurs to generate fresh ideas and to develop products that will revitalise the stagnant domestic production sector. It is hoped that an increase in both the quality and quantity of domestic feature films, stimulated by an infusion of creativity and cutting-edge technology from outside the country, will help reverse China’s ‘cultural trade deficit’ (wenhua maoyi chizi) (Keane 2007).
Abstract:
Communication applications are usually delay-restricted, especially in the case of musicians playing together over the Internet. This application requires a one-way delay of at most 25 msec, and high audio quality is desired at feasible bit rates. The ultra low delay (ULD) audio coding structure is well suited to this application, and we further investigate the application of multistage vector quantization (MSVQ) to reach bit rates below 64 Kb/s in a scalable manner. Results at 32 Kb/s and 64 Kb/s show that the trained codebook MSVQ performs best, better than KLT normalization followed by a simulated Gaussian MSVQ or simulated Gaussian MSVQ alone. The results also show that there is only a weak dependence on the training data, and that we indeed converge to the perceptual quality of our previous ULD coder at 64 Kb/s.
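As a generic illustration of multistage vector quantization (not the coder evaluated in this abstract), the sketch below quantizes a vector in stages, with each stage coding the residual left by the previous one. The two tiny codebooks are hand-made for the example rather than trained, and all names are hypothetical.

```python
def nearest(codebook, vector):
    """Index of the codeword with the smallest squared error to `vector`."""
    def squared_error(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vector))
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))


def msvq_encode(vector, codebooks):
    """Return one codeword index per stage; each stage codes the residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def msvq_decode(indices, codebooks):
    """Sum the selected codewords; fewer indices give a coarser result."""
    output = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        output = [o + c for o, c in zip(output, codebook[i])]
    return output


# Hand-made toy codebooks (an assumption): a coarse stage and a refinement stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.25], [-0.25, 0.25], [0.25, -0.25]],
]

indices = msvq_encode([1.2, 0.8], CODEBOOKS)
full = msvq_decode(indices, CODEBOOKS)        # both stages
coarse = msvq_decode(indices[:1], CODEBOOKS)  # first stage only
```

Decoding only a prefix of the indices shows why the structure is scalable: dropping later stages lowers the bit rate at the cost of a coarser reconstruction.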
Abstract:
We propose parametric stereo coding analysis and synthesis directly in the MDCT domain, using analysis-by-synthesis parameter estimation. The stereo signal is represented by an equalized sum signal and spatialization parameters. The equalized sum signal and the spatialization parameters are obtained by sub-band analysis in the MDCT domain. The de-correlated signal required for the stereo synthesis is also generated in the MDCT domain. A subjective evaluation test using MUSHRA shows that the synthesized stereo signal is perceptually satisfactory and comparable to state-of-the-art parametric coders.
Abstract:
Pre-whitening techniques are employed in blind correlation detection of additive spread spectrum watermarks in audio signals to reduce the host signal interference. A direct deterministic whitening (DDW) scheme is derived in this paper from the frequency domain analysis of the time domain correlation process. Our experimental studies reveal that Savitzky-Golay whitening (SGW), which is otherwise inferior to the DDW technique, performs better when the audio signal is predominantly lowpass. The novelty of this paper lies in exploiting the complementary nature of the two whitening techniques to obtain a hybrid whitening (HbW) scheme. In the hybrid scheme the DDW and SGW techniques are selectively applied, based on the short-time spectral characteristics of the audio signal. The hybrid scheme extends the reliability of watermark detection to a wider range of audio signals.