192 results for speaker dependencies
Abstract:
In 1999 I convened Industrial Relations, the annual ADSA Conference hosted by QUT in Brisbane. This event was promoted as ‘a conference exploring the links between theatre scholarship and professional theatre practice’. As well as academics, there was to be substantial representation by ‘industry professionals’, although interest from the latter category turned out to be modest. One day of the conference was designated a special ‘Links with Industry’ day, during which the Association launched its now defunct ADSAIL (ADSA Industry Links) initiative. Keynote speaker Wesley Enoch commented on ‘the very strong resistance in “the industry” to acknowledging any role of academics’. ‘What is the practical role of having them?’ he asked the ‘them’ gathered before him. In a letter declining our invitation to speak (he later changed his mind), David Williamson remarked that he always felt ‘uneasy at such conferences’: My view of my work is that I’ve successfully filled theatres for 30 years now, something dramatists are supposed to do. I suppose there’s part of me that hopes this will be celebrated. It often is, but rarely in academic drama departments …. Perhaps in fifty years time someone in academe will realise that I wasn’t just reinforcing the attitudes of the Anglo Celtic ruling class. Several years on it seems timely to revisit Industrial Relations; to look again at the extent to which problems of intercultural communication between industry and academy are being addressed. And what are the implications of this for the ADSA History project, which seeks to investigate ADSA’s contribution to the development of theatre / performance studies in Australasia? What are the ‘external’ impacts of ADSA’s ongoing conference enterprise, and how might these be measured? Reflections from delegates on these and other questions will be warmly encouraged.
Abstract:
Purpose: The classic study of Sumby and Pollack (1954, JASA, 26(2), 212-215) demonstrated that visual information aided speech intelligibility under noisy auditory conditions. Their work showed that visual information is especially useful under low signal-to-noise conditions where the auditory signal leaves greater margins for improvement. We investigated whether simulated cataracts interfered with the ability of participants to use visual cues to help disambiguate the auditory signal in the presence of auditory noise. Methods: Participants in the study were screened to ensure normal visual acuity (mean of 20/20) and normal hearing (auditory threshold ≤ 20 dB HL). Speech intelligibility was tested under an auditory-only condition and two visual conditions: normal vision and simulated cataracts. The light-scattering effects of cataracts were imitated using cataract-simulating filters. Participants wore blacked-out glasses in the auditory-only condition and lens-free frames in the normal auditory-visual condition. Individual sentences were spoken by a live speaker in the presence of prerecorded four-person background babble set to a speech-to-noise ratio (SNR) of -16 dB. The SNR was determined in a preliminary experiment to support 50% correct identification of sentences under the auditory-only condition. The speaker was trained to match the rate, intensity and inflections of a prerecorded audio track of everyday speech sentences. The speaker was blind to the visual conditions of the participant to control for bias. Participants’ speech intelligibility was measured by comparing their written account of what they believed the speaker to have said with the actual spoken sentence. Results: Relative to the normal vision condition, speech intelligibility was significantly poorer when participants wore simulated cataracts. Conclusions: The results suggest that cataracts may interfere with the acquisition of visual cues to speech perception.
Abstract:
This paper proposes a clustered approach for blind beamforming from ad-hoc microphone arrays. In such arrangements, microphone placement is arbitrary and the speaker may be close to one, all or a subset of microphones at a given time. Practical issues with such a configuration mean that some microphones might be better discarded due to poor input signal-to-noise ratio (SNR) or undesirable spatial aliasing effects from large inter-element spacings when beamforming. Large inter-microphone spacings may also lead to inaccuracies in delay estimation during blind beamforming. In such situations, using a cluster of microphones (i.e., a sub-array), closely located both to each other and to the desired speech source, may provide more robust enhancement than the full array. This paper proposes a method for blind clustering of microphones based on the magnitude squared coherence function, and evaluates the method on a database recorded using various ad-hoc microphone arrangements.
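The abstract does not say how coherence values are turned into clusters, so the following is only a minimal sketch of one plausible reading. It assumes a precomputed symmetric matrix of pairwise magnitude squared coherence values (averaged over frequency, in the range 0 to 1) and groups microphones whose coherence exceeds a threshold. The function name `cluster_by_coherence`, the threshold and the example matrix are all hypothetical and are not taken from the paper.

```python
def cluster_by_coherence(coh, threshold):
    """Group microphone indices whose pairwise coherence is >= threshold.

    coh is a symmetric n x n list of lists (hypothetical input).
    Clusters are the connected components of the thresholded graph,
    found with a small union-find structure.
    """
    n = len(coh)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if coh[i][j] >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Illustrative values only: mics 0 and 1 are close together, as are mics 2 and 3.
example_coherence = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]

if __name__ == "__main__":
    print(cluster_by_coherence(example_coherence, threshold=0.5))
```

Connected components are the simplest possible grouping rule. A real system would also need to estimate the coherence from the recordings and choose the threshold, neither of which is specified here.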
Abstract:
With the emergence of multi-cores into the mainstream, there is a growing need for systems to allow programmers and automated systems to reason about data dependencies and inherent parallelism in imperative object-oriented languages. In this paper we exploit the structure of object-oriented programs to abstract computational side-effects. We capture and validate these effects using a static type system. We use these as the basis of sufficient conditions for several different data and task parallelism patterns. We complement our static type system with a lightweight runtime system to allow for parallelization in the presence of complex data flows. We have a functioning compiler and worked examples to demonstrate the practicality of our solution.
Abstract:
Information fusion in biometrics has received considerable attention. The architecture proposed here is based on the sequential integration of multi-instance and multi-sample fusion schemes. This method is analytically shown to improve the performance and allow a controlled trade-off between false alarms and false rejects when the classifier decisions are statistically independent. Equations developed for detection error rates are experimentally evaluated by considering the proposed architecture for text-dependent speaker verification using HMM-based digit-dependent speaker models. The tuning of parameters, n classifiers and m attempts/samples, is investigated and the resultant detection error trade-off performance is evaluated on individual digits. Results show that performance improvement can be achieved even for weaker classifiers (FRR 19.6%, FAR 16.7%). The architectures investigated apply to speaker verification from spoken digit strings such as credit card numbers in telephone, VoIP or internet-based applications.
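The abstract does not reproduce its error-rate equations, so the sketch below shows only the textbook independence calculation under an assumed decision rule: all n classifiers must accept (AND rule across instances), and a claimant is accepted if any of m attempts succeeds (OR rule across samples). Both rules and the function names are assumptions made for illustration. Only the single-classifier rates (FRR 19.6%, FAR 16.7%) come from the abstract.

```python
def and_fusion(far, frr, n):
    """Error rates when n independent classifiers must ALL accept (assumed rule)."""
    fused_far = far ** n                  # an impostor must fool every classifier
    fused_frr = 1.0 - (1.0 - frr) ** n    # a genuine user fails if any one rejects
    return fused_far, fused_frr


def or_over_attempts(far, frr, m):
    """Error rates when ANY of m independent attempts may succeed (assumed rule)."""
    fused_far = 1.0 - (1.0 - far) ** m    # an impostor gets m chances
    fused_frr = frr ** m                  # a genuine user must fail every attempt
    return fused_far, fused_frr


def sequential_fusion(far, frr, n, m):
    """Apply the AND rule over n classifiers, then the OR rule over m attempts."""
    far_n, frr_n = and_fusion(far, frr, n)
    return or_over_attempts(far_n, frr_n, m)


if __name__ == "__main__":
    # Single-classifier rates quoted in the abstract for a weaker classifier.
    base_far, base_frr = 0.167, 0.196
    for n, m in [(1, 1), (2, 2), (3, 3), (4, 4)]:
        far, frr = sequential_fusion(base_far, base_frr, n, m)
        print(f"n={n} m={m} FAR={far:.4f} FRR={frr:.4f}")
```

Under these assumed rules, raising n lowers false accepts at the cost of false rejects, and raising m does the opposite. This is the kind of controlled trade-off the abstract describes, although the authors' actual scheme may differ.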
Abstract:
Microphone arrays have been used in various applications to capture conversations, such as in meetings and teleconferences. In many cases, the microphone and likely source locations are known a priori, and calculating beamforming filters is therefore straightforward. In ad-hoc situations, however, when the microphones have not been systematically positioned, this information is not available and beamforming must be achieved blindly. In achieving this, a commonly neglected issue is whether it is optimal to use all of the available microphones, or only an advantageous subset of these. This paper commences by reviewing different approaches to blind beamforming, characterising them by the way they estimate the signal propagation vector and the spatial coherence of noise in the absence of prior knowledge of microphone and speaker locations. Following this, a novel clustered approach to blind beamforming is motivated and developed. Without using any prior geometrical information, microphones are first grouped into localised clusters, which are then ranked according to their relative distance from a speaker. Beamforming is then performed using either the closest microphone cluster, or a weighted combination of clusters. The clustered algorithms are compared to the full set of microphones in experiments on a database recorded on different ad-hoc array geometries. These experiments evaluate the methods in terms of signal enhancement as well as performance on a large vocabulary speech recognition task.
Abstract:
While close talking microphones give the best signal quality and produce the highest accuracy from current Automatic Speech Recognition (ASR) systems, the speech signal enhanced by microphone array has been shown to be an effective alternative in a noisy environment. The use of microphone arrays in contrast to close talking microphones alleviates the feeling of discomfort and distraction to the user. For this reason, microphone arrays are popular and have been used in a wide range of applications such as teleconferencing, hearing aids, speaker tracking, and as the front-end to speech recognition systems. With advances in sensor and sensor network technology, there is considerable potential for applications that employ ad-hoc networks of microphone-equipped devices collaboratively as a virtual microphone array. By allowing such devices to be distributed throughout the users’ environment, the microphone positions are no longer constrained to traditional fixed geometrical arrangements. This flexibility in the means of data acquisition allows different audio scenes to be captured to give a complete picture of the working environment. In such ad-hoc deployment of microphone sensors, however, the lack of information about the location of devices and active speakers poses technical challenges for array signal processing algorithms which must be addressed to allow deployment in real-world applications. While not an ad-hoc sensor network, conditions approaching this have in effect been imposed in recent National Institute of Standards and Technology (NIST) ASR evaluations on distant microphone recordings of meetings. The NIST evaluation data comes from multiple sites, each with different and often loosely specified distant microphone configurations. This research investigates how microphone array methods can be applied for ad-hoc microphone arrays. 
A particular focus is on devising methods that are robust to unknown microphone placements in order to improve the overall speech quality and recognition performance provided by the beamforming algorithms. In ad-hoc situations, microphone positions and likely source locations are not known and beamforming must be achieved blindly. There are two general approaches that can be employed to blindly estimate the steering vector for beamforming. The first is direct estimation without regard to the microphone and source locations. An alternative approach is instead to first determine the unknown microphone positions through array calibration methods and then to use the traditional geometrical formulation for the steering vector. Following these two major approaches investigated in this thesis, a novel clustered approach is proposed, which clusters the microphones and selects clusters based on their proximity to the speaker. Novel experiments are conducted to demonstrate that the proposed method of automatically selecting clusters of microphones (i.e., a subarray), closely located both to each other and to the desired speech source, may in fact provide more robust speech enhancement and recognition than the full array.
Abstract:
Traditional speech enhancement methods optimise signal-level criteria such as signal-to-noise ratio, but these approaches are sub-optimal for noise-robust speech recognition. Likelihood-maximising (LIMA) frameworks are an alternative that optimise parameters of enhancement algorithms based on state sequences generated for utterances with known transcriptions. Previous reports of LIMA frameworks have shown significant promise for improving speech recognition accuracies under additive background noise for a range of speech enhancement techniques. In this paper we discuss the drawbacks of the LIMA approach when multiple layers of acoustic mismatch are present – namely background noise and speaker accent. Experimentation using LIMA-based Mel-filterbank noise subtraction on American and Australian English in-car speech databases supports this discussion, demonstrating that inferior speech recognition performance occurs when a second layer of mismatch is seen during evaluation.
Abstract:
Hydrogel polymers are used for the manufacture of soft (or disposable) contact lenses worldwide today, but have a tendency to dehydrate on the eye. In vitro methods that can probe the potential for a given hydrogel polymer to dehydrate in vivo are much sought after. Nuclear magnetic resonance (NMR) has been shown to be effective in characterising water mobility and binding in similar systems (Barbieri, Quaglia et al., 1998; Larsen, Huff et al., 1990; Peschier, Bouwstra et al., 1993), predominantly through measurement of the spin-lattice relaxation time (T1), the spin-spin relaxation time (T2) and the water diffusion coefficient (D). The aim of this work was to use NMR to quantify the molecular behaviour of water in a series of commercially available contact lens hydrogels, and relate these measurements to the binding and mobility of the water, and ultimately the potential for the hydrogel to dehydrate. As a preliminary study, in vitro evaporation rates were measured for a set of commercial contact lens hydrogels. Following this, comprehensive measurement of the temperature and water content dependencies of T1, T2 and D was performed for a series of commercial hydrogels that spanned the spectrum of equilibrium water content (EWC) and common compositions of contact lenses that are manufactured today. To quantify material differences, the data were then modelled based on theory that had been used for similar systems in the literature (Walker, Balmer et al., 1989; Hills, Takacs et al., 1989). The differences were related to differences in water binding and mobility. The evaporative results suggested that the EWC of the material was important in determining a material's potential to dehydrate in this way. Similarly, the NMR water self-diffusion coefficient was also found to be largely (if not wholly) determined by the water content.
A specific binding model confirmed that the water content was the dominant factor in determining the diffusive behaviour, but also suggested that subtle differences existed between the materials used, based on their equilibrium water content (EWC). However, an alternative modified free volume model suggested that only the current water content of the material was important in determining the diffusive behaviour, and not the equilibrium water content. It was shown that T2 relaxation was dominated by chemical exchange between water and exchangeable polymer protons for materials that contained exchangeable polymer protons. The data were analysed using a proton exchange model, and the results were again reasonably correlated with EWC. Specifically, it was found that the average water mobility increased with increasing EWC, approaching that of free water. The T1 relaxation was also shown to be reasonably well described by the same model. The main conclusion that can be drawn from this work is that the hydrogel EWC is an important parameter, which largely determines the behaviour of water in the gel. Higher EWC results in a hydrogel with water that behaves more like bulk water on average, or is less strongly 'bound' on average, compared with a lower EWC material. Based on the set of materials used, significant differences due to composition (for materials of the same or similar water content) could not be found. Similar studies could be used in the future to highlight hydrogels that deviate significantly from this 'average' behaviour, and may therefore have the least/greatest potential to dehydrate on the eye.
Abstract:
Automatic spoken Language Identification (LID) is the process of identifying the language spoken within an utterance. The challenge that this task presents is that no prior information is available indicating the content of the utterance or the identity of the speaker. The trend of globalization and the pervasive popularity of the Internet will amplify the need for the capabilities spoken language identification systems provide. A prominent application arises in call centers dealing with speakers speaking different languages. Another important application is to index or search huge speech data archives and corpora that contain multiple languages. The aim of this research is to develop techniques targeted at producing a fast and more accurate automatic spoken LID system compared to the previous National Institute of Standards and Technology (NIST) Language Recognition Evaluation. Acoustic and phonetic speech information are targeted as the most suitable features for representing the characteristics of a language. To model the acoustic speech features a Gaussian Mixture Model based approach is employed. Phonetic speech information is extracted using existing speech recognition technology. Various techniques to improve LID accuracy are also studied. One approach examined is the employment of Vocal Tract Length Normalization to reduce the speech variation caused by different speakers. A linear data fusion technique is adopted to combine the various aspects of information extracted from speech. As a result of this research, an LID system was implemented and presented for evaluation in the 2003 Language Recognition Evaluation conducted by the NIST.
Abstract:
In this paper, we present a microphone array beamforming approach to blind speech separation. Unlike previous beamforming approaches, our system does not require a priori knowledge of the microphone placement and speaker location, making the system directly comparable to other blind source separation methods which require no prior knowledge of recording conditions. Microphone location is automatically estimated using an assumed noise field model, and speaker locations are estimated using cross-correlation-based methods. The system is evaluated on the data provided for the PASCAL Speech Separation Challenge 2 (SSC2), achieving a word error rate of 58% on the evaluation set.
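The abstract mentions cross-correlation-based estimation without naming a specific algorithm. As a minimal sketch of the general idea, the code below estimates the delay between two microphone signals by picking the lag at which their plain time-domain cross-correlation peaks. This is not necessarily the variant the authors used (weighted forms such as GCC-PHAT are common). The function names, the toy signals and the 16 kHz sample rate are assumptions for illustration.

```python
def cross_correlation_delay(x, y, max_lag):
    """Return the lag (in samples) that maximises sum_n x[n] * y[n + lag].

    A positive result means y is a delayed copy of x.
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for n in range(len(x)):
            k = n + lag
            if 0 <= k < len(y):
                total += x[n] * y[k]
        if total > best_value:
            best_lag, best_value = lag, total
    return best_lag


def lag_to_seconds(lag, sample_rate_hz):
    """Convert a lag in samples to seconds for a given sample rate."""
    return lag / sample_rate_hz


# Toy signals: mic_b receives the same pulse as mic_a, three samples later.
mic_a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]

if __name__ == "__main__":
    lag = cross_correlation_delay(mic_a, mic_b, max_lag=5)
    # 16 kHz is an assumed sample rate, not a value from the abstract.
    print(lag, lag_to_seconds(lag, 16000))
```

Delays estimated this way for several microphone pairs, combined with microphone positions, can constrain where the speaker is. That step depends on the array geometry and is outside this sketch.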