268 resultados para ASR
Resumo:
Reinforced concrete structures are susceptible to a variety of deterioration mechanisms due to creep and shrinkage, alkali-silica reaction (ASR), carbonation, and corrosion of the reinforcement. The deterioration problems can affect the integrity and load carrying capacity of the structure. Substantial research has been dedicated to these various mechanisms aiming to identify the causes, reactions, accelerants, retardants and consequences. This has improved our understanding of the long-term behaviour of reinforced concrete structures. However, the strengthening of reinforced concrete structures for durability has to date been mainly undertaken after expert assessment of field data followed by the development of a scheme to both terminate continuing degradation, by separating the structure from the environment, and strengthening the structure. The process does not include any significant consideration of the residual load-bearing capacity of the structure and the highly variable nature of estimates of such remaining capacity. Development of performance curves for deteriorating bridge structures has not been attempted due to the difficulty in developing a model when the input parameters have an extremely large variability. This paper presents a framework developed for an asset management system which assesses residual capacity and identifies the most appropriate rehabilitation method for a given reinforced concrete structure exposed to aggressive environments. In developing the framework, several industry consultation sessions have been conducted to identify input data required, research methodology and output knowledge base. Capturing expert opinion in a useable knowledge base requires development of a rule based formulation, which can subsequently be used to model the reliability of the performance curve of a reinforced concrete structure exposed to a given environment.
Resumo:
The purpose of this chapter is to describe the use of caricatured contrasting scenarios (Bødker, 2000) and how they can be used to consider potential designs for disruptive technologies. The disruptive technology in this case is Automatic Speech Recognition (ASR) software in workplace settings. The particular workplace is the Magistrates Court of the Australian Capital Territory.----- Caricatured contrasting scenarios are ideally suited to exploring how ASR might be implemented in a particular setting because they allow potential implementations to be “sketched” quickly and with little effort. This sketching of potential interactions and the emphasis of both positive and negative outcomes allows the benefits and pitfalls of design decisions to become apparent.----- A brief description of the Court is given, describing the reasons for choosing the Court for this case study. The work of the Court is framed as taking place in two modes: Front of house, where the courtroom itself is, and backstage, where documents are processed and the business of the court is recorded and encoded into various systems.----- Caricatured contrasting scenarios describing the introduction of ASR to the front of house are presented and then analysed. These scenarios show that the introduction of ASR to the court would be highly problematic.----- The final section describes how ASR could be re-imagined in order to make it useful for the court. A final scenario is presented that describes how this re-imagined ASR could be integrated into both the front of house and backstage of the court in a way that could strengthen both processes.
Resumo:
Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In “noise-free” environments, word recognition performance of these systems has been shown to approach 100%, however this performance degrades rapidly as the level of background noise is increased. Speech enhancement is a popular method for making ASR systems more ro- bust. Single-channel spectral subtraction was originally designed to improve hu- man speech intelligibility and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics are used to assess en- hancement performance for intelligibility not speech recognition, therefore mak- ing them sub-optimal ASR applications. This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally-efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) in- troducing phase spectrum information to enable spectral subtraction in the com- plex frequency domain. Likelihood-maximisation uses gradient-descent to optimise parameters of the enhancement algorithm to best fit the acoustic speech model given a word se- quence known a priori. Whilst this technique is shown to improve the ASR word accuracy performance, it is also identified to be particularly sensitive to non-noise mismatches between the training and testing data. Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of clean speech magnitudes which are typically used in ASR feature extraction. Phase Estimation via Delay Projection is proposed based on the stationarity of sinusoidal signals, and demonstrates the potential to produce improvements in ASR word accuracy in a wide range of SNR. Throughout the dissertation, consideration is given to practical implemen- tation in vehicular environments which resulted in two novel contributions – a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware. The techniques proposed in this dissertation were evaluated using the Aus- tralian English In-Car Speech Corpus which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.
Resumo:
While close talking microphones give the best signal quality and produce the highest accuracy from current Automatic Speech Recognition (ASR) systems, the speech signal enhanced by microphone array has been shown to be an effective alternative in a noisy environment. The use of microphone arrays in contrast to close talking microphones alleviates the feeling of discomfort and distraction to the user. For this reason, microphone arrays are popular and have been used in a wide range of applications such as teleconferencing, hearing aids, speaker tracking, and as the front-end to speech recognition systems. With advances in sensor and sensor network technology, there is considerable potential for applications that employ ad-hoc networks of microphone-equipped devices collaboratively as a virtual microphone array. By allowing such devices to be distributed throughout the users’ environment, the microphone positions are no longer constrained to traditional fixed geometrical arrangements. This flexibility in the means of data acquisition allows different audio scenes to be captured to give a complete picture of the working environment. In such ad-hoc deployment of microphone sensors, however, the lack of information about the location of devices and active speakers poses technical challenges for array signal processing algorithms which must be addressed to allow deployment in real-world applications. While not an ad-hoc sensor network, conditions approaching this have in effect been imposed in recent National Institute of Standards and Technology (NIST) ASR evaluations on distant microphone recordings of meetings. The NIST evaluation data comes from multiple sites, each with different and often loosely specified distant microphone configurations. This research investigates how microphone array methods can be applied for ad-hoc microphone arrays. A particular focus is on devising methods that are robust to unknown microphone placements in order to improve the overall speech quality and recognition performance provided by the beamforming algorithms. In ad-hoc situations, microphone positions and likely source locations are not known and beamforming must be achieved blindly. There are two general approaches that can be employed to blindly estimate the steering vector for beamforming. The first is direct estimation without regard to the microphone and source locations. An alternative approach is instead to first determine the unknown microphone positions through array calibration methods and then to use the traditional geometrical formulation for the steering vector. Following these two major approaches investigated in this thesis, a novel clustered approach which includes clustering the microphones and selecting the clusters based on their proximity to the speaker is proposed. Novel experiments are conducted to demonstrate that the proposed method to automatically select clusters of microphones (ie, a subarray), closely located both to each other and to the desired speech source, may in fact provide a more robust speech enhancement and recognition than the full array could.
Resumo:
In recent times, the improved levels of accuracy obtained by Automatic Speech Recognition (ASR) technology has made it viable for use in a number of commercial products. Unfortunately, these types of applications are limited to only a few of the world’s languages, primarily because ASR development is reliant on the availability of large amounts of language specific resources. This motivates the need for techniques which reduce this language-specific, resource dependency. Ideally, these approaches should generalise across languages, thereby providing scope for rapid creation of ASR capabilities for resource poor languages. Cross Lingual ASR emerges as a means for addressing this need. Underpinning this approach is the observation that sound production is largely influenced by the physiological construction of the vocal tract, and accordingly, is human, and not language specific. As a result, a common inventory of sounds exists across languages; a property which is exploitable, as sounds from a resource poor, target language can be recognised using models trained on resource rich, source languages. One of the initial impediments to the commercial uptake of ASR technology was its fragility in more challenging environments, such as conversational telephone speech. Subsequent improvements in these environments has gained consumer confidence. Pragmatically, if cross lingual techniques are to considered a viable alternative when resources are limited, they need to perform under the same types of conditions. Accordingly, this thesis evaluates cross lingual techniques using two speech environments; clean read speech and conversational telephone speech. Languages used in evaluations are German, Mandarin, Japanese and Spanish. Results highlight that previously proposed approaches provide respectable results for simpler environments such as read speech, but degrade significantly when in the more taxing conversational environment. Two separate approaches for addressing this degradation are proposed. The first is based on deriving better target language lexical representation, in terms of the source language model set. The second, and ultimately more successful approach, focuses on improving the classification accuracy of context-dependent (CD) models, by catering for the adverse influence of languages specific phonotactic properties. Whilst the primary research goal in this thesis is directed towards improving cross lingual techniques, the catalyst for investigating its use was based on expressed interest from several organisations for an Indonesian ASR capability. In Indonesia alone, there are over 200 million speakers of some Malay variant, provides further impetus and commercial justification for speech related research on this language. Unfortunately, at the beginning of the candidature, limited research had been conducted on the Indonesian language in the field of speech science, and virtually no resources existed. This thesis details the investigative and development work dedicated towards obtaining an ASR system with a 10000 word recognition vocabulary for the Indonesian language.
Resumo:
In this paper we propose a new method for utilising phase information by complementing it with traditional magnitude-only spectral subtraction speech enhancement through Complex Spectrum Subtraction (CSS). The proposed approach has the following advantages over traditional magnitude-only spectral subtraction: (a) it introduces complementary information to the enhancement algorithm; (b) it reduces the total number of algorithmic parameters, and; (c) is designed for improving clean speech magnitude spectra and is therefore suitable for both automatic speech recognition (ASR) and speech perception applications. Oracle-based ASR experiments verify this approach, showing an average of 20% relative word accuracy improvements when accurate estimates of the phase spectrum are available. Based on sinusoidal analysis and assuming stationarity between observations (which is shown to be better approximated as the frame rate is increased), this paper also proposes a novel method for acquiring the phase information called Phase Estimation via Delay Projection (PEDEP). Further oracle ASR experiments validate the potential for the proposed PEDEP technique in ideal conditions. Realistic implementation of CSS with PEDEP shows performance comparable to state of the art spectral subtraction techniques in a range of 15-20 dB signal-to-noise ratio environments. These results clearly demonstrate the potential for using phase spectra in spectral subtractive enhancement applications, and at the same time highlight the need for deriving more accurate phase estimates in a wider range of noise conditions.
Resumo:
Speaker diarization is the process of annotating an input audio with information that attributes temporal regions of the audio signal to their respective sources, which may include both speech and non-speech events. For speech regions, the diarization system also specifies the locations of speaker boundaries and assign relative speaker labels to each homogeneous segment of speech. In short, speaker diarization systems effectively answer the question of ‘who spoke when’. There are several important applications for speaker diarization technology, such as facilitating speaker indexing systems to allow users to directly access the relevant segments of interest within a given audio, and assisting with other downstream processes such as summarizing and parsing. When combined with automatic speech recognition (ASR) systems, the metadata extracted from a speaker diarization system can provide complementary information for ASR transcripts including the location of speaker turns and relative speaker segment labels, making the transcripts more readable. Speaker diarization output can also be used to localize the instances of specific speakers to pool data for model adaptation, which in turn boosts transcription accuracies. Speaker diarization therefore plays an important role as a preliminary step in automatic transcription of audio data. The aim of this work is to improve the usefulness and practicality of speaker diarization technology, through the reduction of diarization error rates. In particular, this research is focused on the segmentation and clustering stages within a diarization system. Although particular emphasis is placed on the broadcast news audio domain and systems developed throughout this work are also trained and tested on broadcast news data, the techniques proposed in this dissertation are also applicable to other domains including telephone conversations and meetings audio. Three main research themes were pursued: heuristic rules for speaker segmentation, modelling uncertainty in speaker model estimates, and modelling uncertainty in eigenvoice speaker modelling. The use of heuristic approaches for the speaker segmentation task was first investigated, with emphasis placed on minimizing missed boundary detections. A set of heuristic rules was proposed, to govern the detection and heuristic selection of candidate speaker segment boundaries. A second pass, using the same heuristic algorithm with a smaller window, was also proposed with the aim of improving detection of boundaries around short speaker segments. Compared to single threshold based methods, the proposed heuristic approach was shown to provide improved segmentation performance, leading to a reduction in the overall diarization error rate. Methods to model the uncertainty in speaker model estimates were developed, to address the difficulties associated with making segmentation and clustering decisions with limited data in the speaker segments. The Bayes factor, derived specifically for multivariate Gaussian speaker modelling, was introduced to account for the uncertainty of the speaker model estimates. The use of the Bayes factor also enabled the incorporation of prior information regarding the audio to aid segmentation and clustering decisions. The idea of modelling uncertainty in speaker model estimates was also extended to the eigenvoice speaker modelling framework for the speaker clustering task. Building on the application of Bayesian approaches to the speaker diarization problem, the proposed approach takes into account the uncertainty associated with the explicit estimation of the speaker factors. The proposed decision criteria, based on Bayesian theory, was shown to generally outperform their non- Bayesian counterparts.
Resumo:
Global Navigation Satellite Systems (GNSS)-based observation systems can provide high precision positioning and navigation solutions in real time, in the order of subcentimetre if we make use of carrier phase measurements in the differential mode and deal with all the bias and noise terms well. However, these carrier phase measurements are ambiguous due to unknown, integer numbers of cycles. One key challenge in the differential carrier phase mode is to fix the integer ambiguities correctly. On the other hand, in the safety of life or liability-critical applications, such as for vehicle safety positioning and aviation, not only is high accuracy required, but also the reliability requirement is important. This PhD research studies to achieve high reliability for ambiguity resolution (AR) in a multi-GNSS environment. GNSS ambiguity estimation and validation problems are the focus of the research effort. Particularly, we study the case of multiple constellations that include initial to full operations of foreseeable Galileo, GLONASS and Compass and QZSS navigation systems from next few years to the end of the decade. Since real observation data is only available from GPS and GLONASS systems, the simulation method named Virtual Galileo Constellation (VGC) is applied to generate observational data from another constellation in the data analysis. In addition, both full ambiguity resolution (FAR) and partial ambiguity resolution (PAR) algorithms are used in processing single and dual constellation data. Firstly, a brief overview of related work on AR methods and reliability theory is given. Next, a modified inverse integer Cholesky decorrelation method and its performance on AR are presented. Subsequently, a new measure of decorrelation performance called orthogonality defect is introduced and compared with other measures. Furthermore, a new AR scheme considering the ambiguity validation requirement in the control of the search space size is proposed to improve the search efficiency. With respect to the reliability of AR, we also discuss the computation of the ambiguity success rate (ASR) and confirm that the success rate computed with the integer bootstrapping method is quite a sharp approximation to the actual integer least-squares (ILS) method success rate. The advantages of multi-GNSS constellations are examined in terms of the PAR technique involving the predefined ASR. Finally, a novel satellite selection algorithm for reliable ambiguity resolution called SARA is developed. In summary, the study demonstrats that when the ASR is close to one, the reliability of AR can be guaranteed and the ambiguity validation is effective. The work then focuses on new strategies to improve the ASR, including a partial ambiguity resolution procedure with a predefined success rate and a novel satellite selection strategy with a high success rate. The proposed strategies bring significant benefits of multi-GNSS signals to real-time high precision and high reliability positioning services.
Resumo:
Reliability of carrier phase ambiguity resolution (AR) of an integer least-squares (ILS) problem depends on ambiguity success rate (ASR), which in practice can be well approximated by the success probability of integer bootstrapping solutions. With the current GPS constellation, sufficiently high ASR of geometry-based model can only be achievable at certain percentage of time. As a result, high reliability of AR cannot be assured by the single constellation. In the event of dual constellations system (DCS), for example, GPS and Beidou, which provide more satellites in view, users can expect significant performance benefits such as AR reliability and high precision positioning solutions. Simply using all the satellites in view for AR and positioning is a straightforward solution, but does not necessarily lead to high reliability as it is hoped. The paper presents an alternative approach that selects a subset of the visible satellites to achieve a higher reliability performance of the AR solutions in a multi-GNSS environment, instead of using all the satellites. Traditionally, satellite selection algorithms are mostly based on the position dilution of precision (PDOP) in order to meet accuracy requirements. In this contribution, some reliability criteria are introduced for GNSS satellite selection, and a novel satellite selection algorithm for reliable ambiguity resolution (SARA) is developed. The SARA algorithm allows receivers to select a subset of satellites for achieving high ASR such as above 0.99. Numerical results from a simulated dual constellation cases show that with the SARA procedure, the percentages of ASR values in excess of 0.99 and the percentages of ratio-test values passing the threshold 3 are both higher than those directly using all satellites in view, particularly in the case of dual-constellation, the percentages of ASRs (>0.99) and ratio-test values (>3) could be as high as 98.0 and 98.5 % respectively, compared to 18.1 and 25.0 % without satellite selection process. It is also worth noting that the implementation of SARA is simple and the computation time is low, which can be applied in most real-time data processing applications.
Resumo:
Speech recognition in car environments has been identified as a valuable means for reducing driver distraction when operating noncritical in-car systems. Under such conditions, however, speech recognition accuracy degrades significantly, and techniques such as speech enhancement are required to improve these accuracies. Likelihood-maximizing (LIMA) frameworks optimize speech enhancement algorithms based on recognized state sequences rather than traditional signal-level criteria such as maximizing signal-to-noise ratio. LIMA frameworks typically require calibration utterances to generate optimized enhancement parameters that are used for all subsequent utterances. Under such a scheme, suboptimal recognition performance occurs in noise conditions that are significantly different from that present during the calibration session – a serious problem in rapidly changing noise environments out on the open road. In this chapter, we propose a dialog-based design that allows regular optimization iterations in order to track the ever-changing noise conditions. Experiments using Mel-filterbank noise subtraction (MFNS) are performed to determine the optimization requirements for vehicular environments and show that minimal optimization is required to improve speech recognition, avoid over-optimization, and ultimately assist with semireal-time operation. It is also shown that the proposed design is able to provide improved recognition performance over frameworks incorporating a calibration session only.
Resumo:
Precise satellite orbit and clocks are essential for providing high accuracy real-time PPP (Precise Point Positioning) service. However, by treating the predicted orbits as fixed, the orbital errors may be partially assimilated by the estimated satellite clock and hence impact the positioning solutions. This paper presents the impact analysis of errors in radial and tangential orbital components on the estimation of satellite clocks and PPP through theoretical study and experimental evaluation. The relationship between the compensation of the orbital errors by the satellite clocks and the satellite-station geometry is discussed in details. Based on the satellite clocks estimated with regional station networks of different sizes (∼100, ∼300, ∼500 and ∼700 km in radius), results indicated that the orbital errors compensated by the satellite clock estimates reduce as the size of the network increases. An interesting regional PPP mode based on the broadcast ephemeris and the corresponding estimated satellite clocks is proposed and evaluated through the numerical study. The impact of orbital errors in the broadcast ephemeris has shown to be negligible for PPP users in a regional network of a radius of ∼300 km, with positioning RMS of about 1.4, 1.4 and 3.7 cm for east, north and up component in the post-mission kinematic mode, comparable with 1.3, 1.3 and 3.6 cm using the precise orbits and the corresponding estimated clocks. Compared with the DGPS and RTK positioning, only the estimated satellite clocks are needed to be disseminated to PPP users for this approach. It can significantly alleviate the communication burdens and therefore can be beneficial to the real time applications.
Resumo:
Harmful algal blooms (HABs) are truly global marine phenomena of increasing significance. Some HAB occurrences are different to observe because of their high spatial and temporal variability and their advection, once formed, by surface currents. A serious HAB occurred in the Bohai Sea during autumn 1998, causing the largest fisheries economic loss. The present study analyzes the formation, distribution, and advection of HAB using satellite SeaWiFS ocean color data and other oceanographic data. The results show that the bloom originated in the western coastal waters of the Bohai Sea in early September, and developed southeastward when sea surface temperature (SST) increased to 25-26 °C. The bloom with a high Chl-a concentration (6.5 mg m-3) in center portion covered an area of 60 × 65 km2. At the end of September, the bloom decayed when SST decreased to 22-23 °C. The HAB may have been initiated by a combination of the river discharge nutrients in the west coastal waters and the increase of SST; afterwards it may have been transported eastward by the local circulation that was enhanced by northwesterly winds in late September and early October.
Resumo:
We are addressing the problem of jointly using multiple noisy speech patterns for automatic speech recognition (ASR), given that they come from the same class. If the user utters a word K times, the ASR system should try to use the information content in all the K patterns of the word simultaneously and improve its speech recognition accuracy compared to that of the single pattern based speech recognition. T address this problem, recently we proposed a Multi Pattern Dynamic Time Warping (MPDTW) algorithm to align the K patterns by finding the least distortion path between them. A Constrained Multi Pattern Viterbi algorithm was used on this aligned path for isolated word recognition (IWR). In this paper, we explore the possibility of using only the MPDTW algorithm for IWR. We also study the properties of the MPDTW algorithm. We show that using only 2 noisy test patterns (10 percent burst noise at -5 dB SNR) reduces the noisy speech recognition error rate by 37.66 percent when compared to the single pattern recognition using the Dynamic Time Warping algorithm.