893 results for noisy speaker verification
Abstract:
We present a clustering-only approach to the problem of speaker diarization to eliminate the need for the commonly employed and computationally expensive Viterbi segmentation and realignment stage. We use multiple linear segmentations of a recording and carry out complete-linkage clustering within each segmentation scenario to obtain a set of clustering decisions for each case. We then collect all clustering decisions, across all cases, to compute a pairwise vote between the segments and conduct complete-linkage clustering to cluster them at a resolution equal to the minimum segment length used in the linear segmentations. We use our proposed cluster-voting approach to carry out speaker diarization and linking across the SAIVT-BNEWS corpus of Australian broadcast news data. We compare our technique to an equivalent baseline system with Viterbi realignment and show that our approach can outperform the baseline technique with respect to the diarization error rate (DER) and attribution error rate (AER).
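The vote-then-cluster step described above can be sketched compactly. The sketch below is an illustrative assumption of the interface (the vote matrix, the 0.5 cut threshold and the toy data are ours, not the paper's): pairwise votes accumulated across segmentation scenarios are turned into a dissimilarity, then complete-linkage clustering groups the minimum-length segments.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_by_votes(vote_counts, n_scenarios, threshold=0.5):
    """Cluster segments from pairwise votes accumulated across several
    linear-segmentation scenarios (hypothetical interface).

    vote_counts[i, j] = number of scenarios in which segments i and j
    ended up in the same cluster.
    """
    # Turn agreement votes into a dissimilarity in [0, 1].
    dissim = 1.0 - vote_counts / n_scenarios
    np.fill_diagonal(dissim, 0.0)
    condensed = squareform(dissim, checks=False)
    tree = linkage(condensed, method="complete")   # complete-linkage
    return fcluster(tree, t=threshold, criterion="distance")

# Toy example: 4 minimum-length segments voted on in 3 scenarios.
votes = np.array([[3, 3, 0, 0],
                  [3, 3, 1, 0],
                  [0, 1, 3, 2],
                  [0, 0, 2, 3]])
labels = cluster_by_votes(votes, n_scenarios=3)
```

Segments 0/1 and 2/3, which agree in most scenarios, land in the same clusters, while the weakly voted pairs stay apart.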
Abstract:
Visual information in the form of lip movements of the speaker has been shown to improve the performance of speech recognition and search applications. In our previous work, we proposed cross-database training of synchronous hidden Markov models (SHMMs) to make use of large, publicly available external audio databases in addition to the relatively small given audio-visual database. In this work, the cross-database training approach is improved by performing an additional audio adaptation step, which enables audio-visual SHMMs to benefit from audio observations of the external audio models before the visual modality is added to them. The proposed approach outperforms the baseline cross-database training approach in clean and noisy environments in terms of phone recognition accuracy as well as spoken term detection (STD) accuracy.
Abstract:
Spoken term detection (STD) is the task of looking up a spoken term in a large volume of speech segments. In order to provide fast search, speech segments are first indexed into an intermediate representation using speech recognition engines, which provide multiple hypotheses for each speech segment. Approximate matching techniques are usually applied at the search stage to compensate for the poor performance of automatic speech recognition engines during indexing. Recently, using visual information in addition to audio information has been shown to improve phone recognition performance, particularly in noisy environments. In this paper, we will make use of visual information in the form of lip movements of the speaker in the indexing stage and will investigate its effect on STD performance. In particular, we will investigate whether gains in phone recognition accuracy carry through the approximate matching stage to provide similar gains in the final audio-visual STD system over a traditional audio-only approach. We will also investigate the effect of using visual information on STD performance in different noise environments.
Abstract:
Speech recognition can be improved by using visual information in the form of lip movements of the speaker in addition to audio information. To date, state-of-the-art techniques for audio-visual speech recognition continue to use audio and visual data of the same database for training their models. In this paper, we present a new approach to make use of one modality of an external dataset in addition to a given audio-visual dataset. By so doing, it is possible to create more powerful models from other extensive audio-only databases and adapt them on our comparatively smaller multi-stream databases. Results show that the presented approach outperforms the widely adopted synchronous hidden Markov models (HMM) trained jointly on audio and visual data of a given audio-visual database for phone recognition by 29% relative. It also outperforms the external audio models trained on extensive external audio datasets and also internal audio models by 5.5% and 46% relative respectively. We also show that the proposed approach is beneficial in noisy environments where the audio source is affected by the environmental noise.
Abstract:
Piezoelectric polymers based on polyvinylidene fluoride (PVDF) are of interest as smart materials for novel space-based telescope applications. Dimensional adjustments of adaptive thin polymer films are achieved via controlled charge deposition. Predicting their long-term performance requires a detailed understanding of the piezoelectric property changes that develop during space environmental exposure. The overall materials performance is governed by a combination of chemical and physical degradation processes occurring in low Earth orbit as established by our past laboratory-based materials performance experiments (see report SAND 2005-6846). Molecular changes are primarily induced via radiative damage, and physical damage from temperature and atomic oxygen exposure is evident as depoling, loss of orientation and surface erosion. The current project extension has allowed us to design and fabricate small experimental units to be exposed to low Earth orbit environments as part of the Materials International Space Station Experiments program. The space exposure of these piezoelectric polymers will verify the observed trends and their degradation pathways, and provide feedback on using piezoelectric polymer films in space. This will be the first time that PVDF-based adaptive polymer films will be operated and exposed to combined atomic oxygen, solar UV and temperature variations in an actual space environment. The experiments are designed to be fully autonomous, involving cyclic application of excitation voltages, sensitive film position sensors and remote data logging. This mission will provide critically needed feedback on the long-term performance and degradation of such materials, and ultimately the feasibility of large adaptive and low weight optical systems utilizing these polymers in space.
Abstract:
In speaker adaptation for speech recognition, performance depends on the availability of adaptation data. In this paper, we have compared several existing speaker adaptation methods, viz. maximum likelihood linear regression (MLLR), eigenvoice (EV), eigenspace-based MLLR (EMLLR), segmental eigenvoice (SEV) and hierarchical eigenvoice (HEV) based methods. We also develop a new method by modifying the existing HEV method to achieve further performance improvement when only limited adaptation data is available. The new modified HEV (MHEV) method is shown to perform better than all the existing methods across the whole range of adaptation data availability, except for MLLR when more adaptation data is available.
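The eigenvoice family of methods above shares one core idea: a new speaker's mean supervector is constrained to the span of a few eigenvoices, so only a handful of weights must be estimated from scarce adaptation data. The sketch below is a deliberate simplification (least squares in place of the usual EM weight estimation, and toy dimensions); the function name and interface are our own, not from the paper.

```python
import numpy as np

def eigenvoice_adapt(eigenvoices, mean_sv, adapt_sv):
    """Least-squares sketch of eigenvoice (EV) adaptation: estimate K
    eigenvoice weights from adaptation statistics, then reconstruct the
    speaker-dependent mean supervector from the low-dimensional span.
    """
    # Solve eigenvoices.T @ w ~= (adapt_sv - mean_sv) in least squares.
    w, *_ = np.linalg.lstsq(eigenvoices.T, adapt_sv - mean_sv, rcond=None)
    return mean_sv + eigenvoices.T @ w

# Toy check: a "speaker" lying exactly in the span of K=2 eigenvoices
# (dimensions are arbitrary; real supervectors are far larger).
rng = np.random.default_rng(0)
E = rng.standard_normal((2, 6))              # K x D eigenvoice basis
mean_sv = rng.standard_normal(6)             # speaker-independent mean
true_sv = mean_sv + E.T @ np.array([0.5, -1.2])
adapted = eigenvoice_adapt(E, mean_sv, true_sv)
```

Because only K weights are free, the estimate is well-behaved even with very little adaptation data, which is exactly the regime the abstract targets.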
Abstract:
A computational algorithm (based on Smullyan's analytic tableau method) that verifies whether a given well-formed formula in propositional calculus is a tautology or not has been implemented on a DEC system 10. The stepwise refinement approach to program development used for this implementation forms the subject matter of this paper. The top-down design has resulted in a modular and reliable program package. This computational algorithm compares favourably with the algorithm based on the well-known resolution principle used in theorem provers.
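The tableau idea is easy to render in a modern language (the original was a DEC system 10 package; the formula encoding below is our own). To show F is a tautology, the tableau for ¬F is expanded rule by rule; if every branch closes on a complementary literal pair, no falsifying valuation exists.

```python
def is_tautology(formula):
    """Tautology check via Smullyan-style analytic tableau (sketch).
    Formulas are nested tuples: ('not', f), ('and', a, b),
    ('or', a, b), ('imp', a, b); atoms are plain strings."""
    return closes([('not', formula)], set(), set())

def closes(branch, pos, neg):
    """True iff every extension of this branch closes."""
    while branch:
        f = branch.pop()
        if isinstance(f, str):                     # literal p
            if f in neg: return True
            pos.add(f)
        elif f[0] == 'and':                        # a AND b -> a, b
            branch += [f[1], f[2]]
        elif f[0] == 'or':                         # a OR b: two branches
            return (closes(branch + [f[1]], set(pos), set(neg))
                    and closes(branch + [f[2]], set(pos), set(neg)))
        elif f[0] == 'imp':                        # a -> b == NOT a OR b
            return (closes(branch + [('not', f[1])], set(pos), set(neg))
                    and closes(branch + [f[2]], set(pos), set(neg)))
        else:                                      # f = ('not', g)
            g = f[1]
            if isinstance(g, str):                 # literal NOT p
                if g in pos: return True
                neg.add(g)
            elif g[0] == 'not':                    # NOT NOT a -> a
                branch.append(g[1])
            elif g[0] == 'and':                    # NOT(a AND b): branches
                return (closes(branch + [('not', g[1])], set(pos), set(neg))
                        and closes(branch + [('not', g[2])], set(pos), set(neg)))
            elif g[0] == 'or':                     # NOT(a OR b) -> NOT a, NOT b
                branch += [('not', g[1]), ('not', g[2])]
            elif g[0] == 'imp':                    # NOT(a -> b) -> a, NOT b
                branch += [g[1], ('not', g[2])]
    return False  # an open, fully expanded branch: not a tautology
```

For example, `is_tautology(('or', 'p', ('not', 'p')))` holds, while `is_tautology(('imp', 'p', 'q'))` does not, because the tableau for its negation leaves an open branch.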
Abstract:
Digital Image
Abstract:
In recent years, concern has arisen over the effects of increasing carbon dioxide (CO2) in the earth's atmosphere due to the burning of fossil fuels. One way to mitigate the increase in atmospheric CO2 concentration and climate change is carbon sequestration in forest vegetation through photosynthesis. Comparable regional-scale estimates of the carbon balance of forests are therefore needed for scientific and political purposes. The aim of the present dissertation was to improve methods for quantifying and verifying inventory-based carbon pool estimates of boreal forests on mineral soils. Ongoing forest inventories provide data based on statistically sound sampling for estimating the level of carbon stocks and stock changes, but improved modelling tools and comparison of methods are still needed. In this dissertation, the entire inventory-based large-scale forest carbon stock assessment method was presented together with some separate methods for enhancing and comparing it. The enhancement methods presented here include ways to quantify the biomass of understorey vegetation as well as to estimate the litter production of needles and branches. In addition, the optical remote sensing method illustrated in this dissertation can be used for comparison with independent data. The forest inventory-based large-scale carbon stock assessment method demonstrated here provided reliable carbon estimates when compared with independent data. Future activity to improve the accuracy of this method could consist of reducing the uncertainties regarding belowground biomass and litter production as well as the soil compartment. The methods developed will serve the needs of UNFCCC reporting and reporting under the Kyoto Protocol. This method is principally intended for analysts or planners interested in quantifying carbon over extensive forest areas.
Abstract:
We propose a novel technique for robust voiced/unvoiced segment detection in noisy speech, based on local polynomial regression. The local polynomial model is well-suited for voiced segments in speech. The unvoiced segments are noise-like and do not exhibit any smooth structure. This property of smoothness is used for devising a new metric called the variance ratio metric, which, after thresholding, indicates the voiced/unvoiced boundaries with 75% accuracy at 0 dB global signal-to-noise ratio (SNR). A novelty of our algorithm is that it processes the signal continuously, sample-by-sample rather than frame-by-frame. Simulation results on the TIMIT speech database (downsampled to 8 kHz) for various SNRs are presented to illustrate the performance of the new algorithm. Results indicate that the algorithm is robust even at high noise levels.
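The smoothness argument can be illustrated with a toy version of a variance-ratio metric. The exact metric, window length and polynomial order in the paper are not reproduced here; the form below (residual variance of a local polynomial fit divided by raw variance) is an assumption for illustration. Smooth, voiced-like segments yield a small ratio, noise-like segments a ratio near one.

```python
import numpy as np

def variance_ratio(signal, win=80, order=2):
    """Sliding-window variance-ratio sketch (assumed form of the
    metric).  For each sample, fit a local polynomial over a window
    and compare the residual variance to the raw variance."""
    n = len(signal)
    t = np.arange(win)
    ratios = np.ones(n)                       # trailing samples default to 1
    for i in range(n - win):
        seg = signal[i:i + win]
        coef = np.polyfit(t, seg, order)      # local polynomial fit
        resid = seg - np.polyval(coef, t)
        total = np.var(seg)
        if total > 0:
            ratios[i] = np.var(resid) / total  # in [0, 1]
    return ratios

# Smooth sinusoid (voiced-like) vs. white noise (unvoiced-like).
rng = np.random.default_rng(0)
voiced = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 800))
unvoiced = rng.standard_normal(800)
```

Thresholding such a ratio separates the two signal types; the sample-by-sample loop mirrors the continuous (non-frame-based) processing the abstract emphasizes.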
Abstract:
Notched three-point bend (TPB) specimens were tested under crack mouth opening displacement (CMOD) control at a rate of 0.0004 mm/s, and the entire fracture process was simulated using a regular triangular two-dimensional lattice network only over the expected fracture process zone width. The rest of the beam specimen was discretised by a coarse triangular finite element mesh. The discrete grain structure of the concrete was generated assuming the grains to be spherical. The load versus CMOD plots thus simulated agreed reasonably well with the experimental results. Moreover, acoustic emission (AE) hits were recorded during the test and compared with the number of fractured lattice elements. It was found that the cumulative AE hits correlated well with the cumulative fractured lattice elements at all load levels, thus providing a useful means for predicting when micro-cracks form during the fracturing process, both in the pre-peak and post-peak regimes.
Abstract:
We address a new problem of improving automatic speech recognition performance given multiple utterances of patterns from the same class. We formulate the problem of jointly decoding K multiple patterns given a single hidden Markov model. It is shown that such a solution is possible by aligning the K patterns using the proposed Multi Pattern Dynamic Time Warping algorithm, followed by the Constrained Multi Pattern Viterbi Algorithm. The new formulation is tested in the context of speaker-independent isolated word recognition for both clean and noisy patterns. When 10 percent of the speech is affected by burst noise at -5 dB signal-to-noise ratio (local), it is shown that joint decoding using only two noisy patterns reduces the noisy speech recognition error rate by about 51 percent relative to single-pattern decoding using the Viterbi algorithm. In contrast, a simple maximization of individual pattern likelihoods provides only about a 7 percent reduction in error rate.
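The alignment step underlying this approach is dynamic time warping. The sketch below is plain two-sequence DTW on 1-D patterns with an absolute-difference local cost, not the paper's Multi Pattern DTW or its constrained Viterbi decoder; it only illustrates the warping-path machinery those algorithms build on.

```python
import numpy as np

def dtw_align(a, b):
    """Classic dynamic time warping between two 1-D sequences.
    Returns the cumulative alignment cost and the warping path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path from the end.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else \
               (i - 1, j) if step == 1 else (i, j - 1)
    return D[n, m], path[::-1]
```

Aligning `[1, 2, 3]` with `[1, 2, 2, 3]` yields zero cost, since the warp can hold the `2` while the longer pattern repeats it; joint decoding exploits exactly this kind of frame correspondence between repeated utterances.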
Abstract:
In this paper, new results and insights are derived for the performance of multiple-input, single-output systems with beamforming at the transmitter, when the channel state information is quantized and sent to the transmitter over a noisy feedback channel. It is assumed that there exists a per-antenna power constraint at the transmitter; hence, the equal gain transmission (EGT) beamforming vector is quantized and sent from the receiver to the transmitter. The loss in received signal-to-noise ratio (SNR) relative to perfect beamforming is analytically characterized, and it is shown that at high rates, the overall distortion can be expressed as the sum of the quantization-induced distortion and the channel error-induced distortion, and that the asymptotic performance depends on the error-rate behavior of the noisy feedback channel as the number of codepoints gets large. The optimum density of codepoints (also known as the point density) that minimizes the overall distortion subject to a boundedness constraint is shown to be the same as the point density for a noiseless feedback channel, i.e., the uniform density. The binary symmetric channel with random index assignment is a special case of the analysis, and it is shown that as the number of quantized bits gets large, the distortion approaches that obtained with random beamforming. The accuracy of the theoretical expressions obtained is verified through Monte Carlo simulations.
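The quantization-induced part of the SNR loss is easy to probe by Monte Carlo. The sketch below assumes a noiseless feedback link, uniform per-antenna phase quantization, and arbitrary parameter values; it does not reproduce the paper's analysis of the noisy feedback channel, only the basic EGT quantization effect.

```python
import numpy as np

rng = np.random.default_rng(1)

def egt_snr_loss(n_tx=4, bits=3, trials=2000):
    """Monte Carlo sketch: mean relative received-SNR loss when the
    equal gain transmission (EGT) phases are uniformly quantized to
    `bits` bits per antenna (noiseless feedback assumed)."""
    levels = 2 ** bits
    loss = 0.0
    for _ in range(trials):
        # Rayleigh-fading MISO channel.
        h = (rng.standard_normal(n_tx) + 1j * rng.standard_normal(n_tx)) \
            / np.sqrt(2)
        phase = np.angle(h)
        q = np.round(phase / (2 * np.pi / levels)) * (2 * np.pi / levels)
        # Coherent gain with perfect vs. quantized EGT phases.
        perfect = np.abs(np.sum(h * np.exp(-1j * phase))) ** 2 / n_tx
        quantized = np.abs(np.sum(h * np.exp(-1j * q))) ** 2 / n_tx
        loss += (perfect - quantized) / perfect
    return loss / trials   # mean relative SNR loss in [0, 1]
```

Running this with 3 feedback bits per antenna gives a much smaller loss than with 1 bit, matching the intuition that finer codebooks shrink the quantization-induced distortion term.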
Abstract:
We present a case study of the formal verification of a full-wave rectifier for analog and mixed-signal designs. We used the Checkmate tool from CMU [1], a public-domain formal verification tool for hybrid systems. The restrictions imposed by Checkmate necessitated changes to its implementation in order to handle this complex and non-linear system. The full-wave rectifier was implemented using Checkmate custom blocks and Simulink blocks from MathWorks' MATLAB. After establishing the required changes in the Checkmate implementation, we are able to efficiently verify the safety properties of the full-wave rectifier.