961 results for Perceptual Speech Evaluation
Abstract:
In recognition-based user interfaces, user satisfaction is determined not only by recognition accuracy but also by the effort required to correct recognition errors. In this paper, we introduce a crossmodal error correction technique that allows users to correct Chinese handwriting recognition errors by speech. The focus of the paper is a multimodal fusion algorithm that supports crossmodal error correction. By fusing handwriting and speech recognition, the algorithm can correct errors in both character extraction and character recognition of handwriting. The experimental results indicate that the algorithm is effective and efficient. Moreover, the evaluation shows that the technique helps users correct handwriting recognition errors more efficiently than two other error correction techniques.
Abstract:
Background: Ototoxicity is a known side effect of combined radiation therapy and cisplatin chemotherapy for the treatment of medulloblastoma. The delivery of an involved field boost by intensity modulated radiation therapy (IMRT) may reduce the dose to the inner ear when compared with conventional radiotherapy. The dose of cisplatin may also affect the risk of ototoxicity. A retrospective study was performed to evaluate the impact of involved field boost using IMRT and of cisplatin dose on the rate of ototoxicity. Methods: Data from 41 medulloblastoma patients treated with IMRT were collected. Overall and disease-free survival rates were calculated by the Kaplan-Meier method. Hearing function was graded according to the toxicity criteria of the Pediatric Oncology Group (POG). Doses to the inner ear and total cisplatin dose were correlated with hearing function by univariate and multivariate data analysis. Results: After a mean follow-up of 44 months (range: 14 to 72 months), 37 patients remained alive, with two recurrences, both in the spine with CSF involvement, resulting in disease-free survival and overall survival rates of 85.2% and 90.2%, respectively. Seven patients (17%) experienced POG Grade 3 or 4 toxicity. Cisplatin dose was a significant factor for hearing loss in univariate analysis (p < 0.03). In multivariate analysis, median dose to the inner ear was significantly associated with hearing loss (p < 0.01). POG Grade 3 and 4 toxicity were uncommon with median doses to the inner ear below 42 Gy (p < 0.05) and a total cisplatin dose of less than 375 mg/m² (p < 0.01). Conclusions: IMRT leads to a low rate of severe ototoxicity. Median radiation dose to the auditory apparatus should be kept below 42 Gy. Cisplatin doses should not exceed 375 mg/m².
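The Kaplan-Meier method named in the Methods can be illustrated with a minimal sketch. The function name `kaplan_meier` and the toy follow-up data in the usage note are hypothetical and are not taken from the study; the code only shows the standard product-limit calculation.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time for each patient (for example, in months)
    events : 1 if the event (death or recurrence) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)      # patients still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0         # events observed exactly at time t
        leaving = 0          # everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

With three hypothetical patients followed for 1, 2 and 3 months, where the second is censored, `kaplan_meier([1, 2, 3], [1, 0, 1])` yields a survival of 2/3 after month 1 and 0 after month 3.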
Abstract:
The original article is available as an open access file on the Springer website at the following link: http://link.springer.com/article/10.1007/s10639-015-9388-2
Abstract:
This paper provides a summary of our studies on robust speech recognition based on a new statistical approach: the probabilistic union model. We consider speech recognition given that part of the acoustic features may be corrupted by noise. The union model is a method for basing the recognition on the clean part of the features, thereby reducing the effect of the noise on recognition. In this respect, the union model is similar to the missing feature method. However, the two methods achieve this end through different routes. The missing feature method usually requires the identity of the noisy data for noise removal, while the union model combines the local features based on the union of random events, to reduce the dependence of the model on information about the noise. We previously investigated the applications of the union model to speech recognition involving unknown partial corruption in frequency band, in time duration, and in feature streams. Additionally, a combination of the union model with conventional noise-reduction techniques was studied, as a means of dealing with a mixture of known or trainable noise and unknown unexpected noise. In this paper, we provide a unified review of each of these applications in the context of unknown partial feature corruption, giving the appropriate theory and implementation algorithms, along with an experimental evaluation.
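The idea of combining local features through a union of random events can be sketched as follows. This is an assumption about the general form, since the abstract gives no formula: an order-M union over N feature streams is approximated by summing the products of every subset of N - M stream likelihoods, so that up to M corrupted streams can be left out without knowing which ones they are. The function name `union_likelihood` is hypothetical.

```python
from itertools import combinations
from math import prod


def union_likelihood(stream_likelihoods, order):
    """Order-`order` union score over per-stream likelihoods (illustrative).

    stream_likelihoods : likelihood of each local feature stream for one class
    order              : assumed maximum number of corrupted streams (M)
    """
    n = len(stream_likelihoods)
    if not 0 <= order < n:
        raise ValueError("order must satisfy 0 <= order < number of streams")
    keep = n - order
    # Sum over every way of keeping `keep` streams; a subset made only of
    # clean streams dominates the sum when the corrupted ones score low.
    return sum(prod(subset) for subset in combinations(stream_likelihoods, keep))
```

With order 0 the score reduces to the ordinary product of all stream likelihoods, which is the conventional full-band combination.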
Abstract:
In this paper, a novel video-based multimodal biometric verification scheme using subspace-based low-level feature fusion of face and speech is developed for specific speaker recognition in perceptual human-computer interaction (HCI). In the proposed scheme, the human face is tracked and the face pose is estimated to weight the detected face-like regions in successive frames, where ill-posed faces and false-positive detections are assigned lower credit to enhance accuracy. In the audio modality, mel-frequency cepstral coefficients are extracted for voice-based biometric verification. In the fusion step, features from both modalities are projected into a nonlinear Laplacian Eigenmap subspace for multimodal speaker recognition and combined at low level. The proposed approach is tested on a video database of ten human subjects, and the results show that the proposed scheme attains better accuracy than conventional multimodal fusion using latent semantic analysis as well as single-modality verification. The experiment in MATLAB shows the potential of the proposed scheme to attain real-time performance for perceptual HCI applications.
Abstract:
Across languages, children with developmental dyslexia have a specific difficulty with the neural representation of the sound structure (phonological structure) of speech. One likely cause of their difficulties with phonology is a perceptual difficulty in auditory temporal processing (Tallal, 1980). Tallal (1980) proposed that basic auditory processing of brief, rapidly successive acoustic changes is compromised in dyslexia, thereby affecting phonetic discrimination (e.g. discriminating /b/ from /d/) via impaired discrimination of formant transitions (rapid acoustic changes in frequency and intensity). However, an alternative auditory temporal hypothesis is that the basic auditory processing of the slower amplitude modulation cues in speech is compromised (Goswami et al., 2002). Here, we contrast children's perception of a synthetic speech contrast (ba/wa) when it is based on the rate of change of frequency information (formant transition duration) versus the rate of change of amplitude modulation (rise time). We show that children with dyslexia have excellent phonetic discrimination based on formant transition duration, but poor phonetic discrimination based on envelope cues. The results explain why phonetic discrimination may be allophonic in developmental dyslexia (Serniclaes et al., 2004), and suggest new avenues for the remediation of developmental dyslexia. © 2010 Blackwell Publishing Ltd.
Abstract:
A modified comb filtering technique is proposed which can be used to reduce framing noise generated when speech signals are transform-coded or vector-quantized. Application of this filter to 9.6 kbit/s speech in a vector transform coder has been found to improve the perceptual quality of the coded speech.
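For readers unfamiliar with comb filtering, a plain feedforward comb filter can be sketched as below. This is the textbook form y[n] = x[n] + g * x[n - D], not the modified filter proposed in the paper, whose details the abstract does not give. The function name and parameters are illustrative.

```python
def comb_filter(samples, delay, gain):
    """Plain feedforward comb filter: y[n] = x[n] + gain * x[n - delay].

    samples : list of input samples
    delay   : delay D in samples (for example, one pitch period or frame length)
    gain    : weight applied to the delayed sample
    """
    output = []
    for n, x in enumerate(samples):
        # Samples before the start of the signal are treated as zero.
        delayed = samples[n - delay] if n >= delay else 0.0
        output.append(x + gain * delayed)
    return output
```

Feeding a unit impulse through the filter, for example `comb_filter([1.0, 0.0, 0.0, 0.0], 2, 0.5)`, returns the impulse followed by a scaled copy two samples later, which is what gives the frequency response its evenly spaced peaks and troughs.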
Abstract:
Research has been undertaken to investigate the use of artificial neural network (ANN) techniques to improve the performance of a low bit-rate vector transform coder. Considerable improvements in the perceptual quality of the coded speech have been obtained. New ANN-based methods for vector quantiser (VQ) design and for the adaptive updating of VQ codebooks are introduced for use in speech coding applications.
Abstract:
The subjective performance of the G.722 7-kHz wideband speech coding recommendation using music signals is described. A number of audible distortions specific to music signals were found to be present in real-time evaluations of the coder. As a result, three modifications are proposed which are found to improve the performance for music signals. These modifications are compatible with the G.722 system configuration. Modifications made to G.722 to alleviate the most serious aspects of the noise modulation are described: (1) an adaptive bit allocation scheme is used to reduce short and long-term nonoptimality; (2) spectral noise shaping is incorporated, significantly enhancing the subjective performance of certain modes; and (3) backward block adaptive prediction is used.
Abstract:
The subjective performance of the G.722 7-kHz wideband speech-coding recommendation using music signals is described. A number of audible distortions specific to music signals were found to be present in real-time evaluations of the coder. As a result, three modifications are proposed which are found to improve the performance for music signals. These modifications are compatible with the G.722 system configuration. The results obtained clearly demonstrate the very high coding efficiency of subband ADPCM (adaptive differential pulse-code modulation) in comparison with digitally companding and ADM schemes when applied to music signals.
Abstract:
This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances, with corruption added in either or both of the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison with another well-known dynamic stream weighting approach and with any fixed-weight integration approach, both in clean conditions and when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.
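The frame-by-frame weight selection can be illustrated with a rough sketch. It rests on an assumption about the general idea rather than on the paper's exact formulation: for each frame, candidate weights are tried on a grid, the audio and video likelihoods are combined as a^w * v^(1 - w), and the weight giving the largest class posterior is kept. The function names and the grid size are hypothetical.

```python
def weighted_posteriors(audio_lik, video_lik, w):
    """Class posteriors from exponent-weighted stream likelihoods (equal priors)."""
    scores = [(a ** w) * (v ** (1.0 - w)) for a, v in zip(audio_lik, video_lik)]
    total = sum(scores)
    return [s / total for s in scores]


def select_weight(audio_lik, video_lik, grid_size=11):
    """Pick the audio weight in [0, 1] that maximises the top class posterior.

    audio_lik, video_lik : per-class likelihoods for a single frame
    grid_size            : number of candidate weights (assumed, not from the paper)
    """
    best_w, best_post = None, None
    for i in range(grid_size):
        w = i / (grid_size - 1)
        post = weighted_posteriors(audio_lik, video_lik, w)
        if best_post is None or max(post) > max(best_post):
            best_w, best_post = w, post
    return best_w, best_post
```

No noise measurement enters the selection: when one stream is uninformative for a frame, for example video likelihoods of `[0.5, 0.5]` against audio likelihoods of `[0.9, 0.1]`, the search settles on the weight that relies entirely on the other stream.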
Abstract:
Motivated by the central aim of contributing to the long-term construction of a complete text-to-speech system based on articulatory synthesis, we developed a linguistic model for European Portuguese (EP), based on the TADA (TAsk Dynamic Application) system, aimed at automatically obtaining the articulator trajectories from the input text. Achieving this goal required a set of tasks, namely: 1) the implementation and evaluation of two systems for automatic syllabification and phonetic transcription, in order to transform the input text into a format suitable for TADA; 2) the creation of a gestural dictionary for the sounds of EP, so that each phone output by the grapheme-to-phone converter could be matched to a set of articulatory gestures adapted to EP; 3) the analysis of the phenomenon of nasality in light of the dynamic principles of Articulatory Phonology (AP), based on an articulatory and perceptual study. The two automatic syllabification algorithms implemented and tested drew on phonological knowledge about syllable structure, the first being based on finite-state transducers and the second being a faithful implementation of the proposals of Mateus & d'Andrade (2000). The performance of these algorithms, especially the second, proved similar to that of other systems with the same capabilities. For grapheme-to-phone conversion, we followed a methodology based on rewrite rules combined with a machine learning technique. The evaluation results for this system motivated the subsequent exploration of other automatic methods, which also sought to assess the impact of integrating syllabic information into the systems.
The dynamic description of the sounds of EP, anchored in the theoretical and methodological principles of AP, was based essentially on the analysis of magnetic resonance imaging data, from which all measurements were made, with a view to obtaining quantitative articulatory parameters. A first validation of the various proposed gestural configurations was attempted through a small perceptual test, which made it possible to identify the main problems underlying the gestural proposal. This work led, for the first time for EP, to the development of a first articulatory-based text-to-speech system. The dynamic description of the nasal vowels drew both on the magnetic resonance data, to characterise the oral gestures, and on data obtained through electromagnetic articulography (EMA), to study the dynamics of the velum and its relation to the other articulators. In addition, a perceptual test was carried out, using TADA and SAPWindows, to assess the sensitivity of Portuguese listeners to variations in velum height and to changes in intergestural coordination. This study served as the basis for an abstract interpretation (in gestural terms) of the EP nasal vowels and also clarified crucial aspects of their production and perception.
Abstract:
Picture Exchange Communication System (PECS) is an augmentative and alternative communication system that improves communication and decreases problem behaviors in children with Developmental Disabilities and Autism. The mediator model is a validated approach that clinicians use to train parents to perform evidence-based interventions. Parental non-adherence to treatment recommendations is a documented problem. This qualitative study investigated clinician-perceived factors that influence parental adherence to PECS recommendations. Three focus groups (n=8) were conducted with Speech Language Pathologists and Behavior Therapists experienced in providing parents with PECS recommendations. Constant comparison analysis was used. In general, clinicians believed that PECS was complex to implement. Thirty-one bridges were identified to overcome complexity. Twenty-two barriers and 6 other factors also impacted parental adherence. Strategies to address these factors were proposed based on a review of the literature. Future research will be performed to validate these findings using parents and a larger sample size.
Abstract:
This thesis investigates the potential use of Linear Predictive Coding in speech communication applications. A Modified Block Adaptive Predictive Coder is developed, which reduces the computational burden and complexity without sacrificing speech quality, compared with the conventional adaptive predictive coding (APC) system. To this end, the evaluation methods have been changed. The method differs from the usual APC system in that the difference between the true and the predicted value is not transmitted. This allows the high-order predictor in the transmitter section of a predictive coding system to be replaced by a simple delay unit, which makes the transmitter quite simple. Also, the block length used in processing the speech signal is adjusted relative to the pitch period of the signal being processed, rather than being fixed at a constant length as in earlier work. The efficiency of the proposed coder is supported by results of computer simulation using real speech data. Three methods for voiced/unvoiced/silent/transition classification are presented. The first is based on energy, zero-crossing rate and the periodicity of the waveform. The second uses the normalised correlation coefficient as the main parameter, while the third uses a pitch-dependent correlation factor. The third algorithm, which gives the minimum error probability, is chosen in a later chapter to design the modified coder. The thesis also presents a comparative study between the autocorrelation and the covariance methods used in the evaluation of the predictor parameters. It is shown that the autocorrelation method is superior to the covariance method with respect to filter stability and also in an SNR sense, though the increase in gain is only small.
The Modified Block Adaptive Coder switches from pitch prediction to spectrum prediction when the speech segment changes from a voiced or transition region to an unvoiced region. The experiments conducted in coding, transmission and simulation used speech samples from Malayalam and English phrases. Proposals for a speaker recognition system and a phoneme identification system are also outlined towards the end of the thesis.
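The autocorrelation method for evaluating predictor parameters, which the thesis compares with the covariance method, can be sketched with the standard Levinson-Durbin recursion. This is a generic textbook version, not the thesis's own implementation; the function names are illustrative, and the predictor convention assumed is x̂[n] = a1 * x[n-1] + ... + ap * x[n-p].

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for the predictor.

    r     : autocorrelation values r[0..order], with r[0] > 0
    order : predictor order p
    Returns (coefficients a1..ap, reflection coefficients, residual energy).
    """
    a = []
    reflection = []
    error = r[0]
    for i in range(1, order + 1):
        # Correlation not yet explained by the order (i - 1) predictor.
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / error
        # Update the existing coefficients and append the new one.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        reflection.append(k)
        error *= 1.0 - k * k
    return a, reflection, error
```

Every reflection coefficient produced from a valid autocorrelation sequence has magnitude at most 1, and strictly below 1 when the autocorrelation matrix is positive definite, which is the usual explanation for the stability advantage of the autocorrelation method mentioned in the abstract.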