Incorporating visual information for spoken term detection


Author(s): Kalantari, Shahram; Dean, David; Sridharan, Sridha

Date(s)

2015

Abstract

Spoken term detection (STD) is the task of looking up a spoken term in a large volume of speech segments. To provide fast search, speech segments are first indexed into an intermediate representation using speech recognition engines, which provide multiple hypotheses for each speech segment. Approximate matching techniques are usually applied at the search stage to compensate for the poor performance of automatic speech recognition engines during indexing. Recently, using visual information in addition to audio information has been shown to improve phone recognition performance, particularly in noisy environments. In this paper, we make use of visual information, in the form of the speaker's lip movements, in the indexing stage and investigate its effect on STD performance. In particular, we investigate whether gains in phone recognition accuracy carry through the approximate matching stage to provide similar gains in the final audio-visual STD system over a traditional audio-only approach. We also investigate the effect of using visual information on STD performance in different noise environments.
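
The index-then-approximate-match idea described above can be illustrated with a minimal sketch. This is not the authors' DMLS system: it assumes each segment is indexed as a single best phone sequence (rather than multiple hypotheses), and it uses plain Levenshtein distance over fixed-length windows, so it mainly tolerates substitution errors. The function names, phone symbols, and the `max_dist` threshold are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences (lists of strings)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def search_term(term, indexed, max_dist=1):
    """Return (segment_id, start, distance) for every window of the index
    whose phone sequence is within max_dist edits of the query term."""
    hits = []
    n = len(term)
    for seg_id, phones in indexed.items():
        for start in range(len(phones) - n + 1):
            d = edit_distance(term, phones[start:start + n])
            if d <= max_dist:
                hits.append((seg_id, start, d))
    return hits


# Hypothetical index: one recognised phone sequence per speech segment.
# The first phone of seg1 is a deliberate recognition error ("h" for "hh").
index = {
    "seg1": ["h", "ax", "l", "ow", "w", "er", "l", "d"],
    "seg2": ["y", "eh", "l", "ow"],
}
term = ["hh", "ax", "l", "ow"]
hits = search_term(term, index, max_dist=1)
print(hits)
```

With `max_dist=1` the misrecognised occurrence in `seg1` is still found, while `seg2`, which differs in two phones, is rejected. Raising the threshold trades more detections for more false alarms.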

Format

application/pdf

Identifier

http://eprints.qut.edu.au/86034/

Publisher

International Speech Communication Association

Relation

http://eprints.qut.edu.au/86034/1/Audio%20visual%20STD.pdf

http://www.isca-speech.org/archive/interspeech_2015/i15_0558.html

Kalantari, Shahram, Dean, David, & Sridharan, Sridha (2015) Incorporating visual information for spoken term detection. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, Interspeech 2015, International Speech Communication Association, Maritim International Congress Center, Dresden, Germany, pp. 558-562.

Rights

Copyright 2015 [Please consult the author]

Source

School of Electrical Engineering & Computer Science; Science & Engineering Faculty

Keywords: #Spoken term detection #keyword spotting #audio visual phone recognition #DMLS system

Type

Conference Paper