LZW Based Distance Measures for Spoken Language Identification


Author(s): Basavaraja, SV; Sreenivas, TV
Date(s)

20/11/2006

Abstract

We present a new approach to spoken language modeling for language identification (LID) using the Lempel-Ziv-Welch (LZW) algorithm. The LZW technique is applicable to any kind of tokenization of the speech signal. Because the LZW algorithm efficiently extracts variable-length symbol strings from the training data, the LZW codebook captures the essentials of a language effectively. We develop two new deterministic measures for LID based on the LZW algorithm: (i) a compression ratio score (LZW-CR) and (ii) a weighted discriminant score (LZW-WDS). To assess these measures, we consider error-free tokenization of speech as well as artificially induced noise in the tokenization. For a six-language LID task on the OGI-TS database with clean tokenization, the new model (LZW-WDS) performs slightly better than the conventional bigram model. For noisy tokenization, which is the more realistic case, LZW-WDS significantly outperforms the bigram technique.
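The abstract does not give the formulas behind LZW-CR or LZW-WDS, so the following Python sketch illustrates only the general idea of scoring an utterance by how well a per-language LZW codebook compresses it. It is a minimal illustration under stated assumptions, not the authors' method: the function names, the "tokens per emitted code" definition of the ratio, and the handling of unseen tokens are all hypothetical choices made here.

```python
from typing import Dict, Sequence, Tuple

Codebook = Dict[Tuple[str, ...], int]


def build_lzw_codebook(tokens: Sequence[str]) -> Codebook:
    """Grow an LZW dictionary of variable-length token strings from training tokens."""
    codebook: Codebook = {}
    # Seed the dictionary with every single token seen in training.
    for t in tokens:
        if (t,) not in codebook:
            codebook[(t,)] = len(codebook)
    phrase: Tuple[str, ...] = ()
    for t in tokens:
        candidate = phrase + (t,)
        if candidate in codebook:
            phrase = candidate  # keep extending the current match
        else:
            codebook[candidate] = len(codebook)  # learn a new, longer string
            phrase = (t,)
    return codebook


def compression_ratio(tokens: Sequence[str], codebook: Codebook) -> float:
    """Tokens per emitted code when parsing with a frozen codebook.

    Assumption: the codebook is not updated at test time, and a token
    absent from the codebook is emitted as one code of its own.
    """
    n_codes = 0
    phrase: Tuple[str, ...] = ()
    for t in tokens:
        candidate = phrase + (t,)
        if candidate in codebook:
            phrase = candidate
        else:
            if phrase:
                n_codes += 1  # emit the longest match found so far
            phrase = (t,)
    if phrase:
        n_codes += 1  # emit the final pending phrase
    return len(tokens) / n_codes if n_codes else 0.0


def identify(tokens: Sequence[str], codebooks: Dict[str, Codebook]) -> str:
    """Pick the language whose codebook compresses the utterance best."""
    return max(codebooks, key=lambda lang: compression_ratio(tokens, codebooks[lang]))


# Toy token streams standing in for tokenized speech (hypothetical data).
codebooks = {
    "lang_a": build_lzw_codebook("a b a b a b a b".split()),
    "lang_b": build_lzw_codebook("c d c d c d c d".split()),
}
best = identify("a b a b".split(), codebooks)
```

In this toy setup, an utterance drawn from the same token patterns as one language's training stream parses into fewer, longer codebook strings under that language's codebook, so its ratio is higher there. A weighted discriminant score such as LZW-WDS would need the definition from the paper itself and is not sketched here.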

Format

application/pdf

Identifier

http://eprints.iisc.ernet.in/42029/1/LZW_BASED_DISTANCE.pdf

Basavaraja, SV and Sreenivas, TV (2006) LZW Based Distance Measures for Spoken Language Identification. In: IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, 28-30 June 2006, San Juan.

Publisher

IEEE

Relation

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4013520&tag=1

http://eprints.iisc.ernet.in/42029/

Keywords

#Electrical Communication Engineering

Type

Conference Paper

PeerReviewed