Language identification based on a discriminative text categorization technique


Author(s): Caraballo Morcillo, Miguel Ángel; D'haro Enríquez, Luis Fernando; Córdoba Herralde, Ricardo de; San Segundo Hernández, Rubén; Pardo Muñoz, José Manuel

Date(s)

2012

Abstract

In this paper, we describe new results and improvements to a language identification (LID) system based on PPRLM, previously introduced in [1] and [2]. As parallel phone recognizers we use those provided by the Brno University of Technology for Czech, Hungarian, and Russian, and instead of traditional n-gram language models we use a language model built from a ranking of the most frequent and discriminative n-grams. In this approach, the distance between the ranking for the input sentence and the ranking for each language is computed from the difference in the relative positions of each n-gram. This approach can reliably model longer-span information than traditional language models, yielding more reliable estimates. We also describe the modifications that we have introduced over time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clustering technique for the ranking scores. Results show that this technique provides a 12.9% relative improvement over PPRLM. Finally, we also describe results where the traditional PPRLM and our ranking technique are combined.
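The ranking distance described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it ranks n-grams by raw frequency only (the discriminative formulas and the score clustering mentioned above are not reproduced), and the function names, the `max_n` and `template_size` parameters, and the maximum-penalty rule for n-grams missing from a language template are all assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def collapse_repeats(phones):
    """Drop consecutive duplicate phones, e.g. a a b -> a b."""
    return [p for p, _ in groupby(phones)]


def ngram_ranking(phones, max_n=3, template_size=100):
    """Map the most frequent n-grams (orders 1..max_n) to their rank, 0 = most frequent."""
    counts = Counter(
        tuple(phones[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(phones) - n + 1)
    )
    # Sort by descending count, then by n-gram, so ties break deterministically.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {g: rank for rank, g in enumerate(ordered[:template_size])}


def ranking_distance(input_rank, lang_rank, template_size=100):
    """Sum of rank differences; n-grams absent from the language template
    receive an assumed maximum penalty equal to template_size."""
    return sum(
        abs(r - lang_rank[g]) if g in lang_rank else template_size
        for g, r in input_rank.items()
    )


def identify(phones, templates, max_n=3, template_size=100):
    """Return the language whose template ranking is closest to the input ranking."""
    input_rank = ngram_ranking(collapse_repeats(phones), max_n, template_size)
    return min(
        templates,
        key=lambda lang: ranking_distance(input_rank, templates[lang], template_size),
    )
```

In a PPRLM-style setup, one such set of templates would presumably be built per phone recognizer from its training transcriptions, and `identify` would be applied to the phone sequence that recognizer produces for the input utterance.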

Format

application/pdf

Identifier

http://oa.upm.es/20380/

Language(s)

eng

Publisher

E.T.S.I. Telecomunicación (UPM)

Relation

http://oa.upm.es/20380/1/INVE_MEM_2012_134194.pdf

Rights

http://creativecommons.org/licenses/by-nc-nd/3.0/es/

info:eu-repo/semantics/openAccess

Source

IberSPEECH 2012 - VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop | 21/11/2012 - 22/11/2012 | Madrid, Spain

Keywords

#Telecommunications

Type

info:eu-repo/semantics/conferenceObject

Conference or Workshop Paper

PeerReviewed