A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis


Author(s): Gallardo Antolín, Ascensión; Montero Martínez, Juan Manuel; King, Simon
Date(s)

2014

Abstract

Traditional Text-To-Speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music, and noise) by filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection that improve the precision and overall quality of the segmentation.

Format

application/pdf

Identifier

http://oa.upm.es/37500/

Language(s)

eng

Publisher

E.T.S.I. Telecomunicación (UPM)

Relation

http://oa.upm.es/37500/1/INVE_MEM_2014_193698.pdf

info:eu-repo/grantAgreement/EC/FP7/287678

Rights

http://creativecommons.org/licenses/by-nc-nd/3.0/es/

info:eu-repo/semantics/openAccess

Source

Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014) | 14/09/2014 - 18/09/2014 | Singapore

Keywords #Telecommunications
Type

info:eu-repo/semantics/conferenceObject

Conference or Workshop Paper

PeerReviewed