A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis
Date(s) |
2014
|
---|---|
Abstract |
Traditional Text-To-Speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music and noise), filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection that improve the precision and overall quality of the segmentation. |
Formato |
application/pdf |
Identifier | |
Language(s) |
eng |
Publisher |
E.T.S.I. Telecomunicación (UPM) |
Relation |
http://oa.upm.es/37500/1/INVE_MEM_2014_193698.pdf info:eu-repo/grantAgreement/EC/FP7/287678 |
Rights |
http://creativecommons.org/licenses/by-nc-nd/3.0/es/ info:eu-repo/semantics/openAccess |
Source |
Proceedings 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014) | 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014) | 14/09/2014 - 18/09/2014 | Singapore |
Keywords | #Telecomunicaciones |
Type |
info:eu-repo/semantics/conferenceObject Presentation at Congress or Conference PeerReviewed |