870 results for Multilingual Corpus
Abstract:
In this paper we present the process of designing an efficient speech corpus for the first unit selection speech synthesis system for Bulgarian, along with some significant preliminary results regarding the quality of the resulting system. As the initial corpus is a crucial factor in the quality delivered by a Text-to-Speech (TTS) system, special effort has been devoted to designing a complete and efficient corpus for use in a unit selection TTS system. The targeted domain of the TTS system, and hence of the corpus, is news reports; although this domain is restricted, it is characterized by an unlimited vocabulary. The paper focuses on issues regarding the design of an optimal corpus for such a framework and on the ideas on which our approach was based. A novel multi-stage approach is presented, with special attention given to language- and speaker-dependent issues, as they affect the entire process. The paper concludes with the presentation of our results and the evaluation experiments, which provide clear evidence of the quality level achieved. © 2011 Springer-Verlag.
Abstract:
The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to addressing the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (∼10 hours/language), with 4 languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language-independent acoustic model test on the target language showed that, at a minimum, retraining or adapting the acoustic models to the target language is currently needed to achieve reasonable performance. © 2013 IEEE.
Abstract:
Raybould, M. and Sims-Williams, P. (2007). A Corpus of Latin Inscriptions of the Roman Empire containing Celtic personal names. Aberystwyth: CMCS publications. RAE2008
Abstract:
P.M. Hastie and W. Haresign (2006). A role for LH in the regulation of expression of mRNAs encoding components of the insulin-like growth factor (IGF) system in the ovine corpus luteum. Animal Reproduction Science, 96(1-2), 196-209. Sponsorship: DEFRA RAE2008
Abstract:
The Leaving Certificate (LC) is the national, standardised state examination in Ireland necessary for entry to third-level education. It presents a massive, raw corpus of data with the potential to yield invaluable insight into the phenomena of learner interlanguage. With samples of official LC Spanish examination data, this project has compiled a digitised corpus of learner Spanish comprising the written and oral production of 100 candidates. This corpus was then analysed using a specific investigative corpus technique, Computer-aided Error Analysis (CEA, Dagneaux et al., 1998). CEA is powerful in that it greatly facilitates the quantification and analysis of a large learner corpus in digital format. The corpus was both compiled and analysed with the UAM Corpus Tool (O’Donnell 2013). This tool allows for the recording of candidate-specific variables such as grade, examination level, task type and gender, and therefore allows for critical analysis of the corpus as one unit, as separate written and oral subcorpora, and of performance per task, level and gender. This is an interdisciplinary work combining aspects of Applied Linguistics, Learner Corpus Research and Foreign Language (FL) Learning. Beginning with a review of the context of FL learning in Ireland and Europe, I go on to discuss the disciplinary context and theoretical framework for this work and outline the methodology applied. I then perform detailed quantitative and qualitative analyses before combining all research findings and outlining the principal conclusions. This investigation does not make a priori assumptions about the data set, the LC Spanish examination, the context of FLs or any aspect of learner competence. It undertakes to provide the linguistic research community and the domain of Spanish language learning and pedagogy in Ireland with an empirical, descriptive profile of real learner performance, characterising learner difficulty.
Abstract:
Objective: To describe the methodology and selection of quality indicators (QI) to be implemented in the EFFECT (EFFectiveness of Endometrial Cancer Treatment) project. EFFECT aims to monitor the variability in Quality of Care (QoC) of uterine cancer in Belgium, to compare the effectiveness of different treatment strategies to improve the QoC and to check the internal validity of the QI to validate the impact of process indicators on outcome. Methods: A QI list was retrieved from literature, recent guidelines and QI databases. The Belgian Healthcare Knowledge Center methodology was used for the selection process and involved an expert panel rating the QI on 4 criteria. The resulting scores and further discussion produced a final QI list. An online EFFECT module was developed by the Belgian Cancer Registry, including the list of variables required for measuring the QI. Three test phases were performed to evaluate the relevance, feasibility and understanding of the variables and to test the compatibility of the dataset. Results: 138 QI were considered for further discussion and 82 QI were eligible for rating. Based on the rating scores and consensus among the expert panel, 41 QI were considered measurable and relevant. Testing of the data collection enabled optimization of the content and user-friendliness of the dataset and online module. Conclusions: This first Belgian initiative for monitoring the QoC of uterine cancer indicates that the previously used QI selection methodology is reproducible for uterine cancer. The QI list could be applied by other research groups for comparison. © 2013 Elsevier Inc.
Abstract:
The article presents some elements of the procedure set up to process a written corpus of 617 texts (nearly 500,000 words) relating to the Euroregions. Complex and heterogeneous in several respects (technical, linguistic, editorial, generic, enunciative), the corpus poses the major difficulty of handling multilingual data (French, Italian, Spanish, English, German, Dutch). Working with it required dedicated reflection and a modelling approach that we describe as "agile" because of its flexible and iterative character. The analysis platform we developed yields results that are useful for the subsequent qualitative analysis of Euroregional discourse. It couples a proven morphosyntactic analysis tool (TreeTagger) with programs (Perl) and a database (SQLite) developed to optimise simultaneous multilingual queries and the automatic export of results. The features for locating pivot words in context, collecting denominations and detecting repeated segments serve here as guides for setting out the research needs, the problems encountered and the solutions proposed. The analysis of recurring observables, namely the notions of decision and responsibility, illustrates the discussion.
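As a rough illustration of two of the features mentioned in this abstract (pivot words in context and repeated segments), here is a minimal sketch. It is not the authors' platform, which was written in Perl; it uses Python with the standard `sqlite3` module instead. The table layout, the function names and the toy tokens are all hypothetical, and the lemmas are entered by hand where a real pipeline would take them from TreeTagger output.

```python
import sqlite3
from collections import Counter

# Hypothetical toy data: (doc_id, lang, position, form, lemma).
# In a real pipeline the lemma column would come from TreeTagger output.
TOKENS = [
    (1, "fr", 0, "la", "le"),
    (1, "fr", 1, "décision", "décision"),
    (1, "fr", 2, "du", "du"),
    (1, "fr", 3, "conseil", "conseil"),
    (1, "fr", 4, "et", "et"),
    (1, "fr", 5, "la", "le"),
    (1, "fr", 6, "décision", "décision"),
    (1, "fr", 7, "du", "du"),
    (1, "fr", 8, "comité", "comité"),
    (2, "en", 0, "the", "the"),
    (2, "en", 1, "decisions", "decision"),
    (2, "en", 2, "were", "be"),
    (2, "en", 3, "taken", "take"),
]


def build_db(tokens):
    """Load the tokens into an in-memory SQLite table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token "
        "(doc_id INTEGER, lang TEXT, pos INTEGER, form TEXT, lemma TEXT)"
    )
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?)", tokens)
    return conn


def kwic(conn, pivots, window=2):
    """Pivot words in context, for several languages in one call.

    `pivots` maps a language code to the pivot lemma for that language.
    Returns (lang, left context, pivot form, right context) tuples.
    """
    rows = []
    for lang, lemma in sorted(pivots.items()):
        hits = conn.execute(
            "SELECT doc_id, pos, form FROM token "
            "WHERE lang = ? AND lemma = ? ORDER BY doc_id, pos",
            (lang, lemma),
        ).fetchall()
        for doc_id, pos, form in hits:
            ctx = conn.execute(
                "SELECT pos, form FROM token "
                "WHERE doc_id = ? AND pos BETWEEN ? AND ? ORDER BY pos",
                (doc_id, pos - window, pos + window),
            ).fetchall()
            left = " ".join(f for p, f in ctx if p < pos)
            right = " ".join(f for p, f in ctx if p > pos)
            rows.append((lang, left, form, right))
    return rows


def repeated_segments(conn, n=3, min_count=2):
    """Word n-grams (within a document) that occur at least `min_count` times."""
    counts = Counter()
    doc_ids = [r[0] for r in conn.execute(
        "SELECT DISTINCT doc_id FROM token ORDER BY doc_id")]
    for doc_id in doc_ids:
        forms = [r[0] for r in conn.execute(
            "SELECT form FROM token WHERE doc_id = ? ORDER BY pos", (doc_id,))]
        for i in range(len(forms) - n + 1):
            counts[tuple(forms[i:i + n])] += 1
    return {seg: c for seg, c in counts.items() if c >= min_count}
```

Calling `kwic(build_db(TOKENS), {"fr": "décision", "en": "decision"})` queries both languages in a single pass, which is one simple reading of "simultaneous multilingual queries"; the actual query design of the platform is not described in the abstract.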
Abstract:
The aim of this article is twofold: on the one hand, to present a new corpus for examining the emerging phenomenon of cross-border communication in Europe and, on the other, to formulate three questions useful for framing its semantic analysis. Drawing on the Euroregional corpus, which is multilingual and multi-genre, we raise the questions of the dispersion of online discourse, the heterogeneity of the data and the contextualisation of the analysis. Our approach consists in progressively building an analysis model suited to handling the diversity of the texts, sometimes automatically and sometimes manually. Finally, we illustrate the approach by applying it to mobility, a recurring observable in the corpus.