870 results for Multilingual Corpus
Abstract:
Europarl is a large multilingual corpus containing the minutes of the debates at the European Parliament. This article presents a method to extract different corpora from Europarl: monolingual and multilingual comparable corpora, as well as parallel corpora. Using state-of-the-art measures of homogeneity, we show that these corpora are very similar. In addition, we argue that they present many advantages for research in various fields of linguistics and translation studies, and we also discuss some of their limitations. We conclude by reviewing a number of previous studies that made use of these corpora, emphasizing in each case the possibilities offered by Europarl.
Abstract:
Drawing on a multilingual corpus of institutional websites published by the Euroregions, this article explores the discursive markers that tend to legitimize the notion of Euroregional space. The lexicometric and qualitative observations point to compliance with the normative objectives of European policy and to a standardization of Euroregional representations, regardless of the language or the cross-border area issuing the text.
Abstract:
The aim of this paper is to describe the use that professional translators make of corpora as translation resources. First, we briefly review the literature on translation practitioners’ use of corpora in the contexts of both translation training and professional translation. Then we present our survey-based study, analyse the uptake of corpora among Spanish translators and describe the use of this kind of translation resource. The results show that even if corpora are not as frequently used as other kinds of resources, such as dictionaries, there are professional translators who do use corpora, in a variety of ways, in their work. Additionally, non-users do not seem entirely sceptical about corpora. Against that backdrop, translator trainers are invited to continue to report on how corpora can be used as translation resources.
Abstract:
This article briefly reviews multilingual language resources for Bulgarian developed within the framework of several international projects: the first-ever annotated Bulgarian MTE digital lexical resources, a Bulgarian-Polish corpus, a Bulgarian-Slovak parallel and aligned corpus, and a Bulgarian-Polish-Lithuanian corpus. These resources are valuable multilingual datasets for language engineering research and development for the Bulgarian language. Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because natural language is an outstanding part of human cultural values and collective memory, and a bridge between cultures.
Abstract:
Scatter/Gather systems are becoming increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries or machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.
Abstract:
In Portugal, tourism is an economic activity that generates significant revenue, and the promotion of the country in foreign markets relies increasingly on the creation of multilingual websites. This article examines a corpus of texts from the websites of Portugal's Regional Tourism Boards (Regiões de Turismo), in Portuguese, together with their English translations, with the aim of showing how translators add information that is absent from the original text. By analysing this specific feature of official websites translated to promote the destination "Portugal" in foreign markets, the article highlights the importance of translation strategies in the marketing of a tourist destination, since the added information creates a particular image of a region. In theoretical and methodological terms, the article falls within the scope of Corpus Linguistics.
Abstract:
Following the internationalization of contemporary higher education, academic institutions based in non-English speaking countries are increasingly urged to produce content in English to address prospective international students and personnel, as well as to increase their attractiveness. The demand for English translations in the institutional academic domain is consequently increasing at a rate exceeding the capacity of the translation profession. Resources for assisting non-native authors and translators in the production of appropriate texts in L2 are therefore required in order to help academic institutions and professionals streamline their translation workload. Some of these resources include: (i) parallel corpora to train machine translation systems and multilingual authoring tools; and (ii) translation memories for computer-aided tools. The purpose of this study is to create and evaluate reference resources like the ones mentioned in (i) and (ii) through the automatic sentence alignment of a large set of Italian and English as a Lingua Franca (ELF) institutional academic texts given as equivalent but not necessarily parallel (i.e. translated). In this framework, a set of alignment algorithms and alignment tools is examined in order to identify the most profitable one(s) in terms of accuracy and time- and cost-effectiveness. In order to determine the text pairs to align, a sample is selected according to document length similarity (in characters) and subsequently evaluated in terms of extent of noisiness/parallelism, alignment accuracy and content leverageability. The results of these analyses serve as the basis for the creation of an aligned bilingual corpus of academic course descriptions, which is eventually used to create a translation memory in TMX format.
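The selection of candidate text pairs by document length similarity, as described in the abstract above, could look roughly like the following Python sketch. The abstract does not give the actual similarity measure or cut-off, so the ratio-based measure, the function names `length_ratio` and `select_pairs`, and the `threshold` value of 0.8 are all illustrative assumptions, not the study's method.

```python
from typing import List, Tuple


def length_ratio(text_a: str, text_b: str) -> float:
    """Character-length similarity: shorter length divided by longer length.

    Returns a value between 0.0 and 1.0; two empty texts count as identical.
    This measure is an assumption made for illustration only.
    """
    longer = max(len(text_a), len(text_b))
    if longer == 0:
        return 1.0
    return min(len(text_a), len(text_b)) / longer


def select_pairs(
    pairs: List[Tuple[str, str]], threshold: float = 0.8
) -> List[Tuple[str, str]]:
    """Keep only the (Italian, English) pairs whose lengths are similar enough.

    The threshold of 0.8 is a hypothetical value, not one reported in the study.
    """
    return [(it, en) for it, en in pairs if length_ratio(it, en) >= threshold]
```

Pairs that pass such a filter would then go on to sentence alignment; a stricter threshold trades corpus size for a lower expected share of non-parallel text.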
Abstract:
Discourse connectives are lexical items indicating coherence relations between discourse segments. Even though many languages possess a whole range of connectives, important divergences exist cross-linguistically in the number of connectives that are used to express a given relation. For this reason, connectives are not easily paired with a univocal translation equivalent across languages. This paper is a first attempt to design a reliable method to annotate the meaning of discourse connectives cross-linguistically using corpus data. We present the methodological choices made to reach this aim and report three annotation experiments using the framework of the Penn Discourse Treebank.
Abstract:
This paper presents the automatic extension to other languages of TERSEO, a knowledge-based system for the recognition and normalization of temporal expressions originally developed for Spanish. TERSEO was first extended to English through the automatic translation of the temporal expressions. Then, an improved porting process was applied to Italian, where the automatic translation of the temporal expressions from English and from Spanish was combined with the extraction of new expressions from an Italian annotated corpus. Experimental results demonstrate how, while still adhering to the rule-based paradigm, the development of automatic rule translation procedures allowed us to minimize the effort required for porting to new languages. Relying on such procedures, and without any manual effort or previous knowledge of the target language, TERSEO recognizes and normalizes temporal expressions in Italian with good results (72% precision and 83% recall for recognition).
Abstract:
This article briefly reviews a bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus was collected and developed as a result of collaboration within a joint research project between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because natural language is an outstanding part of human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to contrastive studies of the two Slavic languages and will also be a useful resource for language engineering research and development, especially in machine translation.