873 results for Parallel Corpus
Abstract:
In this paper, we question the homogeneity of a large parallel corpus by measuring the similarity between various sub-parts. We compare results obtained using a general measure of lexical similarity based on χ² with results obtained by counting discourse connectives. We argue that discourse connectives provide a more sensitive measure, revealing differences that are not visible with the general measure. We also provide evidence for the existence of specific characteristics that distinguish translated texts from non-translated ones, due to a universal tendency towards explicitation.
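As a rough illustration of the two kinds of measure contrasted in this abstract, the sketch below computes a χ² statistic over the most frequent words of two samples and a per-1,000-token rate of discourse connectives. The function names, the `top_n` cutoff, and the restriction to single-word connectives are assumptions made for illustration; the abstract does not specify the exact formulation the authors used. Both samples are assumed to be non-empty token lists.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both samples.

    Lower values mean the two samples are lexically more similar.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    joint = freq_a + freq_b
    score = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both samples were drawn from the same population.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        score += (freq_a[word] - expected_a) ** 2 / expected_a
        score += (freq_b[word] - expected_b) ** 2 / expected_b
    return score


def connective_rate(tokens, connectives):
    """Occurrences of single-word connectives per 1,000 tokens."""
    hits = sum(1 for token in tokens if token in connectives)
    return 1000.0 * hits / len(tokens)
```

Comparing `connective_rate` across sub-parts of a corpus gives a targeted signal, whereas `chi_square_distance` aggregates over all frequent vocabulary, which is one plausible reason the two measures can disagree.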
Abstract:
False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-level aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words, which uses the Web as a corpus by analyzing the words' local contexts extracted from the text snippets returned by Google searches. The statistical and semantic measures are then combined into an improved algorithm for identifying false friends that achieves results almost twice as good as those of previously known algorithms. The evaluation is performed on identifying cognates between Bulgarian and Russian, but the proposed methods could be adopted for other language pairs for which parallel corpora and bilingual glossaries are available.
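The statistical idea can be sketched as follows: if two similar-looking words rarely appear together in aligned sentence pairs, they are less likely to be translations of each other. The scoring formula below (a Jaccard overlap of the sentence pairs containing each word) and the function name are illustrative assumptions, not the formula from the paper, which the abstract does not give.

```python
def cooccurrence_score(aligned_pairs, word_src, word_tgt):
    """Jaccard overlap of the aligned sentence pairs containing each word.

    aligned_pairs is an iterable of (source_tokens, target_tokens) tuples.
    A low score for an orthographically similar pair hints at a false friend.
    """
    n_src = n_tgt = n_both = 0
    for src_tokens, tgt_tokens in aligned_pairs:
        in_src = word_src in src_tokens
        in_tgt = word_tgt in tgt_tokens
        n_src += in_src
        n_tgt += in_tgt
        n_both += in_src and in_tgt
    union = n_src + n_tgt - n_both
    return n_both / union if union else 0.0
```

In practice such a score would be computed only for candidate pairs that already pass an orthographic similarity filter, and then combined with a semantic measure as the abstract describes.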
Abstract:
The article briefly reviews a bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus was collected and developed as a result of collaboration within a joint research project between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because natural language is an outstanding part of human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to contrastive studies of the two Slavic languages and will also be a useful resource for language engineering research and development, especially in machine translation.
Abstract:
The goal of this dissertation is both to expand EPTIC, the European Parliament Translation and Interpreting Corpus containing EU Parliament plenary speeches in five different languages, and to carry out a case study on the corpus. The corpus was expanded by adding 52 new speeches, in both oral and written form, in a language not previously represented in the corpus, namely Finnish. The case study focuses on the analysis of the English structure “head noun + of + modifier” interpreted into Finnish with the use of the genitive case. As with several previous case studies, this dissertation shows the potential of a corpus such as EPTIC, despite its limited size. It can be used to expand research in the fields of translation and interpreting, but also for didactic purposes. The dissertation is divided into three chapters. The first offers a theoretical background by presenting the notion of “corpus,” the different types of corpora (with a particular focus on intermodal corpora) and an overview of corpus-based translation and interpreting studies. The second chapter focuses on the EPTIC corpus, that is, on its development and structure, and then describes all phases of the construction of the corpus. Finally, the third chapter presents the case study, which is introduced by a description of the genitive case in Finnish and of several strategies used by interpreters to deal with certain difficulties in simultaneous interpreting. The case study highlights two dimensions of the EPTIC corpus. Each original speech was compared with its interpreted version (parallel dimension), and each interpreted speech was compared with its verbatim report, the written version of the oral speech (intermodal dimension). The results confirm the initial assumption that the “of structure” is rendered more accurately from English into Finnish in translation than in interpreting. Moreover, translators use the Finnish genitive case more often than interpreters do.
Abstract:
Considering language as a product of society, but also as a fundamental means of establishing relations between people, we seek to understand its place in the globalized society, with the aim of developing a methodology for terminological analysis that contributes to higher-quality specialized communication in the network society. This work is organized in two parts. The first is devoted to reflecting on the role of language in the network society, focusing on essential questions around the tension between multilingualism and the hegemony of English as a lingua franca, particularly in the European space. We are interested, on the one hand, in reflecting on the definition of language policies, specifically in the multilingual Europe of 28, and, on the other, in highlighting the preponderant role that language plays in the transmission of knowledge. The second part puts the research of the first part into practice through an analysis of financial reporting, a domain of knowledge that is not only inherently multilingual, because its application is transnational, but also reflects the tension identified in the first part, insofar as English assumes, in the business world in general and in financial markets in particular, the hegemonic role of lingua franca. The terminological approach we advocate is semasiological for onomasiological purposes, so we start from the analysis of specialized text, organized in specialized corpora. We subsequently discuss the results of our analysis with the domain experts who will validate them and whose collaboration at various stages of the terminological and conceptual analysis is fundamental to guaranteeing the quality of the terminological resources produced.
From this perspective, we explore a corpus of legislative texts within the scope of the Sistema de Normalização Contabilística (SNC, the Accounting Standardization System), in order to outline a working methodology that will, in the future, lead to the construction of a terminological database for financial reporting. Concurrently, we also carry out a study of the Conceptual Framework of the SNC, for which we compare specialized translation in financial reporting, based on a parallel corpus composed of the international accounting legislation endorsed by the European Union. We use the parallel corpus, made up of texts originally written in English and translated into Portuguese, together with the specialized corpus created from the legislation on the Portuguese accounting standards, to test a methodology for extracting equivalents. Finally, we argue that harmonization in financial reporting, besides being governed by common accounting policies, must be underpinned by terminological considerations. It is therefore necessary to harmonize the terminology of financial reporting, enabling experts to communicate in Portuguese free from the interference of the English inherited from the international standards, through the two processes we identify: the translation and the adaptation of the International Accounting Standards.
Abstract:
Master's dissertation in Portuguese as a Non-Native Language (MPLNM): Portuguese as a Foreign Language (PLE) and as a Second Language (PL2)
Abstract:
Search engines are part of our daily lives. Currently, more than a third of the world's population uses the Internet. Search engines allow these users to quickly find the information or products they want. Information retrieval (IR) is the foundation of modern search engines. Traditional IR approaches assume that index terms are independent. However, terms that appear in the same context are often dependent. Failing to take these dependencies into account is one of the causes of noise in the results (non-relevant results). Some studies have proposed integrating certain types of dependency, such as proximity, co-occurrence, adjacency and grammatical dependency. In most cases, the dependency models are built separately and then combined with the traditional word-based model using a constant weight. Consequently, they cannot correctly capture variable dependency and dependency strength. For example, the dependency between the adjacent words "Black Friday" is stronger than that between the words "road constructions". In this thesis, we study different approaches to capturing the relations between terms and their dependency strengths. We propose the following methods. First, we re-examine the combination approach using different indexing units for monolingual IR in Chinese and cross-lingual IR between English and Chinese. In addition to words, we study the possibility of using bigrams and unigrams as translation units for Chinese. Several translation models are built to translate English words into Chinese unigrams, bigrams and words using a parallel corpus. An English query is then translated in several ways, and a ranking score is produced with each translation.
The final ranking score combines all these types of translation. Second, we consider the dependency between terms using the Dempster-Shafer theory of evidence. An occurrence of a text fragment (of several words) in a document is considered to represent the set of all its constituent terms. Probability is assigned to such a set of terms rather than to each individual term. At query evaluation time, this probability is redistributed to the query terms if they are different. This approach allows us to integrate dependency relations between terms. Third, we propose a discriminative model to integrate the different types of dependency according to their strength and their usefulness for IR. In particular, we consider adjacency and co-occurrence dependencies at different distances, that is, bigrams and pairs of terms within a window of 2, 4, 8 and 16 words. The weight of a bigram or of a pair of dependent terms is determined from a set of features, using SVM regression. All the proposed methods are evaluated on several collections in English and/or Chinese, and the experimental results show that these methods produce substantial improvements over the state of the art.
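The dependency units named in this abstract (adjacent bigrams, and term pairs co-occurring within windows of 2, 4, 8 and 16 words) can be collected with a short routine like the one below. This is a minimal sketch: the function name, the treatment of pairs as unordered, and the convention that a window of w words starts at each token are assumptions, and the SVM regression that weights each unit is not shown.

```python
from collections import Counter


def dependency_counts(tokens, windows=(2, 4, 8, 16)):
    """Count ordered bigrams and unordered term pairs within each window size."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    pairs = {w: Counter() for w in windows}
    for i, left in enumerate(tokens):
        for w in windows:
            # A window of w words starting at position i covers tokens i .. i+w-1.
            for right in tokens[i + 1 : i + w]:
                pairs[w][tuple(sorted((left, right)))] += 1
    return bigrams, pairs
```

Counts of this kind could serve as input features for a model that assigns each bigram or pair its own weight, rather than one constant weight for the whole dependency model.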
Abstract:
This paper investigates certain training methods adopted in a Statistical Machine Translator (SMT) from English to Malayalam. In English-Malayalam SMT, word-to-word translation is determined by training on the parallel corpus. Our primary goal is to improve the alignment model by reducing the number of possible alignments across all sentence pairs in the bilingual corpus. Incorporating morphological information into the parallel corpus with the help of a part-of-speech tagger has brought about better training results with improved accuracy.
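One way part-of-speech information can shrink the alignment space is to keep only candidate word links whose tags agree. The sketch below assumes both sentences are tagged with a shared tagset and that exact tag equality is the filter; both are simplifying assumptions for illustration, since the abstract does not describe how the tags are actually used.

```python
def candidate_links(src_tagged, tgt_tagged, use_pos=True):
    """List candidate word links (i, j) for one sentence pair.

    Each sentence is a list of (word, tag) tuples over a shared tagset.
    With use_pos=False every source word may link to every target word.
    """
    links = []
    for i, (_, src_tag) in enumerate(src_tagged):
        for j, (_, tgt_tag) in enumerate(tgt_tagged):
            if not use_pos or src_tag == tgt_tag:
                links.append((i, j))
    return links
```

For a pair of three-word sentences, the unfiltered space has nine links; with two nouns and one verb on each side, the tag filter leaves five.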
Abstract:
In Statistical Machine Translation from English to Malayalam, an unseen English sentence is translated into its equivalent Malayalam sentence using statistical models. A parallel English-Malayalam corpus is used in the training phase. Word-to-word alignments have to be set among the sentence pairs of the source and target languages before they are submitted for training. This paper deals with certain techniques that can be adopted to improve the alignment model of SMT. Methods that incorporate part-of-speech information into the bilingual corpus have eliminated many of the insignificant alignments. Identifying the named entities and cognates present in the sentence pairs has also proved advantageous when setting up the alignments. The presence of Malayalam words with predictable translations has further contributed to reducing the insignificant alignments. Moreover, reducing the unwanted alignments has led to better training results. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with the F-measure, BLEU and WER evaluation metrics.
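Of the metrics mentioned, word error rate (WER) is the simplest to state: the word-level edit distance between a system output and a reference translation, divided by the reference length. The sketch below is a standard dynamic-programming formulation; the function name and whitespace tokenization are assumptions, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Lower is better, and the value can exceed 1.0 when the hypothesis is much longer than the reference.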
Abstract:
In Statistical Machine Translation from English to Malayalam, an unseen English sentence is translated into its equivalent Malayalam sentence using statistical components such as a translation model, a language model and a decoder. A parallel English-Malayalam corpus is used in the training phase. Word-to-word alignments have to be set up among the sentence pairs of the source and target languages before they are submitted for training. This paper deals with techniques that can be adopted to improve the alignment model of SMT. Incorporating part-of-speech information into the bilingual corpus has eliminated many of the insignificant alignments. Identifying the named entities and cognates present in the sentence pairs has also proved advantageous when setting up the alignments. Moreover, reducing the unwanted alignments has led to better training results. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with the F-measure, BLEU and WER evaluation metrics.
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Abstract:
Graduate Program in Linguistic Studies - IBILCE
Abstract:
Brazil has stood out among the countries that publish the most articles in scientific journals. From 2007 to 2008, Brazilian scientific production moved from 15th to 13th place in the world ranking of articles published in professional journals. However, 60% of the articles published by Brazilians are in Portuguese, so Brazilian work receives little international attention. The purpose of this research is to build and analyze a parallel corpus composed of a Remote Sensing book and its translation from English into Portuguese, in order to create a glossary of the most recurrent terms in the Remote Sensing literature. To achieve these goals, the research takes as its theoretical and methodological foundation Corpus-Based Translation Studies (BAKER, 1993, 1995, 1996; CAMARGO, 2005), Corpus Linguistics (BERBER SARDINHA, 2004) and principles of Terminology (BARROS, 2004; KRIEGER & FINATTO, 2004). It will also use the WordSmith Tools program and its tools. Besides the parallel corpus, we will also build two comparable corpora, from articles published in Brazilian and in international journals in the area, respectively. The first results show that the translators used greater vocabulary variation in their translations, which may be a way of making the text clearer to the reader. For the analysis of glossary entries, professionals from the National Institute for Space Research (INPE) will be consulted and their views incorporated into this research to give consistency to the proposed bilingual glossary.
Abstract:
The aim of this research is to build and analyze a parallel corpus in the field of remote sensing in order to identify, according to its frequency, specialized collocations in English and then search for their equivalents in Portuguese. The research is based on the interdisciplinary approach of Corpus-Based Translation Studies (BAKER, 1995; CAMARGO, 2007), Corpus Linguistics (BERBER SARDINHA, 2004; TOGNINI-BONELLI, 2001), Phraseology (ORENHA-OTTAIANO, 2009; PAVEL, 1993), and some principles of Terminology (BARROS, 2004). For manipulating the corpora, the program WordSmith Tools (SCOTT, 2012) version 6.0 is used. To support this study, two comparable corpora in English and Portuguese were also built from articles published in both national and international journals in remote sensing. The results show that the collocations in Portuguese seem to be still in the process of conventionalization, as the translators made use of greater variation in their translational options, which can be a way to make the text clearer for the reader.
Abstract:
[EN] This article focuses on a specific feature found in tourist guidebooks: the recurrent use of foreign expressions, or “third language”. It presents the findings of a comparative analysis of a parallel corpus made up of twenty guidebooks: ten guidebooks originally written in English and their corresponding translations into Spanish, describing different countries and cities (all of them published by Lonely Planet). The analysis focuses on the chapters in which the writer includes practical information. The purpose of the study is to analyze the use of the third language in the English and Spanish versions and to identify the translation strategies used by the translators to transfer these linguistic elements from one language to the other.