900 results for Bilingual Corpus
Abstract:
Following the internationalization of contemporary higher education, academic institutions based in non-English speaking countries are increasingly urged to produce content in English to address prospective international students and personnel, as well as to increase their attractiveness. The demand for English translations in the institutional academic domain is consequently increasing at a rate exceeding the capacity of the translation profession. Resources for assisting non-native authors and translators in the production of appropriate texts in L2 are therefore required in order to help academic institutions and professionals streamline their translation workload. Some of these resources include: (i) parallel corpora to train machine translation systems and multilingual authoring tools; and (ii) translation memories for computer-aided tools. The purpose of this study is to create and evaluate reference resources like the ones mentioned in (i) and (ii) through the automatic sentence alignment of a large set of Italian and English as a Lingua Franca (ELF) institutional academic texts given as equivalent but not necessarily parallel (i.e. translated). In this framework, a set of alignment algorithms and alignment tools is examined in order to identify the most profitable one(s) in terms of accuracy and time- and cost-effectiveness. In order to determine the text pairs to align, a sample is selected according to document length similarity (characters) and subsequently evaluated in terms of extent of noisiness/parallelism, alignment accuracy and content leverageability. The results of these analyses serve as the basis for the creation of an aligned bilingual corpus of academic course descriptions, which is eventually used to create a translation memory in TMX format.
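The sample selection by document length similarity described in this abstract can be illustrated with a minimal sketch. The ratio measure, the 0.8 threshold, and the function names below are assumptions made for illustration; the abstract does not state which measure or cut-off the study used.

```python
def length_similarity(text_a: str, text_b: str) -> float:
    """Ratio of the shorter character length to the longer one (0.0 to 1.0)."""
    len_a, len_b = len(text_a), len(text_b)
    if max(len_a, len_b) == 0:
        return 1.0  # two empty documents are trivially the same length
    return min(len_a, len_b) / max(len_a, len_b)


def select_pairs(pairs, threshold=0.8):
    """Keep (italian, english) pairs whose length similarity meets the threshold.

    The threshold value is a hypothetical choice, not taken from the study.
    """
    return [(it, en) for it, en in pairs if length_similarity(it, en) >= threshold]


# Toy document pairs, invented for this example.
candidate_pairs = [
    ("Il corso introduce la linguistica dei corpora.",
     "The course introduces corpus linguistics."),
    ("Programma del corso.",
     "This course covers a wide range of topics, including syntax and semantics."),
]
selected = select_pairs(candidate_pairs)
print(len(selected), "pair(s) kept for alignment")
```

A length filter of this kind only removes obviously mismatched documents; the noisiness and alignment accuracy checks mentioned in the abstract would still be needed on the pairs that pass.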
Abstract:
The article briefly reviews a bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus was collected and developed as a result of collaboration within the framework of a joint research project between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because natural language is an outstanding part of human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to contrastive studies of both Slavic languages and will also be a useful resource for language engineering research and development, especially in machine translation.
Abstract:
Thematization is recognized as a fundamental phenomenon in the construction of messages and texts by different linguistic schools. This location within a text privileges the elements that guide the reader in the orientation and interpretation of discourse at different levels. Thematizing a linguistic unit by locating it in the first-initial position of a clause, paragraph, or text confers upon it a special status: a signal of the organizational strategy which characterizes different text types, playing a role as a variable in the distinction of registers, text types and genres. However, in spite of the importance of the study of thematization for message and textual structuring, to date there are no linguistic studies that have undertaken the task of validating its aspects in a comparative manner, either for linguistic or computational purposes. This study, therefore, fills a research gap by implementing a methodology based on contrastive corpus annotation, which allows the empirical validation of aspects of the phenomenon of Thematization in English and Spanish. It also seeks to develop a bilingual English-Spanish comparable corpus of newspaper texts automatically annotated with thematic features at clausal and discourse levels. The empirically validated categories (Thematic Field and its elements: Textual Theme, Interpersonal Theme, PreHead and Head) are used to annotate a larger corpus of three newspaper genres (news reports, editorials and letters to the editor) in terms of thematic choices. This characterization reveals interesting results, such as the use of genre-specific strategies in thematic position. In addition, the thesis investigates the possibility of automating the annotation of thematic features in the bilingual corpus through the development of a set of JAVA rules implemented in GATE. It also shows the efficacy of this method in comparison with the manual annotation results...
Abstract:
This study investigated whether the conception present in Replicator Theory, expressed through the concept of the meme (DAWKINS, 1979), could be a compatible model for explaining the propagation of memes in the substrate of social media. Within local studies, Recuero (2006) suggested a transduction of this model, based on the conceptions of Dawkins (1979). Reflecting on the epistemological position of Recuero (2006), the present work, drawing on Dennett (1995), Blackmore (2002) and Tyler (2011b; 2013b), carried out the Conceptual and Compositional Analysis of this transduction. Starting from the concept of the memeplex (BLACKMORE, 2002), this linguistically based research (HALLIDAY, 1987) understands memes, in the substrate of digital/social media, as practices of linguistic-media production and distribution, spread through various units of propagation and through the relations created by internet users in this transmission process. Investigating these relations through Relational Analysis, the study examines two units of propagation: memetic expressions (Que deselegante and #Tenso) and memetic images (originating from the memetic phenomenon Nana em desastres). The study comprises two corpora of memetic expressions (5275 posts originating from or redirected to Twitter.com, totalling 83,655 words/tokens) and a bilingual (Portuguese/English) corpus of memetic images (a total of 134 images from Tumblr.com and Facebook.com). To analyse the corpora of memetic expressions, the methodology of Corpus Linguistics was used (BERBER-SARDINHA, 2004; SHEPHERD, 2009; SOUZA JÚNIOR, 2012, 2013b, 2013c). For the analysis of the multimodal corpus of memetic images, we used the methodology that we call Propagation Analysis.
We aimed to verify whether these units of propagation, and the linguistic-media practices they transmit, would evolve solely owing to memetic-media aspects, as Recuero (2006) had indicated, and with an internalist propagation pattern (DAWKINS, 1979; 1982). Analysis of the data revealed that, at the level of purpose, the local phenomena investigated did not evolve by an internalist (or homogeneous) pattern of propagation. These patterns prove to be externalist (or heterogeneous) in nature. Furthermore, constitutive memetic principles of evolution such as fecundity and longevity (DAWKINS, 1979; 1982) and design (DENNETT, 1995), together with the media-related evolutionary principle of reach (RECUERO, 2006), remained present with a high degree of influence in the externalist propagations. On the other hand, the memetic principle of fidelity (DAWKINS, 1979; 1982) was the one that least influenced these propagation patterns. Neutralizing fidelity, and driven by the design principle, the systematizing linguistic principles revealed by this study stood out in this evolutionary process: namely, the principle of functionality (memes evolve because they can indicate different purposes) and the principle of linguistic reach (memes can be directed at animate/inanimate items, and at internet users in a native/foreign language).
Abstract:
This paper outlines a methodology for translating text from English into the Dravidian language Malayalam using statistical models. By using a monolingual Malayalam corpus and a bilingual English/Malayalam corpus in the training phase, the machine automatically generates Malayalam translations of English sentences. This paper also discusses a technique to improve the alignment model by incorporating parts-of-speech information into the bilingual corpus. Removing the insignificant alignments from the sentence pairs by this approach has ensured better training results. Pre-processing techniques like suffix separation from the Malayalam corpus and stop word elimination from the bilingual corpus also proved to be effective in training. Various handcrafted rules designed for the suffix separation process, which can be used as a guideline in implementing suffix separation in the Malayalam language, are also presented in this paper. The structural difference between the English-Malayalam pair is resolved in the decoder by applying the order conversion rules. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with F measure, BLEU and WER evaluation metrics.
Abstract:
This paper investigates certain methods of training adopted in the Statistical Machine Translator (SMT) from English to Malayalam. In English-Malayalam SMT, the word-to-word translation is determined by training on the parallel corpus. Our primary goal is to improve the alignment model by reducing the number of possible alignments of all sentence pairs present in the bilingual corpus. Incorporating morphological information into the parallel corpus with the help of the parts-of-speech tagger has brought about better training results with improved accuracy.
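The idea of reducing the number of possible alignments with part-of-speech information can be sketched as follows. This is an illustrative toy, not the authors' procedure: it assumes both sides are tagged with one shared coarse tagset, and the target tokens `m1`, `m2`, `m3` are placeholders rather than real Malayalam words.

```python
from itertools import product


def candidate_alignments(src_tagged, tgt_tagged, use_pos=True):
    """Return (source_word, target_word) candidates for one sentence pair.

    Each input is a list of (word, pos_tag) tuples. With use_pos=True,
    only words carrying the same tag are paired, which prunes the
    full cross product of source and target words.
    """
    return [
        (s_word, t_word)
        for (s_word, s_tag), (t_word, t_tag) in product(src_tagged, tgt_tagged)
        if not use_pos or s_tag == t_tag
    ]


# Hypothetical tagged sentence pair (placeholder target tokens).
english = [("Rama", "NOUN"), ("ate", "VERB"), ("fruit", "NOUN")]
target = [("m1", "NOUN"), ("m2", "NOUN"), ("m3", "VERB")]

unconstrained = candidate_alignments(english, target, use_pos=False)
constrained = candidate_alignments(english, target, use_pos=True)
print(len(unconstrained), "->", len(constrained))
```

In this toy pair the tag constraint leaves five of the nine possible word pairings, so a statistical alignment model trained afterwards has fewer competing candidates per sentence pair.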
Abstract:
A methodology for translating text from English into the Dravidian language Malayalam using statistical models is discussed in this paper. The translator utilizes a monolingual Malayalam corpus and a bilingual English/Malayalam corpus in the training phase and automatically generates the Malayalam translation of an unseen English sentence. Various techniques to improve the alignment model by incorporating morphological inputs into the bilingual corpus are discussed. Removing the insignificant alignments from the sentence pairs by this approach has ensured better training results. Pre-processing techniques like suffix separation from the Malayalam corpus and stop word elimination from the bilingual corpus also proved to be effective in producing better alignments. Difficulties in the translation process that arise from the structural difference between the English-Malayalam pair are resolved in the decoding phase by applying the order conversion rules. The handcrafted rules designed for the suffix separation process, which can be used as a guideline in implementing suffix separation in the Malayalam language, are also presented in this paper. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with F measure, BLEU and WER evaluation metrics.
Abstract:
In Statistical Machine Translation from English to Malayalam, an unseen English sentence is translated into its equivalent Malayalam sentence using statistical models. A parallel corpus of English-Malayalam is used in the training phase. Word-to-word alignments have to be set among the sentence pairs of the source and target language before subjecting them to training. This paper deals with certain techniques which can be adopted for improving the alignment model of SMT. Methods to incorporate parts-of-speech information into the bilingual corpus have resulted in eliminating many of the insignificant alignments. Identifying the named entities and cognates present in the sentence pairs has also proved to be advantageous while setting up the alignments. The presence of Malayalam words with predictable translations has also contributed to reducing the insignificant alignments. Moreover, reduction of the unwanted alignments has brought in better training results. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with F measure, BLEU and WER evaluation metrics.
Abstract:
In Statistical Machine Translation from English to Malayalam, an unseen English sentence is translated into its equivalent Malayalam translation using statistical components such as a translation model, a language model and a decoder. A parallel corpus of English-Malayalam is used in the training phase. Word-to-word alignments have to be set up among the sentence pairs of the source and target language before subjecting them to training. This paper deals with the techniques which can be adopted for improving the alignment model of SMT. Incorporating parts-of-speech information into the bilingual corpus has eliminated many of the insignificant alignments. Identifying the named entities and cognates present in the sentence pairs has also proved to be advantageous while setting up the alignments. Moreover, reduction of the unwanted alignments has brought in better training results. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with F measure, BLEU and WER evaluation metrics.
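Of the evaluation metrics named in these abstracts, WER (word error rate) is the simplest to state: the word-level edit distance between a system translation and a reference, divided by the reference length. The sketch below is a generic textbook formulation using whitespace tokenization, which is an assumption; the papers' own evaluation setup is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Lower values are better, and the score can exceed 1.0 when the hypothesis is much longer than the reference.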
Abstract:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Abstract:
Word Sense Disambiguation, the process of identifying the meaning of a word in a sentence when the word has multiple meanings, is a critical problem of machine translation. It is generally very difficult to select the correct meaning of a word in a sentence, especially when the syntactic difference between the source and target languages is large, e.g., in English-Korean machine translation. To achieve a high level of accuracy of noun sense selection in machine translation, we introduced a statistical method based on the co-occurrence relation of words in sentences and applied it to the English-Korean machine translator RyongNamSan. ACM Computing Classification System (1998): I.2.7.
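A co-occurrence-based sense selector can be sketched in a few lines. This is a generic illustration of the idea, not the method implemented in RyongNamSan: the sense labels, the training sentences, and the simple count-sum scoring are all assumptions made for the example.

```python
from collections import Counter


def build_cooccurrence(sense_tagged_sentences):
    """Count how often each context word appears alongside each sense label."""
    counts = {}
    for sense, context_words in sense_tagged_sentences:
        counts.setdefault(sense, Counter()).update(context_words)
    return counts


def select_sense(context_words, counts):
    """Pick the sense whose training contexts overlap most with the given words."""
    return max(
        counts,
        key=lambda sense: sum(counts[sense][word] for word in context_words),
    )


# Hypothetical sense-tagged training contexts for the noun "bank".
training = [
    ("bank/finance", ["money", "deposit", "loan"]),
    ("bank/finance", ["account", "money", "interest"]),
    ("bank/river", ["river", "water", "shore"]),
]
counts = build_cooccurrence(training)
print(select_sense(["deposit", "money", "today"], counts))
```

Words never seen with a sense contribute zero to its score, so unseen context words such as "today" are simply ignored by this scoring.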
Abstract:
Phraseological units are complex structures that may be difficult to comprehend and transfer into other languages due to their idiomatic nature. The translator of English legal texts often comes across binomials, a type of phraseological unit that is a characteristic of this specialized discourse. Based on a specialized comparable bilingual corpus composed of legal forms and agreements, this article identifies several occurrences of this phraseological structure and extracts the most frequent examples in English and Spanish. A contrastive analysis of the data obtained from the corpus helps to establish a series of equivalencies among binomials in both languages and proposes a typology of equivalences regarding these phraseological structures.
Abstract:
The impact of Greek-Egyptian bilingualism on language use and linguistic competence is the key issue in this dissertation. The language use in a corpus of 148 Greek notarial contracts is analyzed on phonological, morphological and syntactic levels. The texts were written by bilingual notaries (agoranomoi) in Upper Egypt in the later Hellenistic period. They present, for the most part, very good administrative Greek. On the other hand, their language contains variation and idiosyncrasies that were earlier condemned as ungrammatical and bad Greek, and were not subjected to closer analysis. In order to reach plausible explanations for those phenomena, thorough research into the sociohistorical and linguistic context was needed before the linguistic analysis. The general linguistic landscape, the population pattern and the status and frequency of Greek literacy in Ptolemaic Egypt in general, and in Upper Egypt in particular, are presented. Through a detailed examination of the notaries themselves (their names, families and handwriting), it became evident that there were one to three persons at the notarial office writing under the signature of one notary. Often the documents under one notary's name were written in the same hand. We get, therefore, exceptionally close to studying idiolects in written material from antiquity. The qualitative linguistic analysis revealed that the notaries made relatively few orthographic mistakes that reflect the ongoing phonological changes and they mastered the morphological forms. The problems arose at the syntactic level, for example, with the pattern of agreement between the noun groups or a noun with its modifiers. The significant structural differences between Greek and Egyptian may lie behind the innovative strategies used by some of the notaries. Moreover, certain syntactic structures were clearly transferred from the notaries' first language, Egyptian. This is obvious in the relative clause structure.
Transfer can be found in other structures as well, although we must not forget the influence of parallel Greek structures. Sometimes these can act simultaneously. The interesting linguistic strategies and transfer features come mostly from the hand of one notary, Hermias. Some other notaries show similar patterns, for example, Hermias' cousin, Ammonios. Hermias' texts reveal that he probably spoke Greek more than his predecessors. It is possible to conclude, then, that the notaries of the later generations were more fluently bilingual; their two languages were partly integrated in their minds as an interlanguage combining elements from both languages. The earlier notaries had the two languages functionally separated and they followed the standardized contract formulae more rigidly.
Abstract:
Presentation for the 5th International Conference on Corpus Linguistics (CILC 2013), V Congreso Internacional de Lingüistica de Corpus.
Abstract:
Edited by Andrea Abel, Chiara Vettori, Natascia Ralli.