872 resultados para Comparable Corpus
Resumo:
Les travaux entrepris dans le cadre de la présente thèse portent sur l’analyse de l’équivalence terminologique en corpus parallèle et en corpus comparable. Plus spécifiquement, nous nous intéressons aux corpus de textes spécialisés appartenant au domaine du changement climatique. Une des originalités de cette étude réside dans l’analyse des équivalents de termes simples. Les bases théoriques sur lesquelles nous nous appuyons sont la terminologie textuelle (Bourigault et Slodzian 1999) et l’approche lexico-sémantique (L’Homme 2005). Cette étude poursuit deux objectifs. Le premier est d’effectuer une analyse comparative de l’équivalence dans les deux types de corpus afin de vérifier si l’équivalence terminologique observable dans les corpus parallèles se distingue de celle que l’on trouve dans les corpus comparables. Le deuxième consiste à comparer dans le détail les équivalents associés à un même terme anglais, afin de les décrire et de les répertorier pour en dégager une typologie. L’analyse détaillée des équivalents français de 343 termes anglais est menée à bien grâce à l’exploitation d’outils informatiques (extracteur de termes, aligneur de textes, etc.) et à la mise en place d’une méthodologie rigoureuse divisée en trois parties. La première partie qui est commune aux deux objectifs de la recherche concerne l’élaboration des corpus, la validation des termes anglais et le repérage des équivalents français dans les deux corpus. La deuxième partie décrit les critères sur lesquels nous nous appuyons pour comparer les équivalents des deux types de corpus. La troisième partie met en place la typologie des équivalents associés à un même terme anglais. Les résultats pour le premier objectif montrent que sur les 343 termes anglais analysés, les termes présentant des équivalents critiquables dans les deux corpus sont relativement peu élevés (12), tandis que le nombre de termes présentant des similitudes d’équivalence entre les corpus est très élevé (272 équivalents identiques et 55 équivalents non critiquables). L’analyse comparative décrite dans ce chapitre confirme notre hypothèse selon laquelle la terminologie employée dans les corpus parallèles ne se démarque pas de celle des corpus comparables. Les résultats pour le deuxième objectif montrent que de nombreux termes anglais sont rendus par plusieurs équivalents (70 % des termes analysés). Il est aussi constaté que ce ne sont pas les synonymes qui forment le groupe le plus important des équivalents, mais les quasi-synonymes. En outre, les équivalents appartenant à une autre partie du discours constituent une part importante des équivalents. Ainsi, la typologie élaborée dans cette thèse présente des mécanismes de l’équivalence terminologique peu décrits aussi systématiquement dans les travaux antérieurs.
Resumo:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Resumo:
El objetivo de este Proyecto Fin de Carrera es abordar el análisis del capítulo de conclusiones de tesis de ingeniería de telecomunicación a partir de un corpus comparable en inglés y español. A través del léxico podrán conocerse las expresiones típicas y la estructura de capítulo de conclusiones, tanto en inglés como en español. Para empezar este Proyecto, se ha compilado los corpus que se quieren comparar, en total se ha digitalizado tres corpus, uno con 24 conclusiones de tesis doctorales en español, otro con el mismo número de capítulos de conclusiones de tesis doctorales en inglés (PhD) y por último un corpus de conclusiones de tesis de fin de máster y de grado. El primer análisis que se ha realizado es el de la estructura de las conclusiones a partir de los títulos y subtítulos del capítulo. Se han comparado los títulos más utilizados y se han comentado las coincidencias y diferencias entre los corpus. La estructura vista a través de los subtítulos, se ha comparado con la propuesta por la autora Glasman-Deal (2011) en trabajos académicos de investigación, principalmente en artículos de investigación. La siguiente parte del Proyecto se ha centrado en el estudio del léxico, para ello nos hemos ayudado de la herramienta informática Wordsmith tools de la que se han explicado sus herramientas y funciones más útiles para este trabajo entre ellas el plot, que informa número de archivos en la que aparece cada palabra en el corpus. Las palabras con mayor plot son las más usadas por todos los doctorandos cuando escriben el capítulo de conclusiones .Se han elaborado unas pirámides donde se han colocado las palabras propias del género académico de las tesis por orden de uso. Las más usadas, con mayor plot, en la base y según se asciende aparecen las que tienen menor plot, con el fin de ver de una forma gráfica el peso que tiene cada palabra en el corpus. El siguiente paso del análisis del léxico ha tenido el objetivo de diferenciar los contextos de uso de las palabras incluidas en las pirámides. Se ha diferenciado entre los usos de las palabras dependiendo de su denotación académica o técnica. Esta comparación ha permitido comprobar que dentro del mismo corpus un sustantivo como contribuciones tiene connotación positiva o negativa dependiendo del contexto. Con los ejemplos aportados por los corpus se proporciona una base para el análisis lingüístico, centrado en los sustantivos, en este trabajo. Para finalizar el Proyecto, se ha implementado una base de datos con los resultados obtenidos del análisis de los sustantivos en la que se pueden ver las palabras que corresponden a cada nivel de la pirámide y ejemplos del uso de estas palabras. The aim of this Project is to analyze the concluding chapter of PhD thesis in the field of telecommunication engineering by means of a comparable corpus in English and Spanish. Through the lexis we will be able to capture useful expressions and the typical structure of the chapter in these specialized thesis, either in English and Spanish. To start with, three corpora have been compiled. The first one consists of 24 concluding chapters of PhD thesis in Spanish; the second, is made up of the same number of chapters of PhD thesis in the English language; and finally, 24 further chapters of Master and Degree thesis in English were digitalized and prepared for lexis analysis. Second, the study of the structure of the chapter of conclusions has been carried out. In this part the most common titles in the chapter of conclusions have been analysed and compared so as to find differences and similarities between the two languages compared. Moreover, the structure found through the subtitles in the conclusions of the thesis has been compared with the structure proposed by Glasman-Deal (2011) in her book Science Research Writing. Third, the study has been focused on the lexis of each corpus. These corpora have been treated with a lexis analyser called Wordsmith tools. The variables of frequency and plot have been applied to withdraw the most widely used nouns from the list of all the words found in any of the corpus. A pyramidal structure has been designed in order to show the academic or gender nouns - the ones usually found in the concluding chapter of thesis – nouns with a higher plot in the corpus. Two different types of context have been found for these nouns: technical and academic denotation. To show the difference in use of these nouns, arranged examples of contexts are given for each of the words studied. Finally, a database has been implemented to arrange the results of the lexis study. In this database the most significant examples of each noun are shown.
Resumo:
Thematization is recognized as a fundamental phenomenon in the construction of messages and texts by di erent linguistic schools. This location within a text privileges the elements that guide the reader in the orientation and interpretation of discourse at di erent levels. Thematizing a linguistic unit by locating it in the rst-initial position of a clause, paragraph, or text, confers upon it a special status: a signal of the organizational strategy which characterizes di erent text types playing a role as a variable in the distinction of registers, text types and genres. However, in spite of the importance of the study of thematization for message and textual structuring, to date there are no linguistic studies that have undertook the task of validating its aspects in a comparative manner, either for linguistic or computational purposes. This study, therefore, lls a research gap by implementing a methodology based on contrastive corpus annotation, which allows to empirically validate aspects of the phenomenon of Thematization in English and Spanish, it also seeks to develop a bilingual English-Spanish comparable corpus of newspaper texts automatically annotated with thematic features at clausal and discourse levels. The empirically validated categories (Thematic Field and its elements: Textual Theme, Interpersonal Theme, PreHead and Head) are used to annotate a larger corpus of three newspaper genres news reports, editorials and letters to the editor in terms of thematic choices. This characterization, reveals interesting results, such as the use of genre-speci c strategies in thematic position. In addition, the thesis investigates the possibility to automate the annotation of thematic features in the bilingual corpus through the development of a set of JAVA rules implemented in GATE. It also shows the e cacy of this method in comparison with the manual annotation results...
Resumo:
Este trabalho tem por objetivo verificar os padrões adotados para retratar o protagonista das Crônicas de Nárnia de C. S. Lewis o leão Aslan - na sua tradução para o português brasileiro. A dissertação objetiva também estabelecer se o paralelo sugerido pelo autor das Crônicas entre Aslan e a figura do Deus cristão é captado para o português brasileiro. Baseamo-nos na carta documental escrita por Lewis, na qual ele descreve a relação entre sua obra e o texto bíblico, principalmente no seu personagem principal, Aslan. Como arcabouço teórico adotado, discorremos sobre algumas correntes dos Estudos da Tradução que foram úteis para a análise do corpus o conceito de estilo e de afastamento do original e adaptação (BAKER, 1993, SHUTTLEWORTH,1999, TOURY, 1980). Como metalinguagem que nos ajuda a explicar as mudanças ocorridas entre o texto original e o texto traduzido, utilizamos a Gramática Sistêmico-Funcional, principalmente a função experiencial descrita pelos Processos (em especial os Processos Verbais), e a função interpessoal representada pela Teoria da Valoração (Appraisal). Para a análise dos textos, foi compilado um corpus paralelo contendo os sete livros da série em inglês e português e também um corpus comparável, utilizado para ratificar os resultados encontrados. Esses resultados apontam para um afastamento do texto traduzido radical em relação ao texto original no que se refere à construção do personagem Aslan, a saber: mudanças na prosódia semântica, mudanças de Força, omissões ou adições que alteram o sentido e mudança na agência dos Processos Verbais. Sugerimos que esses contrastes entre as alegorias tecidas por Lewis no texto original e o que encontramos na fala traduzida de Aslan podem alterar a percepção que se tem de Aslan como um símbolo cristão, quando reescrito em português do Brasil
Criteria for the validation of specialized verb equivalents : application in bilingual terminography
Resumo:
Multilingual terminological resources do not always include valid equivalents of legal terms for two main reasons. Firstly, legal systems can differ from one language community to another and even from one country to another because each has its own history and traditions. As a result, the non-isomorphism between legal and linguistic systems may render the identification of equivalents a particularly challenging task. Secondly, by focusing primarily on the definition of equivalence, a notion widely discussed in translation but not in terminology, the literature does not offer solid and systematic methodologies for assigning terminological equivalents. As a result, there is a lack of criteria to guide both terminologists and translators in the search and validation of equivalent terms. This problem is even more evident in the case of predicative units, such as verbs. Although some terminologists (L‘Homme 1998; Lerat 2002; Lorente 2007) have worked on specialized verbs, terminological equivalence between units that belong to this part of speech would benefit from a thorough study. By proposing a novel methodology to assign the equivalents of specialized verbs, this research aims at defining validation criteria for this kind of predicative units, so as to contribute to a better understanding of the phenomenon of terminological equivalence as well as to the development of multilingual terminography in general, and to the development of legal terminography, in particular. The study uses a Portuguese-English comparable corpus that consists of a single genre of texts, i.e. Supreme Court judgments, from which 100 Portuguese and 100 English specialized verbs were selected. The description of the verbs is based on the theory of Frame Semantics (Fillmore 1976, 1977, 1982, 1985; Fillmore and Atkins 1992), on the FrameNet methodology (Ruppenhofer et al. 2010), as well as on the methodology for compiling specialized lexical resources, such as DiCoInfo (L‘Homme 2008), developed in the Observatoire de linguistique Sens-Texte at the Université de Montréal. The research reviews contributions that have adopted the same theoretical and methodological framework to the compilation of lexical resources and proposes adaptations to the specific objectives of the project. In contrast to the top-down approach adopted by FrameNet lexicographers, the approach described here is bottom-up, i.e. verbs are first analyzed and then grouped into frames for each language separately. Specialized verbs are said to evoke a semantic frame, a sort of conceptual scenario in which a number of mandatory elements (core Frame Elements) play specific roles (e.g. ARGUER, JUDGE, LAW), but specialized verbs are often accompanied by other optional information (non-core Frame Elements), such as the criteria and reasons used by the judge to reach a decision (statutes, codes, previous decisions). The information concerning the semantic frame that each verb evokes was encoded in an xml editor and about twenty contexts illustrating the specific way each specialized verb evokes a given frame were semantically and syntactically annotated. The labels attributed to each semantic frame (e.g. [Compliance], [Verdict]) were used to group together certain synonyms, antonyms as well as equivalent terms. The research identified 165 pairs of candidate equivalents among the 200 Portuguese and English terms that were grouped together into 76 frames. 71% of the pairs of equivalents were considered full equivalents because not only do the verbs evoke the same conceptual scenario but their actantial structures, the linguistic realizations of the actants and their syntactic patterns were similar. 29% of the pairs of equivalents did not entirely meet these criteria and were considered partial equivalents. Reasons for partial equivalence are provided along with illustrative examples. Finally, the study describes the semasiological and onomasiological entry points that JuriDiCo, the bilingual lexical resource compiled during the project, offers to future users.
Resumo:
Afin d'enrichir les données de corpus bilingues parallèles, il peut être judicieux de travailler avec des corpus dits comparables. En effet dans ce type de corpus, même si les documents dans la langue cible ne sont pas l'exacte traduction de ceux dans la langue source, on peut y retrouver des mots ou des phrases en relation de traduction. L'encyclopédie libre Wikipédia constitue un corpus comparable multilingue de plusieurs millions de documents. Notre travail consiste à trouver une méthode générale et endogène permettant d'extraire un maximum de phrases parallèles. Nous travaillons avec le couple de langues français-anglais mais notre méthode, qui n'utilise aucune ressource bilingue extérieure, peut s'appliquer à tout autre couple de langues. Elle se décompose en deux étapes. La première consiste à détecter les paires d’articles qui ont le plus de chance de contenir des traductions. Nous utilisons pour cela un réseau de neurones entraîné sur un petit ensemble de données constitué d'articles alignés au niveau des phrases. La deuxième étape effectue la sélection des paires de phrases grâce à un autre réseau de neurones dont les sorties sont alors réinterprétées par un algorithme d'optimisation combinatoire et une heuristique d'extension. L'ajout des quelques 560~000 paires de phrases extraites de Wikipédia au corpus d'entraînement d'un système de traduction automatique statistique de référence permet d'améliorer la qualité des traductions produites. Nous mettons les données alignées et le corpus extrait à la disposition de la communauté scientifique.
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
Pós-graduação em Estudos Linguísticos - IBILCE
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
This article briefly reviews multilingual language resources for Bulgarian, developed in the frame of some international projects: the first-ever annotated Bulgarian MTE digital lexical resources, Bulgarian-Polish corpus, Bulgarian-Slovak parallel and aligned corpus, and Bulgarian-Polish-Lithuanian corpus. These resources are valuable multilingual dataset for language engineering research and development for Bulgarian language. The multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because the natural language is an outstanding part of the human cultural values and collective memory, and a bridge between cultures.
Resumo:
Scatter/Gather systems are increasingly becoming useful in browsing document corpora. Usability of the present-day systems are restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries/machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.
Resumo:
Computational models of meaning trained on naturally occurring text successfully model human performance on tasks involving simple similarity measures, but they characterize meaning in terms of undifferentiated bags of words or topical dimensions. This has led some to question their psychological plausibility (Murphy, 2002; Schunn, 1999). We present here a fully automatic method for extracting a structured and comprehensive set of concept descriptions directly from an English part-of-speech-tagged corpus. Concepts are characterized by weighted properties, enriched with concept-property types that approximate classical relations such as hypernymy and function. Our model outperforms comparable algorithms in cognitive tasks pertaining not only to concept-internal structures (discovering properties of concepts, grouping properties by property type) but also to inter-concept relations (clustering into superordinates), suggesting the empirical validity of the property-based approach. Copyright © 2009 Cognitive Science Society, Inc. All rights reserved.
Resumo:
The objective of this study was to evaluate pregnancy rates of recipients of different breed groups (Nellore and crossbreed), as well as the effects of size and type of the corpus luteum (CL) on plasmatic concentrations of progesterone and pregnancy rates of embryo recipients. A total of 152 heifers were synchronized with progesterone implants and on the day of embryo transfer, previously obtained by superovulation and frozen in ethylene glycol, the diameter and type of the corpus luteum (cavitary and compact) was measured and blood was collected for progesterone measurement. The pregnancy rate was 44.1%, with a diameter of corpus luteum higher in recipients that became pregnant (2.03±0.41) compared with non-pregnant ones (1.86±0.34 cm). Plasmatic concentrations of progesterone did not differ between pregnant (1.50±1.05) and non-pregnant (1.31±0.91 ng/mL) animals. The type of corpus luteum did not influence the pregnancy rates. Only Angus and crossbred Marchigiana differ among themselves in pregnancy rates (33.3 and 59.2%, respectively). The pregnancy probability was affected only by CL diameter, but not by P4 plasmatic concentration. Selection of the corpus luteum size at the time of embryo transfer is an important factor to increase pregnancy rates in recipients, and compact and cavitary corpora lutea do not influence the pregnancy rates of bovine embryo recipients. Nellore recipients have pregnancy rates that are satisfactory and comparable to crossbred (Bos taurus × Bos indicus) recipients.
Resumo:
O objetivo principal deste trabalho foi propor uma reflexão sobre o processo a ser utilizado para a elaboração de um léxico bilíngüe na subárea de cardiologia. Para tanto, tomamos como base os conceitos dos estudos da tradução baseados em corpus, da lingüística de corpus e da terminologia. Como material para compor os corpora utilizamos artigos de cardiologia escritos em português e traduzidos para o inglês, assim como artigos originalmente escritos em português e em inglês. Com base no léxico proposto, pudemos notar algumas diferenças e algumas correspondências de uso entre os termos que aparecem no subcorpus de estudo de textos originais e traduzidos e nos corpora comparáveis em português e em inglês. Essa diferença apontaria que os termos não seriam unívocos dentro dessa linguagem de especialidade devido às diferenças de uso pelos especialistas de cardiologia para designar um mesmo referente.