986 results for Machine Translation
Abstract:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Abstract:
Machine translation systems have been increasingly used for the translation of large volumes of specialized texts. The efficiency of these systems depends directly on the implementation of strategies for controlling the lexical use of source texts as a way to guarantee machine performance and, ultimately, human revision and post-editing work. This paper presents a brief history of the application of machine translation, introduces the concepts of lexicon and ambiguity, and focuses on some of the lexical control strategies presently used, discussing their possible implications for the production and reading of specialized texts.
Abstract:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Abstract:
This paper analyzes how machine translation has changed the way translation is conceived and practiced in the information age. From a brief review of the early designs of machine translation programs, I discuss the changes implemented in the past decades in these systems to combine mechanical processing and the accessory work by the translator.
Abstract:
The classification of texts has become a major endeavor with so much electronic material available, for it is an essential task in several applications, including search engines and information retrieval. There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of the quality of machine-translated texts and authorship recognition. We show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Nevertheless, the Katz similarity, which involves both semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well. Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies. (c) 2012 Elsevier B.V. All rights reserved.
Abstract:
The realization that statistical physics methods can be applied to analyze written texts represented as complex networks has led to several developments in natural language processing, including automatic summarization and evaluation of machine translation. Most importantly, so far only a few metrics of complex networks have been used and therefore there is ample opportunity to enhance the statistics-based methods as new measures of network topology and dynamics are created. In this paper, we employ for the first time the metrics betweenness, vulnerability and diversity to analyze written texts in Brazilian Portuguese. Using strategies based on diversity metrics, a better performance in automatic summarization is achieved in comparison to previous work employing complex networks. With an optimized method the Rouge score (an automatic evaluation method used in summarization) was 0.5089, which is the best value ever achieved for an extractive summarizer with statistical methods based on complex networks for Brazilian Portuguese. Furthermore, the diversity metric can detect keywords with high precision, which is why we believe it is suitable to produce good summaries. It is also shown that incorporating linguistic knowledge through a syntactic parser does enhance the performance of the automatic summarizers, as expected, but the increase in the Rouge score is only minor. These results reinforce the suitability of complex network methods for improving automatic summarizers in particular, and treating text in general. (C) 2011 Elsevier B.V. All rights reserved.
Abstract:
The automatic disambiguation of word senses (i.e., the identification of which of the meanings is used in a given context for a word that has multiple meanings) is essential for such applications as machine translation and information retrieval, and represents a key step for developing the so-called Semantic Web. Humans disambiguate words in a straightforward fashion, but this does not apply to computers. In this paper we address the problem of Word Sense Disambiguation (WSD) by treating texts as complex networks, and show that word senses can be distinguished upon characterizing the local structure around ambiguous words. Our goal was not to obtain the best possible disambiguation system, but we nevertheless found that in half of the cases our approach outperforms traditional shallow methods. We show that the hierarchical connectivity and clustering of words are usually the most relevant features for WSD. The results reported here shed light on the relationship between semantic and structural parameters of complex networks. They also indicate that when combined with traditional techniques the complex network approach may be useful to enhance the discrimination of senses in large texts. Copyright (C) EPLA, 2012
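To make the idea of characterizing the local structure around an ambiguous word more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a word adjacency network in which consecutive words are linked, and it computes the degree and the local clustering coefficient of one word. The function names `build_adjacency_network` and `clustering_coefficient`, as well as the toy sentence, are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency_network(tokens):
    """Link each word to the word that follows it (undirected, unweighted)."""
    neighbors = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # ignore self-loops from repeated words
            neighbors[left].add(right)
            neighbors[right].add(left)
    return neighbors


def clustering_coefficient(neighbors, word):
    """Fraction of pairs of the word's neighbors that are themselves linked."""
    linked = neighbors.get(word, set())
    k = len(linked)
    if k < 2:
        return 0.0
    closed = sum(1 for a, b in combinations(linked, 2) if b in neighbors[a])
    return closed / (k * (k - 1) / 2)


# Toy text (hypothetical); a real study would use large corpora.
tokens = "the bank raised rates while the river bank flooded the town".split()
network = build_adjacency_network(tokens)
features = {
    "degree": len(network["bank"]),
    "clustering": clustering_coefficient(network, "bank"),
}
```

In a setting like the one described, such structural features would be computed for each occurrence of an ambiguous word and passed to a classifier; that step is omitted here.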
Abstract:
Computer-assisted translation (or computer-aided translation, CAT) is a form of language translation in which a human translator uses computer software to facilitate the translation process. Machine translation (MT) is the automated process by which a computerized system produces a translated text or speech from one natural language to another. Both are leading and promising technologies in the translation industry; it therefore seems important that translation students and professional translators become familiar with these relatively new types of technology. When used together, these two different types of systems might not only reduce translation time but also lead to further improvement in the field of translation technologies. The dissertation consists of four chapters. The first surveys the chronological development of MT and CAT tools, the emergence of pre-editing, post-editing and controlled language, and the latest frontiers in this sector. The second provides a general overview of the four main CAT tools that are used nowadays and tested here. The third chapter is dedicated to the experiments that were conducted to analyze and evaluate the performance of the four integrated systems that are the core subject of this dissertation. Finally, the fourth chapter deals with the issue of terminological equivalence in interlinguistic translation. The purpose of this dissertation is not to provide an objective and definitive solution to the complex issues that arise in the field of translation technologies, an aim that is still far from being achieved, but to supply information about the limits and potential of those instruments, which are now essential to any professional translator.
Abstract:
This dissertation was conducted within the project Language Toolkit, which aims to integrate the worlds of work and university. In particular, it consists of the translation into English of documents commissioned by the Italian company TR Turoni, and its primary purpose is to demonstrate that, in the field of translation for companies, the existing translation support tools and software can optimise and facilitate the translation process. The work consists of five chapters. The first introduces the Language Toolkit project, the TR Turoni company and its relationship with the CERMAC export consortium. After outlining the current state of company internationalisation, it highlights the importance of professional translators in enhancing the competitiveness of companies that enter new international markets. Chapter two provides an overview of the texts to be translated, focusing on the textual function and typology and on the addressees. After that, manual translation and the main software developed specifically for translators are described, with a focus on computer-assisted translation (CAT) and machine translation (MT). The third chapter presents the target texts and the corresponding translations. Chapter four is dedicated to the analysis of the translation process. The first two texts were translated manually, with the support of a purpose-built specialized corpus. The following two documents were translated with the software SDL Trados Studio 2011 and its applications. The last texts were submitted to the Google Translate service and to a process of pre- and post-editing. Finally, chapter five draws conclusions about the main limits and potential of the different translation techniques. In addition, the importance of an integrated use of all available instruments is underlined.
Abstract:
The aim of this dissertation is to provide a translation from English into Italian of a highly specialized scientific article published by the online journal ALTEX. In this text, the authors propose a roadmap for how to overcome the acknowledged scientific gaps for the full replacement of systemic toxicity testing using animals. The main reasons behind this particular choice are my personal interest in specialized translation of scientific texts and in the alternatives to animal testing. Moreover, this translation has been directly requested by the Italian molecular biologist and clinical biochemist Candida Nastrucci. It was not possible to translate the whole article in this project; for this reason, I decided to translate only the introduction, the chapter about skin sensitization, and the conclusion. I intend to use the resources that were created for this project to translate the rest of the article in the near future. In this study, I will show how a translator can translate such a specialized text with the help of a field expert using CAT tools and a specialized corpus. I will also discuss whether machine translation can prove useful to translate this type of document. This work is divided into six chapters. The first one introduces the main topic of the article and explains my reasons for choosing this text; the second one contains an analysis of the text type, focusing on the differences and similarities between Italian and English conventions. The third chapter provides a description of the resources that were used to translate this text, i.e. the corpus and the CAT tools. The fourth one contains the actual translation, side by side with the original text, while the fifth one provides a general comment on the translation difficulties, an analysis of my translation choices and strategies, and a comment about the relationship between the field expert and the translator. Finally, the last chapter shows whether machine translation and post-editing can be an advantageous strategy to translate this type of document. The project also contains two appendixes. The first one includes 54 complex terminological sheets, while the second one includes 188 simple terminological sheets.
Abstract:
This dissertation is part of the Language Toolkit project, which is a collaboration between the School of Foreign Languages and Literature, Interpreting and Translation of the University of Bologna, Forlì campus, and the Chamber of Commerce of Forlì-Cesena. This project aims to create an exchange between translation students and companies that want to pursue a process of internationalization. The purpose of this dissertation is to demonstrate the benefits that translation systems can bring to businesses. In particular, it consists of the translation into English of documents supplied by the Italian company Technologica S.r.l. and the creation of linguistic resources that can be integrated into computer-assisted translation (CAT) software, in order to optimize the translation process. The latter is claimed to be a priority with respect to the actual translation products (the target texts), since the analysis conducted on the source texts highlighted that the company could streamline and optimize its English language communication thanks to the use of open source CAT tools such as OmegaT. The work consists of five chapters. The first introduces the Language Toolkit project, the company (Technologica S.r.l.) and its products. The second chapter provides some considerations about technical translation, its features and some misconceptions about it. The difference between technical translation and scientific translation is then clarified, and an overview is offered of translation aids such as those used for computer-assisted translation, machine translation, termbases and translation memories. The third chapter contains the analysis of the texts commissioned by Technologica S.r.l. and their categorization. The fourth chapter describes the translation process, with particular attention to terminology extraction and the creation of a bilingual glossary based on a specialized corpus. The glossary was integrated into the OmegaT software in order to facilitate the translation process both for the present task and for future applications. The memory deriving from the translation represents a sort of hybrid resource between a translation memory and a glossary. This was found to be the most appropriate format, given the specific nature of the texts to be translated. Finally, chapter five offers conclusions about the importance of language training within a company environment, the potential of translation aids and the benefits that they would bring to a company wishing to internationalize itself.
Abstract:
This thesis stems from an advanced internship carried out at the company CTI (Communication Trend Italia) in Milan. The goals of the internship were to verify whether automatic tools could be introduced into the company's workflow and to identify the text types and language combinations to which they are applicable. The thesis starts from a theoretical analysis of the various aspects of using machine translation (MT) and then describes its practical application in the procedures that led to the creation of the custom systems. Chapter 1 offers a theoretical overview of the world of machine translation, which leads to an outline of the most widespread way of using MT today: the one in which the translation produced by the system is modified through post-editing, or the source text is adjusted through pre-editing to remove the most difficult elements. Chapter 2 starts from an overview of the main machine translation software in use and arrives at a description of Microsoft Translator Hub, the tool chosen for developing CTI's custom systems. In the next step, the focus is on obtaining customized systems. An extensive discussion is devoted to methods for finding and using resources. The thesis then describes the path that led to the creation and development of the two systems Bilanci IT_EN and Atto Costitutivo IT_EN in Microsoft Translator Hub. Finally, in the fourth and last chapter, the output of the two systems is reviewed to identify its characteristics and analyzed with some automatic evaluation tools. Based on the information gathered, some predictions are then made about the future use of the systems at CTI.
Abstract:
Review of this book, which is the author's thesis dissertation.
Abstract:
OntoTag - A Linguistic and Ontological Annotation Model Suitable for the Semantic Web
1. INTRODUCTION. LINGUISTIC TOOLS AND ANNOTATIONS: THEIR LIGHTS AND SHADOWS
Computational Linguistics is already a consolidated research area. It builds upon the results of two other major areas, namely Linguistics and Computer Science and Engineering, and it aims at developing computational models of human language (or natural language, as it is termed in this area). Possibly, its best-known applications are the different tools developed so far for processing human language, such as machine translation systems and speech recognizers or dictation programs.
These tools for processing human language are commonly referred to as linguistic tools. Apart from the examples mentioned above, there are also other types of linguistic tools that are perhaps not so well known, but on which most of the other applications of Computational Linguistics are built. These other types of linguistic tools comprise POS taggers, natural language parsers and semantic taggers, amongst others. All of them can be termed linguistic annotation tools.
Linguistic annotation tools are important assets. In fact, POS and semantic taggers (and, to a lesser extent, also natural language parsers) have become critical resources for the computer applications that process natural language. Hence, any computer application that has to analyse a text automatically and ‘intelligently’ will include at least a module for POS tagging. The more an application needs to ‘understand’ the meaning of the text it processes, the more linguistic tools and/or modules it will incorporate and integrate.
However, linguistic annotation tools still have some limitations, which can be summarised as follows:
1. Normally, they perform annotations only at a certain linguistic level (that is, Morphology, Syntax, Semantics, etc.).
2. They usually introduce a certain rate of errors and ambiguities when tagging. This error rate ranges from 10 percent up to 50 percent of the units annotated for unrestricted, general texts.
3. Their annotations are most frequently formulated in terms of an annotation schema designed and implemented ad hoc.
A priori, it seems that the interoperation and the integration of several linguistic tools into an appropriate software architecture could most likely solve the limitation stated in (1). Besides, integrating several linguistic annotation tools and making them interoperate could also minimise the limitation stated in (2). Nevertheless, in the latter case, all these tools should produce annotations for a common level, which would have to be combined in order to correct their corresponding errors and inaccuracies. Yet, the limitation stated in (3) prevents both types of integration and interoperation from being easily achieved.
In addition, most high-level annotation tools rely on other lower-level annotation tools and their outputs to generate their own. For example, sense-tagging tools (operating at the semantic level) often use POS taggers (operating at a lower level, i.e., the morphosyntactic) to identify the grammatical category of the word or lexical unit they are annotating. Accordingly, if a faulty or inaccurate low-level annotation tool is to be used by another higher-level one in its process, the errors and inaccuracies of the former should be minimised in advance. Otherwise, these errors and inaccuracies would be transferred to (and even magnified in) the annotations of the high-level annotation tool.
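The combination of annotations produced for a common level, as mentioned for limitation (2), can be illustrated with a minimal sketch. It assumes token-by-token majority voting, which is only one possible strategy and is not claimed to be the method used in this work. The function name `combine_by_majority`, the tool outputs and the tag labels are hypothetical.

```python
from collections import Counter


def combine_by_majority(annotations):
    """Merge several tag sequences for the same tokens by majority vote.

    Ties are resolved in favor of the earliest tool in the list.
    """
    lengths = {len(tags) for tags in annotations}
    if len(lengths) != 1:
        raise ValueError("all tools must annotate the same tokens")
    combined = []
    for votes in zip(*annotations):
        counts = Counter(votes)
        best = max(counts.values())
        # Pick the first tag (in tool order) that reaches the top count.
        combined.append(next(tag for tag in votes if counts[tag] == best))
    return combined


# Hypothetical POS output of three tools for "time flies fast".
tool_a = ["NOUN", "VERB", "ADV"]
tool_b = ["NOUN", "NOUN", "ADV"]
tool_c = ["VERB", "VERB", "ADJ"]
merged = combine_by_majority([tool_a, tool_b, tool_c])
```

Voting of this kind only works if the tools share one tag vocabulary, which is exactly what limitation (3) makes difficult in practice.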
Therefore, it would be quite useful to find a way to
(i) correct or, at least, reduce the errors and the inaccuracies of lower-level linguistic tools;
(ii) unify the annotation schemas of different linguistic annotation tools or, more generally speaking, make these tools (as well as their annotations) interoperate.
Clearly, solving (i) and (ii) should ease the automatic annotation of web pages by means of linguistic tools, and their transformation into Semantic Web pages (Berners-Lee, Hendler and Lassila, 2001). Yet, as stated above, (ii) is a type of interoperability problem. There again, ontologies (Gruber, 1993; Borst, 1997) have been successfully applied thus far to solve several interoperability problems. Hence, ontologies should also help solve the aforementioned problems and limitations of linguistic annotation tools.
Thus, to summarise, the main aim of the present work was to combine these separate approaches, mechanisms and tools for annotation from Linguistics and Ontological Engineering (and the Semantic Web) in a sort of hybrid (linguistic and ontological) annotation model, suitable for both areas. This hybrid (semantic) annotation model should (a) benefit from the advances, models, techniques, mechanisms and tools of these two areas; (b) minimise (and even solve, when possible) some of the problems found in each of them; and (c) be suitable for the Semantic Web. The concrete goals that helped attain this aim are presented in the following section.
2. GOALS OF THE PRESENT WORK
As mentioned above, the main goal of this work was to specify a hybrid (that is, linguistically-motivated and ontology-based) model of annotation suitable for the Semantic Web (i.e. it had to produce a semantic annotation of web page contents). This entailed that the tags included in the annotations of the model had to (1) represent linguistic concepts (or linguistic categories, as they are termed in ISO/DCR (2008)), in order for this model to be linguistically-motivated; (2) be ontological terms (i.e., use an ontological vocabulary), in order for the model to be ontology-based; and (3) be structured (linked) as a collection of ontology-based