886 resultados para Corpus Based Machine Translation
Resumo:
This article describes the developmentof an Open Source shallow-transfer machine translation system from Czech to Polish in theApertium platform. It gives details ofthe methods and resources used in contructingthe system. Although the resulting system has quite a high error rate, it is still competitive with other systems.
Resumo:
This paper proposes to enrich RBMTdictionaries with Named Entities(NEs) automatically acquired fromWikipedia. The method is appliedto the Apertium English-Spanishsystem and its performance comparedto that of Apertium with and withouthandtagged NEs. The system withautomatic NEs outperforms the onewithout NEs, while results vary whencompared to a system with handtaggedNEs (results are comparable forSpanish to English but slightly worstfor English to Spanish). Apart fromthat, adding automatic NEs contributesto decreasing the amount of unknownterms by more than 10%.
Resumo:
This paper describes the development of a two-way shallow-transfer rule-based machine translation system between Bulgarian and Macedonian. It gives an account of the resources and the methods used for constructing the system, including the development of monolingual and bilingual dictionaries, syntactic transfer rules and constraint grammars. An evaluation of thesystem's performance was carried out and compared to another commercially available MT system for the two languages. Some future work was suggested.
Resumo:
Due to the emergence of multiple language support on the Internet, machine translation (MT) technologies are indispensable to the communication between speakers using different languages. Recent research works have started to explore tree-based machine translation systems with syntactical and morphological information. This work aims the development of Syntactic Based Machine Translation from English to Malayalam by adding different case information during translation. The system identifies general rules for various sentence patterns in English. These rules are generated using the Parts Of Speech (POS) tag information of the texts. Word Reordering based on the Syntax Tree is used to improve the translation quality of the system. The system used Bilingual English –Malayalam dictionary for translation.
Resumo:
This thesis summarizes the results on the studies on a syntax based approach for translation between Malayalam, one of Dravidian languages and English and also on the development of the major modules in building a prototype machine translation system from Malayalam to English. The development of the system is a pioneering effort in Malayalam language unattempted by previous researchers. The computational models chosen for the system is first of its kind for Malayalam language. An in depth study has been carried out in the design of the computational models and data structures needed for different modules: morphological analyzer , a parser, a syntactic structure transfer module and target language sentence generator required for the prototype system. The generation of list of part of speech tags, chunk tags and the hierarchical dependencies among the chunks required for the translation process also has been done. In the development process, the major goals are: (a) accuracy of translation (b) speed and (c) space. Accuracy-wise, smart tools for handling transfer grammar and translation standards including equivalent words, expressions, phrases and styles in the target language are to be developed. The grammar should be optimized with a view to obtaining a single correct parse and hence a single translated output. Speed-wise, innovative use of corpus analysis, efficient parsing algorithm, design of efficient Data Structure and run-time frequency-based rearrangement of the grammar which substantially reduces the parsing and generation time are required. The space requirement also has to be minimised
Resumo:
This paper discusses the qualitativecomparative evaluation performed on theresults of two machine translation systemswith different approaches to the processing ofmulti-word units. It proposes a solution forovercoming the difficulties multi-word unitspresent to machine translation by adopting amethodology that combines the lexicongrammar approach with OpenLogos ontologyand semantico-syntactic rules. The paper alsodiscusses the importance of a qualitativeevaluation metrics to correctly evaluate theperformance of machine translation engineswith regards to multi-word units.
Resumo:
Softcatalà is a non-profit associationcreated more than 10 years ago to fightthe marginalisation of the Catalan languagein information and communicationtechnologies. It has led the localisationof many applications and thecreation of a website which allows itsusers to translate texts between Spanishand Catalan using an external closed-sourcetranslation engine. Recently,the closed-source translation back-endhas been replaced by a free/open-sourcesolution completely managed by Softcatalà: the Apertium machine translationplatform and the ScaleMT web serviceframework. Thanks to the opennessof the new solution, it is possibleto take advantage of the huge amount ofusers of the Softcatalà translation serviceto improve it, using a series ofmethods presented in this paper. In addition,a study of the translations requestedby the users has been carriedout, and it shows that the translationback-end change has not affected theusage patterns.
Resumo:
Statistical Machine Translation (SMT) is one of the potential applications in the field of Natural Language Processing. The translation process in SMT is carried out by acquiring translation rules automatically from the parallel corpora. However, for many language pairs (e.g. Malayalam- English), they are available only in very limited quantities. Therefore, for these language pairs a huge portion of phrases encountered at run-time will be unknown. This paper focuses on methods for handling such out-of-vocabulary (OOV) words in Malayalam that cannot be translated to English using conventional phrase-based statistical machine translation systems. The OOV words in the source sentence are pre-processed to obtain the root word and its suffix. Different inflected forms of the OOV root are generated and a match is looked up for the word variants in the phrase translation table of the translation model. A Vocabulary filter is used to choose the best among the translations of these word variants by finding the unigram count. A match for the OOV suffix is also looked up in the phrase entries and the target translations are filtered out. Structuring of the filtered phrases is done and SMT translation model is extended by adding OOV with its new phrase translations. By the results of the manual evaluation done it is observed that amount of OOV words in the input has been reduced considerably
Resumo:
Machine translation has been a particularly difficult problem in the area of Natural Language Processing for over two decades. Early approaches to translation failed since interaction effects of complex phenomena in part made translation appear to be unmanageable. Later approaches to the problem have succeeded (although only bilingually), but are based on many language-specific rules of a context-free nature. This report presents an alternative approach to natural language translation that relies on principle-based descriptions of grammar rather than rule-oriented descriptions. The model that has been constructed is based on abstract principles as developed by Chomsky (1981) and several other researchers working within the "Government and Binding" (GB) framework. Thus, the grammar is viewed as a modular system of principles rather than a large set of ad hoc language-specific rules.
Resumo:
This paper describes a preprocessing module for improving the performance of a Spanish into Spanish Sign Language (Lengua de Signos Espanola: LSE) translation system when dealing with sparse training data. This preprocessing module replaces Spanish words with associated tags. The list with Spanish words (vocabulary) and associated tags used by this module is computed automatically considering those signs that show the highest probability of being the translation of every Spanish word. This automatic tag extraction has been compared to a manual strategy achieving almost the same improvement. In this analysis, several alternatives for dealing with non-relevant words have been studied. Non-relevant words are Spanish words not assigned to any sign. The preprocessing module has been incorporated into two well-known statistical translation architectures: a phrase-based system and a Statistical Finite State Transducer (SFST). This system has been developed for a specific application domain: the renewal of Identity Documents and Driver's License. In order to evaluate the system a parallel corpus made up of 4080 Spanish sentences and their LSE translation has been used. The evaluation results revealed a significant performance improvement when including this preprocessing module. In the phrase-based system, the proposed module has given rise to an increase in BLEU (Bilingual Evaluation Understudy) from 73.8% to 81.0% and an increase in the human evaluation score from 0.64 to 0.83. In the case of SFST, BLEU increased from 70.6% to 78.4% and the human evaluation score from 0.65 to 0.82.
Resumo:
Statistical machine translation (SMT) is an approach to Machine Translation (MT) that uses statistical models whose parameter estimation is based on the analysis of existing human translations (contained in bilingual corpora). From a translation student’s standpoint, this dissertation aims to explain how a phrase-based SMT system works, to determine the role of the statistical models it uses in the translation process and to assess the quality of the translations provided that system is trained with in-domain goodquality corpora. To that end, a phrase-based SMT system based on Moses has been trained and subsequently used for the English to Spanish translation of two texts related in topic to the training data. Finally, the quality of this output texts produced by the system has been assessed through a quantitative evaluation carried out with three different automatic evaluation measures and a qualitative evaluation based on the Multidimensional Quality Metrics (MQM).
Resumo:
Mode of access: Internet.
Resumo:
Word Sense Disambiguation, the process of identifying the meaning of a word in a sentence when the word has multiple meanings, is a critical problem of machine translation. It is generally very difficult to select the correct meaning of a word in a sentence, especially when the syntactical difference between the source and target language is big, e.g., English-Korean machine translation. To achieve a high level of accuracy of noun sense selection in machine translation, we introduced a statistical method based on co-occurrence relation of words in sentences and applied it to the English-Korean machine translator RyongNamSan. ACM Computing Classification System (1998): I.2.7.
Resumo:
Users seeking information may not find relevant information pertaining to their information need in a specific language. But information may be available in a language different from their own, but users may not know that language. Thus users may experience difficulty in accessing the information present in different languages. Since the retrieval process depends on the translation of the user query, there are many issues in getting the right translation of the user query. For a pair of languages chosen by a user, resources, like incomplete dictionary, inaccurate machine translation system may exist. These resources may be insufficient to map the query terms in one language to its equivalent terms in another language. Also for a given query, there might exist multiple correct translations. The underlying corpus evidence may suggest a clue to select a probable set of translations that could eventually perform a better information retrieval. In this paper, we present a cross language information retrieval approach to effectively retrieve information present in a language other than the language of the user query using the corpus driven query suggestion approach. The idea is to utilize the corpus based evidence of one language to improve the retrieval and re-ranking of news documents in the other language. We use FIRE corpora - Tamil and English news collections in our experiments and illustrate the effectiveness of the proposed cross language information retrieval approach.