Handling OOV Words in Phrase-Based Statistical Machine Translation for Malayalam


Author(s): Santhosh Kumar, G; Mary, Priya Sebastian
Date(s)

21/07/2014

21/07/2014

09/02/2013

Abstract

Statistical Machine Translation (SMT) is one of the potential applications in the field of Natural Language Processing. The translation process in SMT is carried out by acquiring translation rules automatically from parallel corpora. However, for many language pairs (e.g. Malayalam-English), parallel corpora are available only in very limited quantities. Therefore, for these language pairs, a large portion of the phrases encountered at run-time will be unknown. This paper focuses on methods for handling such out-of-vocabulary (OOV) words in Malayalam that cannot be translated to English using conventional phrase-based statistical machine translation systems. The OOV words in the source sentence are pre-processed to obtain the root word and its suffix. Different inflected forms of the OOV root are generated, and the phrase translation table of the translation model is searched for matches to these word variants. A vocabulary filter chooses the best among the translations of these word variants based on their unigram counts. A match for the OOV suffix is also looked up in the phrase entries, and the matching target translations are filtered. The filtered phrases are then structured, and the SMT translation model is extended by adding the OOV word with its new phrase translations. Manual evaluation shows that the number of OOV words in the input is reduced considerably.
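
The steps named in the abstract (root and suffix splitting, variant generation, phrase-table lookup, unigram-based vocabulary filtering, and phrase-table extension) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The suffix list, phrase table, unigram counts, and every function name are hypothetical, and the romanized tokens are toy placeholders joined by plain concatenation, which ignores the sound changes of real Malayalam morphology. The word order used when combining the suffix and root translations is also an assumption, since the abstract does not specify how the filtered phrases are structured.

```python
from collections import Counter

# Hypothetical toy data; none of it comes from the paper.
SUFFIXES = ["kal", "ude", "il"]  # placeholder romanized suffixes
PHRASE_TABLE = {                 # source phrase -> candidate English translations
    "veedu": ["house", "home"],
    "veedukal": ["houses"],
    "il": ["in"],
}
TARGET_UNIGRAMS = Counter({"house": 50, "home": 30, "houses": 12, "in": 400})


def split_root_suffix(word, suffixes):
    """Strip the longest matching suffix; return (root, suffix) or (word, '')."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


def generate_variants(root, suffixes):
    """Return the bare root plus the root joined with each known suffix."""
    return [root] + [root + s for s in suffixes]


def best_translation(candidates, unigram_counts):
    """Vocabulary filter: pick the candidate with the highest unigram count."""
    if not candidates:
        return None
    return max(candidates, key=lambda w: unigram_counts[w])


def translate_oov(word, phrase_table, suffixes, unigram_counts):
    """Return (root translation, suffix translation); either may be None."""
    root, suffix = split_root_suffix(word, suffixes)
    candidates = []
    for variant in generate_variants(root, suffixes):
        candidates.extend(phrase_table.get(variant, []))
    root_tr = best_translation(candidates, unigram_counts)
    suffix_candidates = phrase_table.get(suffix, []) if suffix else []
    suffix_tr = best_translation(suffix_candidates, unigram_counts)
    return root_tr, suffix_tr


def extend_phrase_table(word, phrase_table, suffixes, unigram_counts):
    """Add the OOV word with a new translation if a root translation is found."""
    root_tr, suffix_tr = translate_oov(word, phrase_table, suffixes, unigram_counts)
    if root_tr is None:
        return False
    # Assumption: the suffix translation precedes the root translation.
    parts = [p for p in (suffix_tr, root_tr) if p]
    phrase_table[word] = [" ".join(parts)]
    return True


table = dict(PHRASE_TABLE)  # copy so the toy table above stays unchanged
added = extend_phrase_table("veeduil", table, SUFFIXES, TARGET_UNIGRAMS)
print(added, table.get("veeduil"))
```

In this toy run the placeholder word "veeduil" is absent from the table, so it is split into "veedu" and "il". The variants of the root yield three candidate translations, and the unigram filter keeps the most frequent one before the new entry is added.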

Cochin University of Science and Technology

Identifier

http://dyuthi.cusat.ac.in/purl/4157

Language(s)

en

Keywords #SMT #OOV words #out-of-vocabulary #unknown words #phrase translation #Machine Translation #Malayalam Translation
Type

Article