7 resultados para Corpus annotation

em Bulgarian Digital Mathematics Library at IMI-BAS


Relevância:

70.00% 70.00%

Publicador:

Resumo:

The article briefly reviews bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus is collected and developed as results of the collaboration in the frameworks of the joint research project between Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. The multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because the natural language is an outstanding part of the human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to the contrastive studies of the both Slavic languages, will also be useful resource for language engineering research and development, especially in machine translation.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This article briefly reviews multilingual language resources for Bulgarian, developed in the frame of some international projects: the first-ever annotated Bulgarian MTE digital lexical resources, Bulgarian-Polish corpus, Bulgarian-Slovak parallel and aligned corpus, and Bulgarian-Polish-Lithuanian corpus. These resources are valuable multilingual dataset for language engineering research and development for Bulgarian language. The multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because the natural language is an outstanding part of the human cultural values and collective memory, and a bridge between cultures.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The paper presents our considerations related to the creation of a digital corpus of Bulgarian dialects. The dialectological archive of Bulgarian language consists of more than 250 audio tapes. All tapes were recorded between 1955 and 1965 in the course of regular dialectological expeditions throughout the country. The records typically contain interviews with inhabitants of small villages in Bulgaria. The topics covered are usually related to such issues as birth, everyday life, marriage, family relationship, death, etc. Only a few tapes contain folk songs from different regions of the country. Taking into account the progressive deterioration of the magnetic media and the realistic prospects of data loss, the Institute for Bulgarian Language at the Academy of Sciences launched in 1997 a project aiming at restoration and digital preservation of the dialectological archive. Within the framework of this project more than the half of the records was digitized, de-noised and stored on digital recording media. Since then restoration and digitization activities are done in the Institute on a regular basis. As a result a large collection of sound files has been gathered. Our further efforts are aimed at the creation of a digital corpus of Bulgarian dialects, which will be made available for phonological and linguistic research. Such corpora typically include besides the sound files two basic elements: a transcription, aligned with the sound file, and a set of standardized metadata that defines the corpus. In our work we will present considerations on how these tasks could be realized in the case of the corpus of Bulgarian dialects. Our suggestions will be based on a comparative analysis of existing methods and techniques to build such corpora, and by selecting the ones that fit closer to the particular needs. Our experience can be used in similar institutions storing folklore archives, history related spoken records etc.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

* The following text has been originally published in the Proceedings of the Language Recourses and Evaluation Conference held in Lisbon, Portugal, 2004, under the title of "Towards Intelligent Written Cultural Heritage Processing - Lexical processing". I present here a revised contribution of the aforementioned paper and I add here the latest efforts done in the Center for Computational Linguistic in Prague in the field under discussion.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper describes the followed methodology to automatically generate titles for a corpus of questions that belong to sociological opinion polls. Titles for questions have a twofold function: (1) they are the input of user searches and (2) they inform about the whole contents of the question and possible answer options. Thus, generation of titles can be considered as a case of automatic summarization. However, the fact that summarization had to be performed over very short texts together with the aforementioned quality conditions imposed on new generated titles led the authors to follow knowledge-rich and domain-dependent strategies for summarization, disregarding the more frequent extractive techniques for summarization.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The key to prosperity in today's world is access to digital content and skills to create new content. Investigations of folklore artifacts is the topic of this article, presenting research related to the national program „Knowledge Technologies for Creation of Digital Presentation and Significant Repositories of Folklore Heritage” (FolkKnow). FolkKnow aims to build a digital multimedia archive "Bulgarian Folklore Heritage” (BFH) and virtual information portal with folk media library of digitized multimedia objects from a selected collection of the fund of Institute of Ethnology and Folklore Studies with Ethnographic Museum (IEFSEM) of the Bulgarian Academy of Science (BAS). The realization of the project FolkKnow gives opportunity for wide social applications of the multimedia collections, for the purposes of Interactive distance learning/self-learning, research activities in the field of Bulgarian traditional culture and for the cultural and ethno-tourism. We study, analyze and implement techniques and methods for digitization of multimedia objects and their annotation. In the paper are discussed specifics approaches used to building and protect a digital archive with multimedia content. Tasks can be systematized in the following guidelines: * Digitization of the selected samples * Analysis of the objects in order to determine the metadata of selected artifacts from selected collections and problem areas * Digital multimedia archive * Socially-oriented applications and virtual exhibitions artery * Frequency dictionary tool for texts with folklore themes * A method of modern technologies of protecting intellectual property and copyrights on digital content developed for use in digital exposures.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The paper relates about our ongoing work on the creation of a corpus of Bulgarian and Ukrainian parallel texts. We discuss some differences in the approaches and the interpretation of some concepts, as well as various problems associated with the construction of our corpus, in particular the occasional ‘nonparallelism’ of original and translated texts. We give examples of the application of the parallel corpus for the study of lexical semantics and note the outstanding role of the corpus in the lexicographic description of Ukrainian and Bulgarian translation equivalents. We draw attention to the importance of creating parallel corpora as objects of national as well as global cultural heritage.