969 resultados para linguistic corpora
Resumo:
We introduce a new tool for correcting OCR errors of materials in a repository of cultural materials. The poster is aimed to all who are interested in digital humanities and who might find our tool useful. The poster will focus on the OCR correction tool and on the background processes. We have started a project on materials published in Finno-Ugric languages in the Soviet Union in the 1920s and 1930s. The materials are digitised in Russia. As they arrive, we publish them in DSpace (fennougrica.kansalliskirjasto.fi). For research purposes, the results of the OCR must be corrected manually. For this we have built a new tool. Although similar tools exist, we found in-house development necessary in order to serve the researchers' needs. The tool enables exporting the corrected text as required by the researchers. It makes it possible to distribute the correction tasks and their supervision. After a supervisor has approved a text as finalised, the new version of the work will replace the old one in DSpace. The project has - benefitted the small language communities, - opened channels for cooperation in Russia. - increased our capabilities in digital humanities. The OCR correction tool will be available to others.
Resumo:
Poster at Open Repositories 2014, Helsinki, Finland, June 9-13, 2014
Resumo:
This thesis is concerned with the role played by software tools in the analysis and dissemination of linguistic corpora and their contribution to a more widespread adoption of corpora in different fields. Chapter 1 contains an overview of some of the most relevant corpus analysis tools available today, presenting their most interesting features and some of their drawbacks. Chapter 2 begins with an explanation of the reasons why none of the available tools appear to satisfy the requirements of the user community and then continues with technical overview of the current status of the new system developed as part of this work. This presentation is followed by highlights of features that make the system appealing to users and corpus builders (i.e. scholars willing to make their corpora available to the public). The chapter concludes with an indication of future directions for the projects and information on the current availability of the software. Chapter 3 describes the design of an experiment devised to evaluate the usability of the new system in comparison to another corpus tool. Usage of the tool was tested in the context of a documentation task performed on a real assignment during a translation class in a master's degree course. In chapter 4 the findings of the experiment are presented on two levels of analysis: firstly a discussion on how participants interacted with and evaluated the two corpus tools in terms of interface and interaction design, usability and perceived ease of use. Then an analysis follows of how users interacted with corpora to complete the task and what kind of queries they submitted. Finally, some general conclusions are drawn and areas for future work are outlined.
Resumo:
The paper relates about our ongoing work on the creation of a corpus of Bulgarian and Ukrainian parallel texts. We discuss some differences in the approaches and the interpretation of some concepts, as well as various problems associated with the construction of our corpus, in particular the occasional ‘nonparallelism’ of original and translated texts. We give examples of the application of the parallel corpus for the study of lexical semantics and note the outstanding role of the corpus in the lexicographic description of Ukrainian and Bulgarian translation equivalents. We draw attention to the importance of creating parallel corpora as objects of national as well as global cultural heritage.
Resumo:
El objetivo del trabajo es determinar si el uso de un grupo de verbos es propio del español de Argentina o si, por el contrario, se extiende a otros países hispanohablantes. Para ello, se analizan el proceso de derivación verbal, la semántica y el carácter neológico de las voces.
Resumo:
The great amount of text produced every day in the Web turned it as one of the main sources for obtaining linguistic corpora, that are further analyzed with Natural Language Processing techniques. On a global scale, languages such as Portuguese - official in 9 countries - appear on the Web in several varieties, with lexical, morphological and syntactic (among others) differences. Besides, a unified spelling system for Portuguese has been recently approved, and its implementation process has already started in some countries. However, it will last several years, so different varieties and spelling systems coexist. Since PoS-taggers for Portuguese are specifically built for a particular variety, this work analyzes different training corpora and lexica combinations aimed at building a model with high-precision annotation in several varieties and spelling systems of this language. Moreover, this paper presents different dictionaries of the new orthography (Spelling Agreement) as well as a new freely available testing corpus, containing different varieties and textual typologies.
Resumo:
Aquest article presenta una mostra dels resultats de l’anàlisi detallada de locucions, col·locacions i altres elements fraseològics i d’ordre de mots significatius quant a la caracterització del cabal de llenguatge literari de Joan Roís de Corella. Aquesta anàlisi es fa amb metodologia interdisciplinar de base de lingüistica de corpus i de diacronia lingüistica, i amb el concurs de les tecnologies de la informació i la comunicació (humanitats digitals), que s’apliquen a l’anàlisi de l’aportació lèxica i estilística d’un autor clau com és Roís de Corella a fide calibrar el grau de sintonia i, alhora, d’especificitat del seu llenguatge literari; en quin grau coincideix el seu llenguatge literari amb el d’altres grans clàssics culturals de la Corona d’Aragó, i en què basa, alhora, Roís de Corella la clau de la seua mestria estilística.
Resumo:
Neste artigo procuramos reflectir sobre a função dos corpora na observação e análise de fenómenos de uma língua natural bem como na criação de novos recursos de exploração linguísticos que as tecnologias de informação têm vindo a potenciar e a tornar mais eficaz.
Resumo:
The automatic acquisition of lexical associations from corpora is a crucial issue for Natural Language Processing. A lexical association is a recurrent combination of words that co-occur together more often than expected by chance in a given domain. In fact, lexical associations define linguistic phenomena such as idiomes, collocations or compound words. Due to the fact that the sense of a lexical association is not compositionnal, their identification is fundamental for the realization of analysis and synthesis that take into account all the subtilities of the language. In this report, we introduce a new statistically-based architecture that extracts from naturally occurring texts contiguous and non contiguous. For that purpose, three new concepts have been defined : the positional N-gram models, the Mutual Expectation and the GenLocalMaxs algorithm. Thus, the initial text is fisrtly transformed in a set of positionnal N-grams i.e ordered vectors of simple lexical units. Then, an association measure, the Mutual Expectation, evaluates the degree of cohesion of each positional N-grams based on the identification of local maximum values of Mutual Expectation. Great efforts have also been carried out to evaluate our metodology. For that purpose, we have proposed the normalisation of five well-known association measures and shown that both the Mutual Expectation and the GenLocalMaxs algorithm evidence significant improvements comparing to existent metodologies.
Resumo:
The business sphere is a multilingual world where foreign language communication skills are crucial in international relations. It makes employers look for business professionals who have a high level of linguistic competences. Language proficiency increases the chances of negotiation among partners. There are mainly two obstacles that make barriers in formal communication in a foreign language: lack of knowledge of specific linguistic structures or terminology and frequent transitions from one language to another. This paper contributes to the quest for quick access to a wide range of English, Spanish and Russian online databases that provide authentic language samples. Their application may improve communication skills and facilitate preparation for business discourse.
Resumo:
CoCo is a collaborative web interface for the compilation of linguistic resources. In this demo we are presenting one of its possible applications: paraphrase acquisition.
Resumo:
The emerging technologies have recently challenged the libraries to reconsider their role as a mere mediator between the collections, researchers, and wider audiences (Sula, 2013), and libraries, especially the nationwide institutions like national libraries, haven’t always managed to face the challenge (Nygren et al., 2014). In the Digitization Project of Kindred Languages, the National Library of Finland has become a node that connects the partners to interplay and work for shared goals and objectives. In this paper, I will be drawing a picture of the crowdsourcing methods that have been established during the project to support both linguistic research and lingual diversity. The National Library of Finland has been executing the Digitization Project of Kindred Languages since 2012. The project seeks to digitize and publish approximately 1,200 monograph titles and more than 100 newspapers titles in various, and in some cases endangered Uralic languages. Once the digitization has been completed in 2015, the Fenno-Ugrica online collection will consist of 110,000 monograph pages and around 90,000 newspaper pages to which all users will have open access regardless of their place of residence. The majority of the digitized literature was originally published in the 1920s and 1930s in the Soviet Union, and it was the genesis and consolidation period of literary languages. This was the era when many Uralic languages were converted into media of popular education, enlightenment, and dissemination of information pertinent to the developing political agenda of the Soviet state. The ‘deluge’ of popular literature in the 1920s to 1930s suddenly challenged the lexical orthographic norms of the limited ecclesiastical publications from the 1880s onward. Newspapers were now written in orthographies and in word forms that the locals would understand. Textbooks were written to address the separate needs of both adults and children. New concepts were introduced in the language. This was the beginning of a renaissance and period of enlightenment (Rueter, 2013). The linguistically oriented population can also find writings to their delight, especially lexical items specific to a given publication, and orthographically documented specifics of phonetics. The project is financially supported by the Kone Foundation in Helsinki and is part of the Foundation’s Language Programme. One of the key objectives of the Kone Foundation Language Programme is to support a culture of openness and interaction in linguistic research, but also to promote citizen science as a tool for the participation of the language community in research. In addition to sharing this aspiration, our objective within the Language Programme is to make sure that old and new corpora in Uralic languages are made available for the open and interactive use of the academic community as well as the language societies. Wordlists are available in 17 languages, but without tokenization, lemmatization, and so on. This approach was verified with the scholars, and we consider the wordlists as raw data for linguists. Our data is used for creating the morphological analyzers and online dictionaries at the Helsinki and Tromsø Universities, for instance. In order to reach the targets, we will produce not only the digitized materials but also their development tools for supporting linguistic research and citizen science. The Digitization Project of Kindred Languages is thus linked with the research of language technology. The mission is to improve the usage and usability of digitized content. During the project, we have advanced methods that will refine the raw data for further use, especially in the linguistic research. How does the library meet the objectives, which appears to be beyond its traditional playground? The written materials from this period are a gold mine, so how could we retrieve these hidden treasures of languages out of the stack that contains more than 200,000 pages of literature in various Uralic languages? The problem is that the machined-encoded text (OCR) contains often too many mistakes to be used as such in research. The mistakes in OCRed texts must be corrected. For enhancing the OCRed texts, the National Library of Finland developed an open-source code OCR editor that enabled the editing of machine-encoded text for the benefit of linguistic research. This tool was necessary to implement, since these rare and peripheral prints did often include already perished characters, which are sadly neglected by the modern OCR software developers, but belong to the historical context of kindred languages and thus are an essential part of the linguistic heritage (van Hemel, 2014). Our crowdsourcing tool application is essentially an editor of Alto XML format. It consists of a back-end for managing users, permissions, and files, communicating through a REST API with a front-end interface—that is, the actual editor for correcting the OCRed text. The enhanced XML files can be retrieved from the Fenno-Ugrica collection for further purposes. Could the crowd do this work to support the academic research? The challenge in crowdsourcing lies in its nature. The targets in the traditional crowdsourcing have often been split into several microtasks that do not require any special skills from the anonymous people, a faceless crowd. This way of crowdsourcing may produce quantitative results, but from the research’s point of view, there is a danger that the needs of linguists are not necessarily met. Also, the remarkable downside is the lack of shared goal or the social affinity. There is no reward in the traditional methods of crowdsourcing (de Boer et al., 2012). Also, there has been criticism that digital humanities makes the humanities too data-driven and oriented towards quantitative methods, losing the values of critical qualitative methods (Fish, 2012). And on top of that, the downsides of the traditional crowdsourcing become more imminent when you leave the Anglophone world. Our potential crowd is geographically scattered in Russia. This crowd is linguistically heterogeneous, speaking 17 different languages. In many cases languages are close to extinction or longing for language revitalization, and the native speakers do not always have Internet access, so an open call for crowdsourcing would not have produced appeasing results for linguists. Thus, one has to identify carefully the potential niches to complete the needed tasks. When using the help of a crowd in a project that is aiming to support both linguistic research and survival of endangered languages, the approach has to be a different one. In nichesourcing, the tasks are distributed amongst a small crowd of citizen scientists (communities). Although communities provide smaller pools to draw resources, their specific richness in skill is suited for complex tasks with high-quality product expectations found in nichesourcing. Communities have a purpose and identity, and their regular interaction engenders social trust and reputation. These communities can correspond to research more precisely (de Boer et al., 2012). Instead of repetitive and rather trivial tasks, we are trying to utilize the knowledge and skills of citizen scientists to provide qualitative results. In nichesourcing, we hand in such assignments that would precisely fill the gaps in linguistic research. A typical task would be editing and collecting the words in such fields of vocabularies where the researchers do require more information. For instance, there is lack of Hill Mari words and terminology in anatomy. We have digitized the books in medicine, and we could try to track the words related to human organs by assigning the citizen scientists to edit and collect words with the OCR editor. From the nichesourcing’s perspective, it is essential that altruism play a central role when the language communities are involved. In nichesourcing, our goal is to reach a certain level of interplay, where the language communities would benefit from the results. For instance, the corrected words in Ingrian will be added to an online dictionary, which is made freely available for the public, so the society can benefit, too. This objective of interplay can be understood as an aspiration to support the endangered languages and the maintenance of lingual diversity, but also as a servant of ‘two masters’: research and society.
Resumo:
The National Library of Finland is implementing the Digitization Project of Kindred Languages in 2012–16. Within the project we will digitize materials in the Uralic languages as well as develop tools to support linguistic research and citizen science. Through this project, researchers will gain access to new corpora 329 and to which all users will have open access regardless of their place of residence. Our objective is to make sure that the new corpora are made available for the open and interactive use of both the academic community and the language societies as a whole. The project seeks to digitize and publish approximately 1200 monograph titles and more than 100 newspapers titles in various Uralic languages. The digitization will be completed by the early of 2015, when the Fenno-Ugrica collection would contain around 200 000 pages of editable text. The researchers cannot spend so much time with the material that they could retrieve a satisfactory amount of edited words, so the participation of a crowd in editing work is needed. Often the targets in crowdsourcing have been split into several microtasks that do not require any special skills from the anonymous people, a faceless crowd. This way of crowdsourcing may produce quantitative results, but from the research’s point of view, there is a danger that the needs of linguistic research are not necessarily met. Also, the number of pages is too high to deal with. The remarkable downside is the lack of shared goal or social affinity. There is no reward in traditional methods of crowdsourcing. Nichesourcing is a specific type of crowdsourcing where tasks are distributed amongst a small crowd of citizen scientists (communities). Although communities provide smaller pools to draw resources, their specific richness in skill is suited for the complex tasks with high-quality product expectations found in nichesourcing. Communities have purpose, identity and their regular interactions engenders social trust and reputation. These communities can correspond to research more precisely. Instead of repetitive and rather trivial tasks, we are trying to utilize the knowledge and skills of citizen scientists to provide qualitative results. Some selection must be made, since we are not aiming to correct all 200,000 pages which we have digitized, but give such assignments to citizen scientists that would precisely fill the gaps in linguistic research. A typical task would editing and collecting the words in such fields of vocabularies, where the researchers do require more information. For instance, there’s a lack of Hill Mari words in anatomy. We have digitized the books in medicine and we could try to track the words related to human organs by assigning the citizen scientists to edit and collect words with OCR editor. From the nichesourcing’s perspective, it is essential that the altruism plays a central role, when the language communities involve. Upon the nichesourcing, our goal is to reach a certain level of interplay, where the language communities would benefit on the results. For instance, the corrected words in Ingrian will be added onto the online dictionary, which is made freely available for the public and the society can benefit too. This objective of interplay can be understood as an aspiration to support the endangered languages and the maintenance of lingual diversity, but also as a servant of “two masters”, the research and the society.