969 results for linguistic corpora
Abstract:
This research focuses on Native Language Identification (NLID), and in particular, on the linguistic identifiers of L1 Persian speakers writing in English. This project comprises three sub-studies; the first study devises a coding system to account for interlingual features present in a corpus of L1 Persian speakers blogging in English, and a corpus of L1 English blogs. Study One then demonstrates that it is possible to use interlingual identifiers to distinguish authorship by L1 Persian speakers. Study Two examines the coding system in relation to the L1 Persian corpus and a corpus of L1 Azeri and L1 Pashto speakers. The findings of this section indicate that the NLID method and features designed are able to discriminate between L1 influences from different languages. Study Three focuses on elicited data, in which participants were tasked with disguising their language to appear as L1 Persian speakers writing in English. This study indicated that there was a significant difference between the features in the L1 Persian corpus, and the corpus of disguise texts. The findings of this research indicate that NLID and the coding system devised have a very strong potential to aid forensic authorship analysis in investigative situations. Unlike existing research, this project focuses predominantly on blogs, as opposed to student data, making the findings more appropriate to forensic casework data.
Abstract:
'Double-voicing' means that when a person speaks, they have a heightened awareness of the concerns and agendas of others, which is reflected in the ways they adjust their language in response to interlocutors. The Russian philosopher Mikhail Bakhtin famously applied the concept of 'double-voiced discourse' to the world of literature, but only touched upon its relevance to everyday language. This book reveals how 'double-voicing' is an inherent and routine part of spoken interactions within educational and professional contexts. Double-voicing is closely related to the ways in which power relations are constructed between speakers, as it is often used by less powerful speakers to negotiate perceived threats from more powerful others. The book explores how women leaders use double-voicing more than men as a means of gaining acceptance and approval in the workplace. While double-voicing at times indexes a speaker's linguistic insecurity, the book argues that it can be harnessed to demonstrate linguistic expertise.
Abstract:
Corpora—large collections of written and/or spoken text stored and accessed electronically—provide the means of investigating language that is of growing importance academically and professionally. Corpora are now routinely used in the following fields: The production of dictionaries and other reference materials; The development of aids to translation; Language teaching materials; The investigation of ideologies and cultural assumptions; Natural language processing; and The investigation of all aspects of linguistic behaviour, including vocabulary, grammar and pragmatics.
Abstract:
Translation training in the university context needs to train students in the processes, in order to enhance and optimise the product as outcome of these processes. Evaluation of a target text as product has often been accused of being a subjective process, which does not easily lend itself to the type of feedback that could enable students to apply criteria more widely. For students, it often seems as though they make different inappropriate or incorrect choices every time they translate a new text, and the learning process appears unpredictable and haphazard. Within functionalist approaches to translation, with their focus on the target text in terms of functional adequacy to the intended purpose, as stipulated in the translation brief, there are guidelines for text production that can help to develop a more systematic approach not only to text production, but also to translation evaluation. In the context of a focus on user knowledge needs, target language conventions and acceptability, the use of corpora is an indispensable tool for the trainee translator. Evaluation can take place against the student's own reasoned selection process, based on hard evidence, against criteria which currently obtain in the TL and the TL culture. When trainee and evaluator work within the same guidelines, there is more scope for constructive learning and feedback.
Abstract:
The paper presents our considerations related to the creation of a digital corpus of Bulgarian dialects. The dialectological archive of the Bulgarian language consists of more than 250 audio tapes. All tapes were recorded between 1955 and 1965 in the course of regular dialectological expeditions throughout the country. The recordings typically contain interviews with inhabitants of small villages in Bulgaria. The topics covered usually relate to birth, everyday life, marriage, family relationships, death, etc. Only a few tapes contain folk songs from different regions of the country. Taking into account the progressive deterioration of the magnetic media and the realistic prospect of data loss, the Institute for Bulgarian Language at the Academy of Sciences launched a project in 1997 aimed at the restoration and digital preservation of the dialectological archive. Within the framework of this project, more than half of the recordings were digitized, de-noised and stored on digital recording media. Since then, restoration and digitization have been carried out at the Institute on a regular basis. As a result, a large collection of sound files has been gathered. Our further efforts are aimed at the creation of a digital corpus of Bulgarian dialects, which will be made available for phonological and linguistic research. Besides the sound files, such corpora typically include two basic elements: a transcription aligned with the sound file, and a set of standardized metadata that defines the corpus. In our work we present considerations on how these tasks could be realized in the case of the corpus of Bulgarian dialects. Our suggestions are based on a comparative analysis of existing methods and techniques for building such corpora, and on selecting those that best fit its particular needs. Our experience can be used by similar institutions that store folklore archives, history-related spoken records, etc.
Abstract:
False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-level aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words, which uses the Web as a corpus by analyzing the words’ local contexts extracted from the text snippets returned by Google searches. The statistical and semantic measures are then combined into an improved algorithm for the identification of false friends that achieves results almost twice as good as those of previously known algorithms. The evaluation is performed on identifying cognates between Bulgarian and Russian, but the proposed methods could be adopted for other language pairs for which parallel corpora and bilingual glossaries are available.
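The kind of co-occurrence statistic described in this abstract can be illustrated with a minimal sketch. The abstract does not give the authors' actual formula, so the Dice coefficient used here is an assumption chosen for illustration, and the function name `cooccurrence_score` and the toy tokens are hypothetical. The idea is that two similar-looking words that rarely appear in the same aligned sentence pair are candidates for false friends, while words that regularly appear together are more likely to be translations of each other.

```python
from typing import List, Sequence, Tuple

# One aligned pair: (tokens of the source sentence, tokens of the target sentence).
SentencePair = Tuple[List[str], List[str]]


def cooccurrence_score(pairs: Sequence[SentencePair],
                       src_word: str, tgt_word: str) -> float:
    """Dice coefficient over aligned sentence pairs.

    Illustrative assumption only, not the formula from the paper.
    Returns a value in [0, 1]; values near 0 mean the two words
    almost never occur in sentences that are translations of each other.
    """
    n_src = n_tgt = n_both = 0
    for src_tokens, tgt_tokens in pairs:
        in_src = src_word in src_tokens
        in_tgt = tgt_word in tgt_tokens
        n_src += in_src             # sentence pairs containing the source word
        n_tgt += in_tgt             # sentence pairs containing the target word
        n_both += in_src and in_tgt  # pairs containing both words
    if n_src + n_tgt == 0:
        return 0.0
    return 2.0 * n_both / (n_src + n_tgt)


# Hypothetical toy corpus with placeholder tokens (not real language data).
toy_pairs: List[SentencePair] = [
    (["x", "p"], ["x", "q"]),
    (["x", "r"], ["x", "s"]),
    (["f", "p"], ["g", "q"]),
    (["f", "r"], ["g", "s"]),
    (["h", "p"], ["f", "q"]),
]

score_same_meaning = cooccurrence_score(toy_pairs, "x", "x")
score_false_friend = cooccurrence_score(toy_pairs, "f", "f")
```

In the toy data, the identically spelled pair "x"/"x" always co-occurs, whereas "f"/"f" never does, so the second pair would be flagged as a false-friend candidate. A real system would combine such a score with a measure of orthographic similarity and, as the abstract notes, with a semantic measure.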
Abstract:
This research explores how news media reports construct representations of a business crisis through language. In an innovative approach to dealing with the vast pool of potentially relevant texts, media texts concerning the BP Deepwater Horizon oil spill are gathered from three different time points: immediately after the explosion in 2010, one year later in 2011 and again in 2012. The three sets of 'BP texts' are investigated using discourse analysis and semi-quantitative methods within a semiotic framework that gives an account of language at the semiotic levels of sign, code, mythical meaning and ideology. The research finds in the texts three discourses of representation concerning the crisis that show a movement from the ostensibly representational to the symbolic and conventional: a discourse of 'objective factuality', a discourse of 'positioning' and a discourse of 'redeployment'. This progression can be shown to have useful parallels with Peirce's sign classes of Icon, Index and Symbol, with their implied movement from a clear motivation by the Object (in this case the disaster events), to an arbitrary, socially-agreed connection. However, the naturalisation of signs, whereby ideologies are encoded in ways of speaking and writing that present them as 'taken for granted' is at its most complete when it is least discernible. The findings suggest that media coverage is likely to move on from symbolic representation to a new kind of iconicity, through a fourth discourse of 'naturalisation'. Here the representation turns back towards ostensible factuality or iconicity, to become the 'naturalised icon'. This work adds to the study of media representation a heuristic for understanding how the meaning-making of a news story progresses. It offers a detailed account of what the stages of this progression 'look like' linguistically, and suggests scope for future research into both language characteristics of phases and different news-reported phenomena.
Abstract:
* The following text was originally published in the Proceedings of the Language Resources and Evaluation Conference held in Lisbon, Portugal, 2004, under the title "Towards Intelligent Written Cultural Heritage Processing - Lexical processing". I present here a revised version of that paper and add the latest work done at the Center for Computational Linguistics in Prague in the field under discussion.
Abstract:
Systems analysis (SA) is widely used in solving complex and vague problems. The initial stages of SA analyse problems and purposes in order to obtain problems/purposes of lower complexity and vagueness, which are combined into hierarchical structures of problems (SP) and purposes (PS). Managers have to be sure that the PS, and the purpose realizing system (PRS) that can achieve the purposes in the PS, are adequate to the problem to be solved. However, SP/PS are usually not well enough substantiated, because their development relies on collective expertise that uses the logic of natural language and expert estimation methods. For this reason, the scientific foundations of SA cannot be considered completely formed. The structure-and-purpose approach to SA, based on a logic-and-linguistic simulation of the analysis of problems/purposes, is a step towards formalizing the initial stages of SA in order to improve the adequacy of their results, and towards increasing the quality of SA as a whole. Managers of industrial organizing systems who use the approach eliminate logical errors in SP/PS at the early stages of planning, and so will be able to find better solutions to complex and vague problems.
Abstract:
This paper presents research on the linguistic structure of knowledge about Bulgarian bells. The idea of building a semantic structure of Bulgarian bells arose during the “Multimedia fund - BellKnow” project. A large amount of data about bells was collected in this project, covering their structure, history, technical data, etc. This is the first attempt to give a computational-linguistic account of bell knowledge and to deliver a semantic representation of that knowledge. Based on this research, some linguistic components that aim to realize different types of analysis of text objects are implemented in term dictionaries. Thus, we lay the foundation of linguistic analysis services in these digital dictionaries, aiding research into the kinds, number and frequency of the lexical units that constitute various bell objects.
Abstract:
The paper describes three software packages, the main components of a software system for processing and web presentation of Bulgarian language resources: parallel corpora and bilingual dictionaries. The author briefly presents the current versions of the core components “Dictionary” and “Corpus”, as well as the recently developed component “Connection”, which links “Dictionary” and “Corpus”. The components’ main functionalities are described as well. Some examples of the usage of the system’s web applications are included.
Abstract:
Relatively little research on dialect variation has been based on corpora of naturally occurring language. Instead, dialect variation has been studied based primarily on language elicited through questionnaires and interviews. Eliciting dialect data has several advantages, including allowing for dialectologists to select individual informants, control the communicative situation in which language is collected, elicit rare forms directly, and make high-quality audio recordings. Although far less common, a corpus-based approach to data collection also has several advantages, including allowing for dialectologists to collect large amounts of data from a large number of informants, observe dialect variation across a range of communicative situations, and analyze quantitative linguistic variation in large samples of natural language. Although both approaches allow for dialect variation to be observed, they provide different perspectives on language variation and change. The corpus-based approach to dialectology has therefore produced a number of new findings, many of which challenge traditional assumptions about the nature of dialect variation. Most important, this research has shown that dialect variation involves a wider range of linguistic variables and exists across a wider range of language varieties than has previously been assumed. The goal of this chapter is to introduce this emerging approach to dialectology. The first part of this chapter reviews the growing body of research that analyzes dialect variation in corpora, including research on variation across nations, regions, genders, ages, and classes, in both speech and writing, and from both a synchronic and diachronic perspective, with a focus on dialect variation in the English language. Although collections of language data elicited through interviews and questionnaires are now commonly referred to as corpora in sociolinguistics and dialectology (e.g. see Bauer 2002; Tagliamonte 2006; Kretzschmar et al. 2006; D'Arcy 2011), this review focuses on corpora of naturally occurring texts and discourse. The second part of this chapter presents the results of an analysis of variation in 'not' contraction across region, gender, and time in a corpus of American English letters to the editor in order to exemplify a corpus-based approach to dialectology.
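As a rough illustration of how a rate of 'not' contraction might be measured in a text sample, the sketch below counts contracted forms ("don't", "can't") against full forms ("does not", "cannot") with regular expressions. The abstract does not describe the chapter's actual procedure, so everything here is an assumption: the short list of auxiliaries, the patterns, and the function name `contraction_rate` are hypothetical, and a real study would need a fuller auxiliary list and handling of forms such as "do you not".

```python
import re
from typing import Optional

# Assumed, deliberately incomplete list of auxiliaries that take "not".
AUX = r"(?:do|does|did|is|are|was|were|have|has|had|would|could|should)"

# Contracted forms such as "don't" or "can't" (straight or curly apostrophe).
CONTRACTED = re.compile(r"\b\w+n['’]t\b", re.IGNORECASE)

# Full forms such as "does not", plus the single word "cannot".
FULL = re.compile(rf"\b(?:{AUX}\s+not|cannot)\b", re.IGNORECASE)


def contraction_rate(text: str) -> Optional[float]:
    """Share of 'not' negations that are contracted, or None if there are none."""
    contracted = len(CONTRACTED.findall(text))
    full = len(FULL.findall(text))
    total = contracted + full
    if total == 0:
        return None
    return contracted / total


sample = "I don't know. She does not agree. We can't go. They cannot stay."
sample_rate = contraction_rate(sample)
```

Computing this rate separately for letters grouped by region, gender, or year would give the kind of comparison the abstract describes, under the stated assumptions.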
Abstract:
In this paper, I concentrate on court cases with litigants in person (lay people who act on their own behalf in legal proceedings without a counsel or solicitor) and discuss the challenges of building a corpus of courtroom discourse where it is crucial to distinguish between speakers due to their distinct institutional roles. The corpus incorporates seven sub-corpora of verbatim transcripts from different court cases with litigants in person and comprises over eleven million tokens. The focus of this paper is on the interplay between the legal and lay discourse types and how judges project their institutional roles through 'well'-initiated turns directed at litigants in person and counsel. As a versatile discourse marker, 'well' provides a good opportunity to explore how judges have to adapt their roles to ensure that lay litigants in person receive the necessary support and that their lack of competence does not impede the fairness of the proceedings. Given the breadth and importance of the topic of litigation in person, I discuss how the tools and approaches of corpus linguistics can be helpful in this multi-disciplinary area where multiple functions and uses of individual linguistic features need to be explored in depth.