9 results for Discourse Markers
in BORIS: Bern Open Repository and Information System - Bern - Switzerland
Abstract:
The lexical items ‘like’ and ‘well’ can serve as discourse markers (DMs), but can also play numerous other roles, such as verb or adverb. Identifying the occurrences that function as DMs is an important step for language understanding by computers. In this study, automatic classifiers using lexical, prosodic/positional and sociolinguistic features are trained over transcribed dialogues, manually annotated with DM information. The resulting classifiers improve state-of-the-art performance of DM identification, at about 90% recall and 79% precision for ‘like’ (84.5% accuracy, κ = 0.69), and 99% recall and 98% precision for ‘well’ (97.5% accuracy, κ = 0.88). Automatic feature analysis shows that lexical collocations are the most reliable indicators, followed by prosodic/positional features, while sociolinguistic features are marginally useful for the identification of DM ‘like’ and not useful for ‘well’. The differentiated processing of each type of DM improves classification accuracy, suggesting that these types should be treated individually.
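For readers unfamiliar with the four scores reported in this abstract, the sketch below shows how precision, recall, accuracy and Cohen's κ can be derived from the confusion-matrix counts of a binary DM / non-DM classifier. The function name `dm_scores` and the example counts are assumptions made for illustration; they are not the study's data or code.

```python
def dm_scores(tp, fp, fn, tn):
    """Score a binary DM / non-DM classifier from confusion-matrix counts.

    tp: DMs labelled as DM          fp: non-DMs labelled as DM
    fn: DMs labelled as non-DM      tn: non-DMs labelled as non-DM
    Assumes non-degenerate counts (no zero denominators).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / total

    # Agreement expected by chance, computed from the marginal frequencies
    # of the predicted labels and the reference labels.
    p_dm = ((tp + fp) / total) * ((tp + fn) / total)
    p_other = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_dm + p_other

    # Cohen's kappa: observed agreement corrected for chance agreement.
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "kappa": kappa,
    }


# Hypothetical counts, NOT taken from the study.
scores = dm_scores(tp=45, fp=5, fn=5, tn=45)
print(scores)
```

Because κ discounts agreement that would occur by chance, it can be noticeably lower than raw accuracy, which is consistent with the gap between the accuracy and κ figures quoted above.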
Abstract:
This article discusses the detection of discourse markers (DMs) in dialogue transcriptions, by human annotators and by automated means. After a theoretical discussion of the definition of DMs and their relevance to natural language processing, we focus on the role of ‘like’ as a DM. Results from experiments with human annotators show that detection of DMs is a difficult but reliable task, which requires prosodic information from soundtracks. Then, several types of features are defined for automatic disambiguation of ‘like’: collocations, part-of-speech tags and duration-based features. Decision-tree learning shows that for ‘like’, nearly 70% precision can be reached, with nearly 100% recall, mainly using collocation filters. Similar results hold for ‘well’, with about 91% precision at 100% recall.
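The idea of a collocation filter can be illustrated with a minimal sketch: occurrences of ‘like’ are kept as DM candidates unless a neighbouring word signals a non-DM use. The word lists `NON_DM_PREV` and `NON_DM_NEXT` and the function `candidate_dm_like` are hypothetical examples chosen for illustration; the abstract does not list the collocations actually used.

```python
# Hypothetical collocation lists, NOT taken from the article.
NON_DM_PREV = {"would", "i'd", "looks", "look", "feels", "feel"}  # e.g. "would like"
NON_DM_NEXT = {"to"}                                               # e.g. "like to"


def candidate_dm_like(tokens):
    """Return indices of 'like' tokens that no collocation filter rules out."""
    candidates = []
    for i, token in enumerate(tokens):
        if token.lower() != "like":
            continue
        prev_word = tokens[i - 1].lower() if i > 0 else ""
        next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        # Discard only occurrences whose neighbours mark a verb or
        # preposition use; everything else stays a DM candidate.
        if prev_word in NON_DM_PREV or next_word in NON_DM_NEXT:
            continue
        candidates.append(i)
    return candidates


verb_use = candidate_dm_like("I would like to go".split())
possible_dm = candidate_dm_like("it was like really cold".split())
print(verb_use, possible_dm)
```

A filter of this kind only removes occurrences it is fairly sure about, so it tends to keep recall high while leaving some non-DM uses among the candidates, which fits the precision/recall pattern described above.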
Abstract:
Kosrae is the most remote island of the Federated States of Micronesia (FSM), with a population of fewer than 7,000 inhabitants, located in the Pacific Ocean between Hawaii and Guam. FSM is an independent sovereign nation consisting of four states: Pohnpei, Chuuk, Yap, and Kosrae. After the islands had passed through the hands of Spain, Germany and Japan, the United States gained administrative control of FSM following WWII, as commissioned by the UN. FSM became an independent nation in 1986 while retaining affiliation with the US under a ‘Compact of Free Association’. Kosraean and English are now the two official languages, and the variety of Kosraean English which has arisen makes for an interesting comparative study. In order to obtain the relevant data, I spent three months on the island of Kosrae, interviewing 90 local speakers, varying in age (16-70), occupation, sex and time spent off the island. The 45-minute interviews were informal but supported by participant information to capture relevant data, and conversations were guided so as to reveal language and cultural attitudes. With reference to these samples, I examine the effects of American English on language use in Kosrae. This paper aims to present a broad analysis of phonological, morphosyntactic and pragmatic features, such as pro-dropping, discourse markers and other practices, in order to demonstrate the similarities and differences between the two varieties, which are coming to shape the variety developing on Kosrae. Having transcribed conversations using the tool Elan, I will put particular focus on [h] deletion and insertion, a rare occurrence found in a variety of post-colonial American English which I believe is of particular interest. I assess the presence of English in Kosrae with reference to sociological influences, past and present.
First, I discuss the extralinguistic factors which have shaped the English that is currently used on Kosrae, including migration between the US and FSM, English as a language of administration, social media usage and visual media presence. Secondly, I assess the use of English in this community in light of Schneider’s (2007) ‘Dynamic Model’, with reference to America’s contribution as an ‘exploitation colony’ as defined by Mufwene (2001). Finally, an overview of the salient linguistic characteristics of Kosraean English, based on the data collected, will be presented and compared to features associated with standard American English with a view to examining overlap and divergence. The overall objective is to present a cross-linguistic description of a hitherto unexamined English emerging in a postcolonial environment with a juxtaposed contact variety.
Mufwene, S. S. (2001). The ecology of language evolution. Cambridge: Cambridge University Press.
Schneider, E. (2007). Postcolonial Englishes. Cambridge: Cambridge University Press.
Segal, H. G. (1989). Kosrae, The Sleeping Lady Awakens. Kosrae: Kosrae Tourist Division, Dept. of Conservation and Development.
Keywords: American English, Global English, Pacific English, Morphosyntactic, Phonological, Variation, Discourse
Abstract:
This paper presents a shallow dialogue analysis model, aimed at human-human dialogues in the context of staff or business meetings. Four components of the model are defined, and several machine learning techniques are used to extract features from dialogue transcripts: maximum entropy classifiers for dialogue acts, latent semantic analysis for topic segmentation, and decision tree classifiers for discourse markers. A rule-based approach is proposed for resolving cross-modal references to meeting documents. The methods are trained and evaluated using a common data set and annotation format. The integration of the components into an automated shallow dialogue parser opens the way to multimodal meeting processing and retrieval applications.
Annotating discourse connectives by looking at their translation: The translation-spotting technique
Abstract:
The various meanings of discourse connectives like ‘while’ and ‘however’ are difficult to identify and annotate, even for trained human annotators. This problem is all the more important because connectives are salient textual markers of cohesion and need to be correctly interpreted for many NLP applications. In this paper, we suggest an alternative route to reach a reliable annotation of connectives, by making use of the information provided by their translation in large parallel corpora. This method thus replaces the difficult explicit reasoning involved in traditional sense annotation with an empirical clustering of the senses emerging from the translations. We argue that this method has the advantage of providing more reliable reference data than traditional sense annotation. In addition, its simplicity allows for the rapid construction of large annotated datasets.
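As a rough illustration of how translations can stand in for sense labels, the sketch below tallies the translations observed for each connective, so that occurrences sharing a translation fall into the same group. The function `translation_profile` and the English-French pairs are invented for this example; they are not the paper's data, corpus or implementation.

```python
from collections import Counter, defaultdict


def translation_profile(spotted_pairs):
    """Count, for each source connective, the translations spotted for it.

    spotted_pairs: iterable of (connective, translation) tuples, one per
    occurrence found in a sentence-aligned parallel corpus.
    """
    profile = defaultdict(Counter)
    for connective, translation in spotted_pairs:
        profile[connective.lower()][translation.lower()] += 1
    return profile


# Toy English-French pairs, invented for illustration only.
spotted = [
    ("while", "alors que"),
    ("while", "pendant que"),
    ("While", "alors que"),
    ("however", "cependant"),
]
profile = translation_profile(spotted)
print(profile["while"].most_common())
```

Each distinct translation, or a group of near-synonymous translations, can then be inspected as a candidate sense cluster, instead of asking annotators to reason about the sense of every occurrence directly.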