34 results for Comparable Corpus
in Aston University Research Archive
Abstract:
We propose a hybrid generative/discriminative framework for semantic parsing which combines the hidden vector state (HVS) model and hidden Markov support vector machines (HM-SVMs). The HVS model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. HM-SVMs combine the advantages of hidden Markov models and support vector machines. By employing a modified K-means clustering method, a small set of the most representative sentences can be automatically selected from an un-annotated corpus. These sentences, together with their abstract annotations, are used to train an HVS model, which is then applied to the whole corpus to generate semantic parsing results. The most confident semantic parsing results are selected to generate a fully-annotated corpus, which is used to train the HM-SVMs. The proposed framework has been tested on the DARPA Communicator Data. Experimental results show that the hybrid framework improves on the baseline HVS parser. When compared with the HM-SVMs trained from the fully-annotated corpus, the hybrid framework gave comparable performance with only a small set of lightly annotated sentences. © 2008. Licensed under the Creative Commons.
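The sentence-selection step can be illustrated with a minimal sketch. It assumes plain K-means over bag-of-words counts and picks the sentence nearest each centroid; the abstract does not describe the authors' modification to K-means, so this is not their method. The function names, the evenly spaced initialisation, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def vectorize(sentences):
    """Turn each sentence into a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vectors = []
    for s in sentences:
        counts = Counter(s.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centroids):
    """Group vector indices by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: distance(v, centroids[c]))
        clusters[nearest].append(idx)
    return clusters


def select_representatives(sentences, k, iterations=20):
    """Plain K-means (an assumption, not the paper's modified variant);
    returns the sentence closest to each non-empty cluster centroid."""
    if not 1 <= k <= len(sentences):
        raise ValueError("k must be between 1 and the number of sentences")
    vectors = vectorize(sentences)
    # Deterministic initialisation (hypothetical choice): evenly spaced sentences.
    step = len(vectors) // k
    centroids = [vectors[i * step] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(vectors, centroids)
        updated = []
        for c, members in enumerate(clusters):
            if members:
                columns = zip(*(vectors[i] for i in members))
                updated.append([sum(col) / len(members) for col in columns])
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    clusters = assign(vectors, centroids)
    chosen = []
    for c, members in enumerate(clusters):
        if members:
            best = min(members, key=lambda i: distance(vectors[i], centroids[c]))
            chosen.append(sentences[best])
    return chosen


if __name__ == "__main__":
    toy = [
        "show me flights to boston",
        "show me flights to denver",
        "show me flights to dallas",
        "what is the fare",
        "what is the price",
        "what is the cost",
    ]
    print(select_representatives(toy, k=2))
```

In the framework described above, only the selected sentences would then need abstract annotation before training the HVS model.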
Abstract:
In this article I argue that the study of the linguistic aspects of epistemology has become unhelpfully focused on the corpus-based study of hedging and that a corpus-driven approach can help to improve upon this. By focusing on a corpus of texts from one discourse community (that of genetics) and identifying frequent tri-lexical clusters containing highly frequent lexical items identified as keywords, I undertake an inductive analysis identifying patterns of epistemic significance. Several of these patterns are shown to be hedging devices, and the whole-corpus frequencies of the most salient of these, ‘candidate’ and ‘putative’, are then compared to the whole-corpus frequencies for comparable wordforms and clusters of epistemic significance. Finally, I interviewed a ‘friendly geneticist’ in order to check my interpretation of some of the terms used and to get an expert interpretation of the overall findings. In summary, I argue that the highly unexpected patterns of hedging found in genetics demonstrate the value of adopting a corpus-driven approach and constitute an advance in our current understanding of how to approach the relationship between language and epistemology.
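The cluster-identification step can be sketched as follows. This is a minimal illustration that assumes a "tri-lexical cluster" is a contiguous three-word sequence and that the keyword set is supplied by hand; the abstract does not state how keywords were computed or how clusters were delimited, and the function name, threshold, and sample text are hypothetical.

```python
from collections import Counter


def keyword_trigrams(tokens, keywords, min_count=2):
    """Count contiguous three-word sequences that contain at least one keyword.

    Returns (trigram, frequency) pairs with frequency >= min_count,
    most frequent first.
    """
    counts = Counter()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if any(word in keywords for word in trigram):
            counts[trigram] += 1
    return [(tri, n) for tri, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the genetics corpus.
    text = "a candidate gene was found and a candidate gene was cloned"
    for trigram, freq in keyword_trigrams(text.split(), {"gene"}):
        print(" ".join(trigram), freq)
```

The resulting frequency list is only a starting point; deciding which clusters carry epistemic significance remains an interpretive step.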
Abstract:
Corpus Linguistics is a young discipline. The earliest work was done in the 1960s, but corpora only began to be widely used by lexicographers and linguists in the late 1980s, by language teachers in the late 1990s, and by language students only very recently. This course in corpus linguistics was held at the Departamento de Linguistica Aplicada, E.T.S.I. de Minas, Universidad Politecnica de Madrid from June 15 to 19, 1998. About 45 teachers registered for the course. Of these, 30% had PhDs in linguistics, 20% had PhDs in literature, and the rest were doctorandi or qualified English teachers. The course was designed to introduce the use of corpora and other computational resources in teaching and research, with special reference to scientific and technological discourse in English. Each participant had a computer networked with the lecturer’s machine, whose display could be projected onto a large screen. Application programs were loaded onto the central server, and telnet and a web browser were available. COBUILD gave us permission to access the 323-million-word Bank of English corpus, Mike Scott allowed us to use his Wordsmith Tools software, and Tim Johns gave us a copy of his MicroConcord program.
Abstract:
Building on Goffman’s definition of frames as general ‘schemata of interpretation’ that people use to ‘locate, perceive, identify, and label’, other scholars have used the concept in a more specific way to analyse media coverage. Frames are used in the sense of organizing devices that allow journalists to select and emphasise topics, to decide ‘what matters’ (Gitlin 1980). Gamson and Modigliani (1989) consider frames as being embedded within ‘media packages’ that can be seen as ‘giving meaning’ to an issue. According to Entman (1993), framing comprises a combination of different activities, such as problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described. Previous research has analysed climate change with the purpose of testing Downs’s model of the issue attention cycle (Trumbo 1996), to uncover media biases in the US press (Boykoff and Boykoff 2004), to highlight differences between nations (Brossard et al. 2004; Grundmann 2007) or to analyse cultural reconstructions of scientific knowledge (Carvalho and Burgess 2005). In this paper we shall present data from a corpus linguistics-based approach. We will be drawing on results of a pilot study conducted in spring 2008 based on the Nexis news media archive. Based on comparative data from the US, the UK, France and Germany, we aim to show how the climate change issue has been framed differently in these countries and how this framing indicates differences in national climate change policies.
Abstract:
This paper asserts the growing importance of academic English in an increasingly Anglophone world, and looks at the differences between academic English and general English, especially in terms of vocabulary. The creation of wordlists has played an important role in trying to establish the academic English lexicon, but these wordlists are either not based on appropriate data or are implemented inappropriately. There is as yet no adequate dictionary of academic English, and this paper reports on new efforts at Aston University to create a suitable corpus on which such a dictionary could be based.
Abstract:
This paper is a progress report on a research path I first outlined in my contribution to “Words in Context: A Tribute to John Sinclair on his Retirement” (Heffer and Sauntson, 2000). Therefore, I first summarize that paper here, in order to provide the relevant background. The second half of the current paper consists of some further manual analyses, exploring various parameters and procedures that might assist in the design of an automated computational process for the identification of lexical sets. The automation itself is beyond the scope of the current paper.
Abstract:
Almost everyone who has an email account receives unwanted emails from time to time. These emails can be jokes from friends or commercial product offers from unknown people. In this paper we focus on those unwanted messages which try to promote a product or service, or to offer some “hot” business opportunities. These messages are called junk emails. Several methods to filter junk emails have been proposed, but none considers the linguistic characteristics of junk emails. In this paper, we investigate the linguistic features of a corpus of junk emails, and try to decide if they constitute a distinct genre. Our corpus of junk emails was built from the messages received by the authors over a period of time. Initially, the corpus consisted of 1563 files, but after eliminating the duplicates automatically we kept only 673 files, totalling just over 373,000 tokens. In order to decide if the junk emails constitute a different genre, a comparison with a corpus of leaflets extracted from the BNC and with the whole BNC corpus is carried out. Several characteristics at the lexical and grammatical levels were identified.
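The corpus-cleaning step can be sketched as below. The abstract says only that duplicates were eliminated automatically, so this sketch assumes exact-match deduplication after case and whitespace normalisation, and whitespace tokenisation for the token count. Both are assumptions, and the function names and sample messages are hypothetical.

```python
import hashlib


def normalise(text):
    """Lower-case the text and collapse runs of whitespace (assumed normalisation)."""
    return " ".join(text.lower().split())


def deduplicate(messages):
    """Keep the first occurrence of each message, comparing normalised content."""
    seen = set()
    unique = []
    for message in messages:
        digest = hashlib.sha256(normalise(message).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(message)
    return unique


def count_tokens(messages):
    """Count whitespace-separated tokens across all messages (assumed tokenisation)."""
    return sum(len(message.split()) for message in messages)


if __name__ == "__main__":
    # Hypothetical messages, not drawn from the authors' corpus.
    inbox = [
        "Buy now!  Hot offer",
        "buy now! hot offer",
        "Earn money fast",
    ]
    kept = deduplicate(inbox)
    print(len(kept), "messages,", count_tokens(kept), "tokens")
```

Near-duplicates, such as the same offer sent with a different recipient name, would survive this exact-match check; catching them would need a similarity measure rather than a hash.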