872 resultados para Corpus comparable


Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper is a progress report on a research path I first outlined in my contribution to “Words in Context: A Tribute to John Sinclair on his Retirement” (Heffer and Sauntson, 2000). Therefore, I first summarize that paper here, in order to provide the relevant background. The second half of the current paper consists of some further manual analyses, exploring various parameters and procedures that might assist in the design of an automated computational process for the identification of lexical sets. The automation itself is beyond the scope of the current paper.

Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Almost everyone who has an email account receives from time to time unwanted emails. These emails can be jokes from friends or commercial product offers from unknown people. In this paper we focus on these unwanted messages which try to promote a product or service, or to offer some “hot” business opportunities. These messages are called junk emails. Several methods to filter junk emails were proposed, but none considers the linguistic characteristics of junk emails. In this paper, we investigate the linguistic features of a corpus of junk emails, and try to decide if they constitute a distinct genre. Our corpus of junk emails was build from the messages received by the authors over a period of time. Initially, the corpus consisted of 1563, but after eliminating the duplications automatically we kept only 673 files, totalising just over 373,000 tokens. In order to decide if the junk emails constitute a different genre, a comparison with a corpus of leaflets extracted from BNC and with the whole BNC corpus is carried out. Several characteristics at the lexical and grammatical levels were identified.

Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A set of full-color images of objects is described for use in experiments investigating the effects of in-depth rotation on the identification of three-dimensional objects. The corpus contains up to 11 perspective views of 70 nameable objects. We also provide ratings of the "goodness" of each view, based on Thurstonian scaling of subjects' preferences in a paired-comparison experiment. An exploratory cluster analysis on the scaling solutions indicates that the amount of information available in a given view generally is the major determinant of the goodness of the view. For instance, objects with an elongated front-back axis tend to cluster together, and the front and back views of these objects, which do not reveal the object's major surfaces and features, are evaluated as the worst views.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This study presents a detailed contrastive description of the textual functioning of connectives in English and Arabic. Particular emphasis is placed on the organisational force of connectives and their role in sustaining cohesion. The description is intended as a contribution for a better understanding of the variations in the dominant tendencies for text organisation in each language. The findings are expected to be utilised for pedagogical purposes, particularly in improving EFL teaching of writing at the undergraduate level. The study is based on an empirical investigation of the phenomenon of connectivity and, for optimal efficiency, employs computer-aided procedures, particularly those adopted in corpus linguistics, for investigatory purposes. One important methodological requirement is the establishment of two comparable and statistically adequate corpora, also the design of software and the use of existing packages and to achieve the basic analysis. Each corpus comprises ca 250,000 words of newspaper material sampled in accordance to a specific set of criteria and assembled in machine readable form prior to the computer-assisted analysis. A suite of programmes have been written in SPITBOL to accomplish a variety of analytical tasks, and in particular to perform a battery of measurements intended to quantify the textual functioning of connectives in each corpus. Concordances and some word lists are produced by using OCP. Results of these researches confirm the existence of fundamental differences in text organisation in Arabic in comparison to English. This manifests itself in the way textual operations of grouping and sequencing are performed and in the intensity of the textual role of connectives in imposing linearity and continuity and in maintaining overall stability. Furthermore, computation of connective functionality and range of operationality has identified fundamental differences in the way favourable choices for text organisation are made and implemented.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

University students encounter difficulties with academic English because of its vocabulary, phraseology, and variability, and also because academic English differs in many respects from general English, the language which they have experienced before starting their university studies. Although students have been provided with many dictionaries that contain some helpful information on words used in academic English, these dictionaries remain focused on the uses of words in general English. There is therefore a gap in the dictionary market for a dictionary for university students, and this thesis provides a proposal for such a dictionary (called the Dictionary of Academic English; DOAE) in the form of a model which depicts how the dictionary should be designed, compiled, and offered to students. The model draws on state-of-the-art techniques in lexicography, dictionary-use research, and corpus linguistics. The model demanded the creation of a completely new corpus of academic language (Corpus of Academic Journal Articles; CAJA). The main advantages of the corpus are its large size (83.5 million words) and balance. Having access to a large corpus of academic language was essential for a corpus-driven approach to data analysis. A good corpus balance in terms of domains enabled a detailed domain-labelling of senses, patterns, collocates, etc. in the dictionary database, which was then used to tailor the output according to the needs of different types of student. The model proposes an online dictionary that is designed as an online dictionary from the outset. The proposed dictionary is revolutionary in the way it addresses the needs of different types of student. It presents students with a dynamic dictionary whose contents can be customised according to the user's native language, subject of study, variant spelling preferences, and/or visual preferences (e.g. black and white).

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Based on a corpus of English, German, and Polish spoken academic discourse, this article analyzes the distribution and function of humor in academic research presentations. The corpus is the result of a European research cooperation project consisting of 300,000 tokens of spoken academic language, focusing on the genres research presentation, student presentation, and oral examination. The article investigates difference between the German and English research cultures as expressed in the genre of specialist research presentations, and the role of humor as a pragmatic device in their respective contexts. The data is analyzed according to the paradigms of corpus-assisted discourse studies (CADS). The findings show that humor is used in research presentations as an expression of discourse reflexivity. They also reveal a considerable difference in the quantitative distribution of humor in research presentations depending on the educational, linguistic, and cultural background of the presenters, thus confirming the notion of different research cultures. Such research cultures nurture distinct attitudes to genres of academic language: whereas in one of the cultures identified researchers conform with the constraints and structures of the genre, those working in another attempt to subvert them, for example by the application of humor. © 2012 Elsevier B.V.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Sentiment analysis concerns about automatically identifying sentiment or opinion expressed in a given piece of text. Most prior work either use prior lexical knowledge defined as sentiment polarity of words or view the task as a text classification problem and rely on labeled corpora to train a sentiment classifier. While lexicon-based approaches do not adapt well to different domains, corpus-based approaches require expensive manual annotation effort. In this paper, we propose a novel framework where an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon with preferences on expectations of sentiment labels of those lexicon words being expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatical domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie-review data and the multi-domain sentiment dataset show that our approach attains comparable or better performance than existing weakly-supervised sentiment classification methods despite using no labeled documents.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper discusses three important aspects of John Sinclair’s legacy: the corpus, lexicography, and the notion of ‘corpus-driven’. The corpus represents his concern with the nature of linguistic evidence. Lexicography is for him the canonical mode of language description at the lexical level. And his belief that the corpus should ‘drive’ the description is reflected in his constant attempts to utilize the emergent computer technologies to automate the initial stages of analysis and defer the intuitive, interpretative contributions of linguists to increasingly later stages in the process. Sinclair’s model of corpus-driven lexicography has spread far beyond its initial implementation at Cobuild, to most EFL dictionaries, to native-speaker dictionaries (e.g. the New Oxford Dictionary of English, and many national language dictionaries in emerging or re-emerging speech communities) and bilingual dictionaries (e.g. Collins, Oxford-Hachette).

Relevância:

20.00% 20.00%

Publicador:

Resumo:

UK universities are accepting increasing numbers of students whose L1 is not English on a wide range of programmes at all levels. These students require additional support and training in English, focussing on their academic disciplines. Corpora have been used in EAP since the 1980s, mainly for research, but a growing number of researchers and practitioners have been advocating the use of corpora in EAP pedagogy, and such use is gradually increasing. This paper outlines the processes and factors to be considered in the design and compilation of an EAP corpus (e.g., the selection and acquisition of texts, metadata, data annotation, software tools and outputs, web interface, and screen displays), especially one intended to be used for teaching. Such a corpus would also facilitate EAP research in terms of longitudinal studies, student progression and development, and course and materials design. The paper has been informed by the preparatory work on the EAP subcorpus of the ACORN corpus project at Aston University. © 2007 Elsevier Ltd. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

An expanding corpus of research details the relationship between functional magnetic resonance imaging (fMRI) measures and neuronal network oscillations. Typically, integratedelectroencephalography(EEG) and fMRI,orparallel magnetoencephalography (MEG) and fMRI are used to draw inference about the consanguinity of BOLD and electrical measurements. However, there is a relative dearth of information about the relationship between E/MEG and the focal networks from which these signals emanate. Consequently, the genesis and composition of E/MEG oscillations requires further clarification. Here we aim to contribute to understanding through a series of parallel measurements of primary motor cortex (M1) oscillations, using human MEG and in-vitro rodent local field potentials. We compare spontaneous activity in the ~10Hz mu and 15-30Hz beta frequency ranges and compare MEG signals with independent and integrated layers III and V(LIII/LV) from in vitro recordings. We explore the mechanisms of oscillatory generation, using specific pharmacological modulation with the GABA-A alpha-1 subunit modulator zolpidem. Finally, to determine the contribution of cortico-cortical connectivity, we recorded in-vitro M1, during an incision to sever lateral connections between M1 and S1 cortices. We demonstrate that frequency distribution of MEG signals appear have closer statistically similarity with signals from integrated rather than independent LIII/LV laminae. GABAergic modulation in both modalities elicited comparable changes in the power of the beta band. Finally, cortico-cortical connectivity in sensorimotor cortex (SMC) appears to directly influence the power of the mu rhythm in LIII. These findings suggest that the MEG signal is an amalgam of outputs from LIII and LV, that multiple frequencies can arise from the same cortical area and that in vitro and MEG M1 oscillations are driven by comparable mechanisms. Finally, corticocortical connectivity is reflected in the power of the SMC mu rhythm. © 2013 Ronnqvist, Mcallister, Woodhall, Stanford and Hall.