571 results for Corpora Pedunculata
Abstract:
This paper describes part of the corpus collection efforts underway in the EC-funded Companions project. The Companions project is collecting substantial quantities of dialogue, a large part of which focuses on reminiscing about photographs. The texts are in English and Czech. We describe the context and objectives for which this dialogue corpus is being collected and the methodology being used, and make observations on the resulting data. The corpora will be made available to the wider research community through the Companions Project web site.
Abstract:
Automatic Term Recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. Of the large number of methodologies available in the literature, only a few are able to handle both single-word and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches using two different corpora and show how the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) and less well on the Genia corpus (a standard life science corpus). This indicates that the choice and design of the corpus have a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and account for a fairly large proportion of terms in certain domains. As a result, algorithms that ignore single-word terms may cause problems for tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects, which means that information extraction techniques need to be integrated into the term recognition process.
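The abstract does not say how the voting mechanism is defined, so the following is only a minimal sketch of one plausible scheme: each algorithm casts one vote for every candidate in its top-k ranked list, and candidates are ordered by vote count. The function name `vote_terms`, the algorithm labels, and the toy rankings are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def vote_terms(rankings, top_k=3):
    """Combine ranked candidate-term lists by simple voting.

    rankings maps an algorithm name to its ranked list of candidate terms.
    Each algorithm casts one vote for every term in its top_k.
    """
    votes = defaultdict(int)
    for ranked in rankings.values():
        for term in ranked[:top_k]:
            votes[term] += 1
    # Most votes first; ties broken alphabetically so the output is deterministic.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical output of three term recognisers on the same text.
rankings = {
    "algo_a": ["gene expression", "protein", "cell line", "binding"],
    "algo_b": ["protein", "gene expression", "binding", "cell line"],
    "algo_c": ["protein", "t cell", "gene expression", "kinase"],
}
combined = vote_terms(rankings, top_k=3)
for term, count in combined:
    print(f"{count}  {term}")
```

Note that both single-word ("protein") and multi-word ("gene expression") candidates pass through the same vote, which is the property the abstract stresses.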
Abstract:
On the basis of a transcribed French television corpus made up of two news bulletins, two chat shows and one literary programme recorded in February 2003, this paper explores the claim that the passé simple (PS) may still be used in prepared oral discourse (Pfister 1974). The corpus does not provide support for that use on television, but it seems to suggest a shift from temporal to aspectual features in French television talk: a perfective presentation prevails over a past presentation. This trend would need to be confirmed by a larger television corpus, and tested in other types of oral discourse and on written corpora.
Abstract:
In this study, we investigate crosslinguistic patterns in the alternation between UM, a hesitation marker consisting of a neutral vowel followed by a final labial nasal, and UH, a hesitation marker consisting of a neutral vowel in an open syllable. Based on a quantitative analysis of a range of spoken and written corpora, we identify clear and consistent patterns of change in the use of these forms in various Germanic languages (English, Dutch, German, Norwegian, Danish, Faroese) and dialects (American English, British English), with the use of UM increasing over time relative to the use of UH. We also find that this pattern of change is generally led by women and more educated speakers. Finally, we propose a series of possible explanations for this surprising change in hesitation marker usage that is currently taking place across Germanic languages.
Abstract:
Information technology has increased both the speed and medium of communication between nations. It has brought the world closer, but it has also created new challenges for translation — how we think about it, how we carry it out and how we teach it. Translation and Information Technology has brought together experts in computational linguistics, machine translation, translation education, and translation studies to discuss how these new technologies work, the effect of electronic tools, such as the internet, bilingual corpora, and computer software, on translator education and the practice of translation, as well as the conceptual gaps raised by the interface of human and machine.
Abstract:
Corpus Linguistics is a young discipline. The earliest work was done in the 1960s, but corpora only began to be widely used by lexicographers and linguists in the late 1980s, by language teachers in the late 1990s, and by language students only very recently. This course in corpus linguistics was held at the Departamento de Linguistica Aplicada, E.T.S.I. de Minas, Universidad Politecnica de Madrid from June 15-19 1998. About 45 teachers registered for the course. 30% had PhDs in linguistics, 20% in literature, and the rest were doctorandi or qualified English teachers. The course was designed to introduce the use of corpora and other computational resources in teaching and research, with special reference to scientific and technological discourse in English. Each participant had a computer networked with the lecturer’s machine, whose display could be projected onto a large screen. Application programs were loaded onto the central server, and telnet and a web browser were available. COBUILD gave us permission to access the 323 million word Bank of English corpus, Mike Scott allowed us to use his Wordsmith Tools software, and Tim Johns gave us a copy of his MicroConcord program.
Abstract:
Corpora (large collections of written and/or spoken text stored and accessed electronically) provide the means of investigating language that is of growing importance academically and professionally. Corpora are now routinely used in the following fields:
• the production of dictionaries and other reference materials;
• the development of aids to translation;
• language teaching materials;
• the investigation of ideologies and cultural assumptions;
• natural language processing; and
• the investigation of all aspects of linguistic behaviour, including vocabulary, grammar and pragmatics.
Abstract:
A word may have many potential meanings, but its actual meaning in any authentic written or spoken text is determined by its context: its collocations, structural patterns, and pragmatic functions. Large language corpora offer access to words in a wide range of natural contexts, which can improve and enrich both language learning and teaching.
Abstract:
J R Firth first gave prominence to collocation in linguistic theory. Halliday, Sinclair, Stubbs, and Hoey have all extended Firth's ideas. Palmer and Hornby recognized the pedagogical value of collocation, and incorporated it into their early EFL dictionaries. More recent EFL dictionaries, based on large, computerized language corpora, have used complex software and statistical measures to gain further insights into the way that collocational patterns are woven into language, and the results are visible in the dictionary entries of later editions. This has fed back into language pedagogy, and is also influencing translation and computational research. © 2006 Elsevier Ltd. All rights reserved.
Abstract:
Native speakers learn their mother tongue slowly, from birth, by the constant repetition of common words and phrases in a variety of contexts and situations, within the language community. As foreign language learners, we face considerable disadvantages when compared to children learning their mother tongue. Foreign language learners start later in life, have less time, have fewer opportunities to experience the language, and learn in the restricted environment of the classroom. Teachers and books give us information about many words and phrases, but it is difficult for us to know what we need to focus on and learn thoroughly, and what is less important. The rules and explanations are often difficult for us to understand. A large language corpus represents roughly the amount and variety of language that a native-speaker experiences in a whole lifetime. Learners can now access language corpora. We can check which words and phrases are important, and quickly discover their common meanings, collocations, and structural patterns. It is easier to remember things that we find out ourselves, rather than things that teachers or books tell us. Each click on the computer keyboard can show us the same information in different ways, so we can understand it more easily. We can also get many more examples from a corpus. Teachers and native-speakers can also use corpora, to confirm and enhance their own knowledge of a language, and prepare exercises to guide their students. Each of us can learn at our own level and at our own speed.
Abstract:
The attention of linguists has increasingly shifted from grammar to lexis. Collocation has emerged as a key feature of lexis. Research using large language corpora has not only helped to identify the significant collocates of individual words but also to confirm the importance of collocation in the language system. John Sinclair has suggested that language operates on two principles: open choice and idiom. If so, then collocation would appear to be the minimal level of idiomaticity. One problem with collocation is that words that habitually co-occur form less distinct, often discontinuous, idiomatic units, whereas grammar generally works with more precisely delineated and contiguous structural units. This paper uses examples from corpus evidence to look at various aspects of collocation.
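As an illustration of how corpus evidence can surface significant collocates, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). This is an assumption for illustration only: the abstract does not name any particular measure, the function `bigram_pmi` and the toy text are hypothetical, and the sketch covers only contiguous pairs, not the discontinuous units the paper discusses.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw counts. Pairs seen fewer than min_count times
    are skipped, since PMI is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy text; a real study would use a corpus of many millions of words.
text = "strong tea and strong tea but powerful computer and powerful computer"
scores = bigram_pmi(text.split())
for pair, pmi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(pmi, 2))
```

A positive score means the pair co-occurs more often than its parts would by chance. Capturing discontinuous collocates would require counting co-occurrence within a window of several words rather than strict adjacency.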
Abstract:
Past tense forms are a stumbling block in learning French as a foreign language; even the most advanced learners fail to master their use. While the lack of one-to-one equivalence between the tenses of different languages is an obvious difficulty, the semantic and distributional complexity of the French tense forms (tiroirs) should not be overlooked. Grammarians and linguists have endeavoured to provide descriptions of the past tense forms, but their work, like that of specialists in language teaching, has proved uneven. The contributions collected in this volume invite critical reflection on existing descriptions of the tenses and on approaches to teaching them. They consider the structure of the system and the make-up of the tense forms from synchronic, evolutionary and contrastive points of view, drawing on corpora of different varieties of French. The question of teaching these notions to learners of French as a foreign language and as a mother tongue is also considered in the various contexts of acquisition. This volume is offered in the spirit of an increasingly necessary dialogue between application and modelling, and will be of interest to practitioners and theorists alike.
Abstract:
The aim of this research project is to compare published history textbooks written for upper-secondary/tertiary study in the U.S. and Spain using Halliday's (1994) Theme/Rheme construct. The motivation for using the Theme/Rheme construct to analyze professional texts in the two languages is two-fold. First of all, while there exists a multitude of studies at the grammatical and phonological levels between the two languages, very little analysis has been carried out in comparison at the level of text, beyond that of comparing L1/L2 student writing. Secondly, thematic considerations allow the analyst to highlight areas of textual organization in a systematic way for purposes of comparison. The basic hypothesis tested here rests on the premise that similarity in the social function of the texts results in similar Theme choice and thematic patterning across languages, barring certain linguistic constraints. The corpus for this study consists of 20 texts: 10 from various history textbooks published in the U.S. and 10 from various history textbooks published in Spain. The texts chosen represent a variety of authors, in order to control for author style or preference. Three overall areas of analysis were carried out, representing Halliday's (1994) three metafunctions: the ideational, the interpersonal and the textual. The ideational analysis shows similarities across the two corpora in terms of participant roles and circumstances as Theme, with a slight difference in participants involved in material processes, which is shown to reflect a minor difference in the construal of the field of history in the two cultures. The textual analysis shows overall similarities with respect to text organization, and the interpersonal analysis shows overall similarities as regards the downplay of discrepant interpretations of historical events as well as a low frequency of interactive textual features, manifesting the informational focus of the texts. 
At the same time, differences in results amongst texts within each of the corpora demonstrate a possible effect of subject matter in many cases, and of individual author style in others. Overall, the results confirm that similarity in content, but above all in purpose and audience, results in texts which show similarities in textual features, setting aside certain grammatical constraints.
Abstract:
Despite the growth of spoken academic corpora in recent years, relatively little is known about the language of seminar discussions in higher education. This thesis compares seminar discussions across three disciplinary areas. The aim of this thesis is to uncover the functions and patterns of talk used in different disciplinary discussions and to highlight language on a macro and micro level that would be useful for materials design and teaching purposes. A framework for identifying and analysing genres in spoken language based on Hallidayan Systemic Functional Linguistics (SFL) is used. Stretches of talk sharing a similar purpose and predictable functional staging, termed Discussion Macro Genres (DMGs), are identified. Language is compared across DMGs and across disciplines through use of corpus techniques in conjunction with SFL genre theory. Data for the study comprises just over 180,000 tokens and is drawn from the British Academic Spoken English corpus (BASE), recorded at two universities in the UK. The discipline areas investigated are Arts and Humanities, Social Sciences and Physical Sciences. Findings from this study make theoretical, empirical and methodological contributions to the field of spoken EAP. The empirical findings are, firstly, that the majority of the seminar discussion can be assigned to one of the three main DMGs in the corpus: Responding, Debating and Problem Solving. Secondly, the study characterises each discipline area according to two DMGs. Thirdly, the majority of the discussion is non-oppositional in nature, suggesting that ‘debate’ is not the only form of discussion that students need to be prepared for. Finally, while some characteristics of the discussion are tied to the DMG and common across disciplines, others are discipline specific.
On a theoretical level, this study shows that an SFL genre model for investigating spoken discourse can be successfully extended to investigate longer stretches of discourse than have previously been identified. The methodological contribution is to demonstrate how corpus techniques can be combined with SFL genre theory to investigate extended stretches of spoken discussion. The thesis will be of value to those working in the field of teaching spoken EAP/ESAP as well as to materials developers.
Abstract:
Sentiment analysis is concerned with automatically identifying the sentiment or opinion expressed in a given piece of text. Most prior work either uses prior lexical knowledge, defined as the sentiment polarity of words, or views the task as a text classification problem and relies on labeled corpora to train a sentiment classifier. While lexicon-based approaches do not adapt well to different domains, corpus-based approaches require expensive manual annotation effort. In this paper, we propose a novel framework where an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon, with preferences on expectations of sentiment labels of those lexicon words being expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatic domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie-review data and the multi-domain sentiment dataset show that our approach attains comparable or better performance than existing weakly-supervised sentiment classification methods despite using no labeled documents.
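The pseudo-labeling step can be sketched roughly as follows. This is a heavily simplified illustration, not the paper's method: the initial classifier here is a plain lexicon vote rather than a model trained with generalized expectation criteria, and the lexicon, threshold, function names, and documents are all hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}

def lexicon_score(tokens):
    """Mean polarity of the lexicon words in a document (0.0 if none)."""
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def pseudo_label(docs, threshold=1.0):
    """Keep only documents whose lexicon score is confidently polar."""
    labeled = []
    for doc in docs:
        tokens = doc.lower().split()
        score = lexicon_score(tokens)
        if abs(score) >= threshold:
            labeled.append((tokens, "pos" if score > 0 else "neg"))
    return labeled

def word_class_distributions(labeled):
    """Estimate P(pos | word) for every word in the pseudo-labeled documents."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in labeled:
        counts[label].update(tokens)
    vocab = set(counts["pos"]) | set(counts["neg"])
    return {
        w: counts["pos"][w] / (counts["pos"][w] + counts["neg"][w])
        for w in vocab
    }

docs = [
    "great plot and gripping pacing",
    "awful script and dull pacing",
    "good cast but bad ending",   # mixed polarity, so it is not pseudo-labeled
    "gripping and great",
]
labeled = pseudo_label(docs)
dist = word_class_distributions(labeled)
print(dist["gripping"], dist["dull"], dist["pacing"])
```

Words such as "gripping" and "dull" are absent from the lexicon but acquire a class distribution from the pseudo-labeled documents, which is the sense in which domain-specific features are self-learned. In the framework described above, these distributions would then constrain a second classifier's predictions on unlabeled data, a step this sketch omits.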