886 resultados para Corpus Lexicography
Resumo:
This paper discusses three important aspects of John Sinclair’s legacy: the corpus, lexicography, and the notion of ‘corpus-driven’. The corpus represents his concern with the nature of linguistic evidence. Lexicography is for him the canonical mode of language description at the lexical level. And his belief that the corpus should ‘drive’ the description is reflected in his constant attempts to utilize the emergent computer technologies to automate the initial stages of analysis and defer the intuitive, interpretative contributions of linguists to increasingly later stages in the process. Sinclair’s model of corpus-driven lexicography has spread far beyond its initial implementation at Cobuild, to most EFL dictionaries, to native-speaker dictionaries (e.g. the New Oxford Dictionary of English, and many national language dictionaries in emerging or re-emerging speech communities) and bilingual dictionaries (e.g. Collins, Oxford-Hachette).
Resumo:
Presentation for the 5th International Conference on Corpus Linguistics (CILC 2013), V Congreso Internacional de Lingüistica de Corpus.
Resumo:
In this dissertation, I present an overall methodological framework for studying linguistic alternations, focusing specifically on lexical variation in denoting a single meaning, that is, synonymy. As the practical example, I employ the synonymous set of the four most common Finnish verbs denoting THINK, namely ajatella, miettiä, pohtia and harkita ‘think, reflect, ponder, consider’. As a continuation to previous work, I describe in considerable detail the extension of statistical methods from dichotomous linguistic settings (e.g., Gries 2003; Bresnan et al. 2007) to polytomous ones, that is, concerning more than two possible alternative outcomes. The applied statistical methods are arranged into a succession of stages with increasing complexity, proceeding from univariate via bivariate to multivariate techniques in the end. As the central multivariate method, I argue for the use of polytomous logistic regression and demonstrate its practical implementation to the studied phenomenon, thus extending the work by Bresnan et al. (2007), who applied simple (binary) logistic regression to a dichotomous structural alternation in English. The results of the various statistical analyses confirm that a wide range of contextual features across different categories are indeed associated with the use and selection of the selected think lexemes; however, a substantial part of these features are not exemplified in current Finnish lexicographical descriptions. The multivariate analysis results indicate that the semantic classifications of syntactic argument types are on the average the most distinctive feature category, followed by overall semantic characterizations of the verb chains, and then syntactic argument types alone, with morphological features pertaining to the verb chain and extra-linguistic features relegated to the last position. In terms of overall performance of the multivariate analysis and modeling, the prediction accuracy seems to reach a ceiling at a Recall rate of roughly two-thirds of the sentences in the research corpus. The analysis of these results suggests a limit to what can be explained and determined within the immediate sentential context and applying the conventional descriptive and analytical apparatus based on currently available linguistic theories and models. The results also support Bresnan’s (2007) and others’ (e.g., Bod et al. 2003) probabilistic view of the relationship between linguistic usage and the underlying linguistic system, in which only a minority of linguistic choices are categorical, given the known context – represented as a feature cluster – that can be analytically grasped and identified. Instead, most contexts exhibit degrees of variation as to their outcomes, resulting in proportionate choices over longer stretches of usage in texts or speech.
Resumo:
This paper asserts the increasing importance of academic English in an increasingly Anglophone world, and looks at the differences between academic English and general English, especially in terms of vocabulary. The creation of wordlists has played an important role in trying to establish the academic English lexicon, but these wordlists are not based on appropriate data, or are implemented inappropriately. There is as yet no adequate dictionary of academic English, and this paper reports on new efforts at Aston University to create a suitable corpus on which such a dictionary could be based.
Resumo:
Este trabajo presenta la metodología empleada para compilar un corpus económico e identificar su terminología con el fin de crear un glosario de utilidad en la formación de traductores. Por una parte, se repasa brevemente la bibliografía sobre compilación de corpus y explotación con fines terminológicos. Por otra parte, se presenta la metodología en cuestión, así como una serie de actividades enfocadas a la adquisición de conocimiento especializado en economía. Los resultados muestran que las técnicas usadas para detectar términos y extraer automáticamente candidatos a término, si bien no terminan de adecuarse a las necesidades concretas del presente trabajo, son de utilidad e incluso pueden complementarse. Por su parte, las actividades propuestas pueden sumarse igualmente a otro tipo de actividades y modificarse según el contexto docente.
Resumo:
University students encounter difficulties with academic English because of its vocabulary, phraseology, and variability, and also because academic English differs in many respects from general English, the language which they have experienced before starting their university studies. Although students have been provided with many dictionaries that contain some helpful information on words used in academic English, these dictionaries remain focused on the uses of words in general English. There is therefore a gap in the dictionary market for a dictionary for university students, and this thesis provides a proposal for such a dictionary (called the Dictionary of Academic English; DOAE) in the form of a model which depicts how the dictionary should be designed, compiled, and offered to students. The model draws on state-of-the-art techniques in lexicography, dictionary-use research, and corpus linguistics. The model demanded the creation of a completely new corpus of academic language (Corpus of Academic Journal Articles; CAJA). The main advantages of the corpus are its large size (83.5 million words) and balance. Having access to a large corpus of academic language was essential for a corpus-driven approach to data analysis. A good corpus balance in terms of domains enabled a detailed domain-labelling of senses, patterns, collocates, etc. in the dictionary database, which was then used to tailor the output according to the needs of different types of student. The model proposes an online dictionary that is designed as an online dictionary from the outset. The proposed dictionary is revolutionary in the way it addresses the needs of different types of student. It presents students with a dynamic dictionary whose contents can be customised according to the user's native language, subject of study, variant spelling preferences, and/or visual preferences (e.g. black and white).
Resumo:
Si se pretende elaborar un diccionario de adjetivos, ya sea este monolingüe o bilingüe, la primera tarea que se le impone al lexicógrafo es la de definir qué es un adjetivo, una cuestión que todavía hoy no ha sido resuelta satisfactoriamente. En alemán hay una serie de palabras que han sido descritas tradicionalmente como adjetivos en función exclusivamente predicativa, cuyo estatus como adjetivos es, sin embargo, cuestionado por algunos autores. En este artículo se trata de dilucidar si estas palabras realmente solo pueden aparecer en función predicativa, cómo se las describe en diccionarios y gramáticas y cuáles son sus principales correspondencias en español, a fin de decidir si deberían ser incluidas en un corpus destinado a la elaboración de un diccionario sintáctico de adjetivos alemán-español.
Resumo:
The QUT-NOISE-TIMIT corpus consists of 600 hours of noisy speech sequences designed to enable a thorough evaluation of voice activity detection (VAD) algorithms across a wide variety of common background noise scenarios. In order to construct the final mixed-speech database, a collection of over 10 hours of background noise was conducted across 10 unique locations covering 5 common noise scenarios, to create the QUT-NOISE corpus. This background noise corpus was then mixed with speech events chosen from the TIMIT clean speech corpus over a wide variety of noise lengths, signal-to-noise ratios (SNRs) and active speech proportions to form the mixed-speech QUT-NOISE-TIMIT corpus. The evaluation of five baseline VAD systems on the QUT-NOISE-TIMIT corpus is conducted to validate the data and show that the variety of noise available will allow for better evaluation of VAD systems than existing approaches in the literature.
Resumo:
Extracellular matrix regulates many cellular processes likely to be important for development and regression of corpora lutea. Therefore, we identified the types and components of the extracellular matrix of the human corpus luteum at different stages of the menstrual cycle. Two different types of extracellular matrix were identified by electron microscopy; subendothelial basal laminas and an interstitial matrix located as aggregates at irregular intervals between the non-vascular cells. No basal laminas were associated with luteal cells. At all stages, collagen type IV α1 and laminins α5, β2 and γ1 were localized by immunohistochemistry to subendothelial basal laminas, and collagen type IV α1 and laminins α2, α5, β1 and β2 localized in the interstitial matrix. Laminin α4 and β1 chains occurred in the subendothelial basal lamina from mid-luteal stage to regression; at earlier stages, a punctate pattern of staining was observed. Therefore, human luteal subendothelial basal laminas potentially contain laminin 11 during early luteal development and, additionally, laminins 8, 9 and 10 at the mid-luteal phase. Laminin α1 and α3 chains were not detected in corpora lutea. Versican localized to the connective tissue extremities of the corpus luteum. Thus, during the formation of the human corpus luteum, remodelling of extracellular matrix does not result in basal laminas as present in the adrenal cortex or ovarian follicle. Instead, novel aggregates of interstitial matrix of collagen and laminin are deposited within the luteal parenchyma, and it remains to be seen whether this matrix is important for maintaining the luteal cell phenotype.
Resumo:
In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. This document collection is named CJK2E Wikipedia XML corpus. The corpus could be used by the information retrieval research community and knowledge sharing in Wikipedia in many ways; for example, this corpus could be used for experimentations in cross-lingual information retrieval, cross-lingual link discovery, or omni-lingual information retrieval research. Furthermore, the translated CJK articles could be used to further expand the current coverage of the English Wikipedia.
Resumo:
Measures of semantic similarity between medical concepts are central to a number of techniques in medical informatics, including query expansion in medical information retrieval. Previous work has mainly considered thesaurus-based path measures of semantic similarity and has not compared different corpus-driven approaches in depth. We evaluate the effectiveness of eight common corpus-driven measures in capturing semantic relatedness and compare these against human judged concept pairs assessed by medical professionals. Our results show that certain corpus-driven measures correlate strongly (approx 0.8) with human judgements. An important finding is that performance was significantly affected by the choice of corpus used in priming the measure, i.e., used as evidence from which corpus-driven similarities are drawn. This paper provides guidelines for the implementation of semantic similarity measures for medical informatics and concludes with implications for medical information retrieval.