885 results for "Compilación de corpus" (corpus compilation)
Abstract:
Guaranteeing the quality of extracted features that describe relevant knowledge to users or topics is a challenge because of the large number of extracted features. Most popular existing term-based feature selection methods suffer from noisy feature extraction: many of the extracted features are irrelevant to the user's needs. One popular method is to extract phrases or n-grams to describe the relevant knowledge. However, extracted n-grams and phrases usually contain a lot of noise. This paper proposes a method for reducing the noise in n-grams. The method first extracts more specific features (terms) to remove noisy features. The method then uses an extended random set to accurately weight n-grams based on their distribution in the documents and the distribution of their terms in n-grams. The proposed approach not only reduces the number of extracted n-grams but also improves the performance. The experimental results on the Reuters Corpus Volume 1 (RCV1) data collection and TREC topics show that the proposed method significantly outperforms the state-of-the-art methods underpinned by Okapi BM25, tf*idf and Rocchio.
Abstract:
Topic modelling has been widely used in fields such as information retrieval, text mining and machine learning. In this paper, we propose a novel model, the Pattern Enhanced Topic Model (PETM), which improves topic modelling by semantically representing topics with discriminative patterns. It also contributes to information filtering by using the PETM to determine document relevance based on topic distribution and the maximum matched patterns proposed in this paper. Extensive experiments are conducted to evaluate the effectiveness of PETM by using the TREC data collection Reuters Corpus Volume 1. The results show that the proposed model significantly outperforms both state-of-the-art term-based models and pattern-based models.
Abstract:
Chinese modal particles feature prominently in Chinese people’s daily use of the language, but their pragmatic and semantic functions are elusive, as is commonly recognised by Chinese linguists and teachers of Chinese as a foreign language. This book originates from an extensive and intensive empirical study of the Chinese modal particle a (啊), one of the most frequently used modal particles in Mandarin Chinese. In order to capture all the uses and the underlying meanings of the particle, the author transcribed the first 20 episodes, about 20 hours in length, of the popular Chinese TV drama series Kewang ‘Expectations’, which yielded a corpus of more than 142,000 Chinese characters with a total of 1829 instances of the particle, all used in meaningful communicative situations. Within its context of use, every single occurrence of the particle was analysed in terms of its pragmatic and semantic contributions to the hosting utterance. On this basis, the core meanings were identified, which were seen as constituting the modal nature of the particle.
Abstract:
Many mature term-based or pattern-based approaches have been used in the field of information filtering to generate users’ information needs from a collection of documents. A fundamental assumption for these approaches is that the documents in the collection are all about one topic. However, in reality users’ interests can be diverse and the documents in the collection often involve multiple topics. Topic modelling, such as Latent Dirichlet Allocation (LDA), was proposed to generate statistical models to represent multiple topics in a collection of documents, and it has been widely used in fields such as machine learning and information retrieval. However, its effectiveness in information filtering has not been well explored. Patterns are generally thought to be more discriminative than single terms for describing documents. However, the enormous number of discovered patterns hinders their effective and efficient use in real applications; selecting the most discriminative and representative patterns from the huge number of discovered patterns therefore becomes crucial. To deal with the above-mentioned limitations and problems, this paper proposes a novel information filtering model, the Maximum matched Pattern-based Topic Model (MPBTM). The main distinctive features of the proposed model are: (1) user information needs are generated in terms of multiple topics; (2) each topic is represented by patterns; (3) patterns are generated from topic models and are organized in terms of their statistical and taxonomic features; and (4) the most discriminative and representative patterns, called Maximum Matched Patterns, are proposed to estimate the document relevance to the user’s information needs in order to filter out irrelevant documents. Extensive experiments are conducted to evaluate the effectiveness of the proposed model by using the TREC data collection Reuters Corpus Volume 1.
The results show that the proposed model significantly outperforms both state-of-the-art term-based models and pattern-based models.
Abstract:
Term-based approaches can extract many features from text documents, but most of these features include noise. Many popular text-mining strategies have been adapted to reduce noisy information in extracted features; however, text-mining techniques suffer from the low-frequency problem. The key issue is how to discover relevance features in text documents to fulfil user information needs. To address this issue, we propose a new method to extract specific features from user relevance feedback. The proposed approach includes two stages. The first stage extracts topics (or patterns) from text documents to focus on interesting topics. In the second stage, topics are deployed to lower-level terms to address the low-frequency problem and find specific terms. The specific terms are determined based on their appearances in relevance feedback and their distribution in topics or high-level patterns. We test our proposed method with extensive experiments on the Reuters Corpus Volume 1 dataset and TREC topics. Results show that our proposed approach significantly outperforms the state-of-the-art models.
Abstract:
In this introductory chapter (Schmeinck, D. and Lidstone, J. (2014) “Current trends and issues in geographical education”, in Schmeinck, D. and Lidstone, J. (Eds) (2014) Standards and Research in Geographical Education: Current Trends and International Issues. Berlin: Mensch und Buch Verlag, pp. 5-16), the authors review and analyse eleven papers originally presented to the Congress of the International Geographical Union held in Cologne in 2012. Taking the collection of papers as a single corpus representing the “state of the art” of geography education, they applied lexical and bibliometric analyses in an innovative attempt to identify the nature of geographical education as represented by this anthology of peer-reviewed chapters presented at the start of the second decade of the twenty-first century.
Abstract:
Objective: To develop a system for the automatic classification of pathology reports for Cancer Registry notifications. Method: A two-pass approach is proposed to classify whether pathology reports are cancer notifiable or not. The first pass queries pathology HL7 messages for known report types that are received by the Queensland Cancer Registry (QCR), while the second pass aims to analyse the free-text reports and identify those that are cancer notifiable. Cancer Registry business rules, natural language processing and symbolic reasoning using the SNOMED CT ontology were adopted in the system. Results: The system was developed on a corpus of 500 histology and cytology reports (with 47% notifiable reports) and evaluated on an independent set of 479 reports (with 52% notifiable reports). Results show that the system can reliably classify cancer notifiable reports with a sensitivity, specificity, and positive predictive value (PPV) of 0.99, 0.95, and 0.95, respectively, for the development set, and 0.98, 0.96, and 0.96 for the evaluation set. High sensitivity can be achieved at a slight expense in specificity and PPV. Conclusion: The system demonstrates how medical free-text processing enables the classification of cancer notifiable pathology reports with high reliability for potential use by Cancer Registries and pathology laboratories.
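The three measures reported above have standard definitions in terms of confusion-matrix counts. The following is a minimal sketch of those definitions only; the function name and the counts passed to it are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, PPV) from confusion-matrix counts.

    tp/fn: notifiable reports classified as notifiable / not notifiable
    tn/fp: non-notifiable reports classified as not notifiable / notifiable
    """
    sensitivity = tp / (tp + fn)   # share of notifiable reports that were found
    specificity = tn / (tn + fp)   # share of non-notifiable reports correctly rejected
    ppv = tp / (tp + fp)           # share of flagged reports that are truly notifiable
    return sensitivity, specificity, ppv


# Hypothetical counts, chosen only to illustrate the arithmetic.
sens, spec, ppv = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f}")
```

Lowering the decision threshold of a classifier typically moves counts from fn to tp and from tn to fp, which is the trade-off between sensitivity and specificity/PPV that the abstract mentions.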
Abstract:
The most significant recent development in scholarly publishing is the open-access movement, which seeks to provide free online access to scholarly literature. Though this movement is well developed in scientific and medical disciplines, American law reviews are almost completely unaware of the possibilities of open-access publishing models. This Essay explains how open-access publishing works and why it is important, and makes the case for its widespread adoption by law reviews. It also reports on a survey of law review publication policies conducted in 2004. This survey shows, inter alia, that few law reviews have embraced the opportunities of open-access publishing, and many of the top law reviews are acting as stalking horses for the commercial interests of legal database providers. The open-access model promises greater access to legal scholarship, wider readership for law reviews, and reputational benefits for law reviews and the law schools that house them. This Essay demonstrates how open access comports with the institutional aims of law schools and law reviews, and is better suited to the unique environment of legal publishing than the model that law reviews currently pursue. Moreover, the institutional structure of law reviews means that it is possible that the entire corpus of law reviews could easily move to an open-access model, making law the first discipline with a realistic prospect of complete commitment to free, open access to all scholarly output.
Abstract:
The aim of this thesis is to show how character analysis can be used to approach conceptions of saga authorship in medieval Iceland. The idea of possession is a metaphor that is adopted early in the thesis, and is used to describe Icelandic sagas as works in which traditional material is subtly interpreted by medieval authors. For example, we can say that if authors claim greater possession of the sagas, they interpret, and not merely record, the sagas' historical information. On the other hand, tradition holds onto its possession of the narrative whenever it is not possible for an author to develop his own creative and historical interests. The metaphor of possession also underpins the character analysis in the thesis, which is based on the idea that saga authors used characters as a vehicle by which to possess saga narratives and so develop their own historical interests. The idea of possession signals the kinds of problems of authorship study which are addressed here, in particular, the question of the authors' sense of saga writing as an act either of preservation or of creation. While, in that sense, the thesis represents an additional voice in a long-standing debate about the saga writers' relation to their source materials, I argue against a clear-cut distinction between creative and non-creative authors, and focus instead on the wide variation in authorial control over saga materials. This variation suggests that saga authorship is a multi-functional activity, or one which co-exists with tradition. Further, by emphasising characterisation as a method, I am adding to the weight of scholarship that seeks to understand the sagas in terms of their literary effects. The Introduction and chapter one lay out the theoretical scope of this thesis. 
My aim in these first two sections is to inform the reader of the type of critical questions that arise when authorship is approached in relation to characterisation, and to suggest an interpretive framework with which to approach these questions. In the Introduction this aim manifests as a brief discussion of the application of the term "authorship" to the medieval Icelandic corpus, a definition of the scope of this study, and an introduction to the connections, made throughout this thesis, between saga authors, the sagas' narrative style, and the style of characterisation in the sagas. Chapter one is a far more detailed discussion of our ability to make these connections. In particular, the chapter develops the definition of the analytical term "secondary authorship" that I introduce in order to delineate the type of characterisation that is of most interest in this thesis. "Secondary authorship" is a literary term that aims to sharpen our approach to saga authors' relationship to their characters by focusing on characters who make representations about the events of the saga. The term refers to any instance in which characters behave in a manner that resembles the creativity, interpretation, and understanding associated with authorship more generally. Character analysis cannot, however, be divorced from socio-historical approaches to the saga corpus. Most importantly, the sagas themselves are socio-historical representations that claim some degree of truth value. This claim that the sagas make by implication about their historicity is the starting point of a discussion of authorship in medieval Iceland. Therefore, at the beginning of chapter one I discuss some of the approaches to the social context of saga writing. 
This discussion serves as an introduction to both the culture of saga writing in medieval Iceland and to the nature of the sagas' historical perspective, and reflects my sense that literary interpretations of the sagas cannot be isolated from the historical discourses that frame them. The chapter also discusses possession, which, as I note above, is used alongside the concept of secondary authorship to describe the saga authors' relationship with the stories and characters of the past. At the close of chapter one, I offer a preliminary list of the various functions of saga authorship, and give some examples of secondary authorship. From this point I am able to tie my argument about secondary authorship to specific examples from the sagas. Chapter two examines the effect of family obligations and domestic points of view in the depiction of characters' choices and conception of themselves. The examples that are given in that chapter - from Gísla saga Súrssonar and Íslendinga saga - are the first of a number of textual analyses that demonstrate the application of the concepts of secondary authorship and possession of saga narratives. The relationship between narratives about national and domestic matters shows how authorial creativity in the area of kinship obligation provides the basis for the saga's development of historical themes. Thus, the two major case studies given in chapter two tie authorial engagement with characters to the most influential social institution in early and medieval Iceland, the family. The remaining chapters represent similar attempts to relate authorial possession of saga characters to central socio-historical themes in the sagas, such as the settlement process in early Iceland and its influence on the development of regional political life (chapter three).
Likewise, the strong authorial interest in an Icelander's journey to Norway in Heimskringla is presented as evidence of the author's use of a saga character to express an Icelandic interpretation of Norwegian history and to promote a sense that Iceland shared the ownership of regal history with Norway (chapter four). In that authorial engagement with the Icelander abroad, we witness saga characterisation being used as a basis for historical interpretation and the means by which foreign traditions and influence, not least the narratives of royal lives and of the Christianisation, are claimed as part of medieval Icelanders' self-conception. While saga authors observe the conventions of saga narration, characters are often subtly positioned as the authors' interpretive mirrors, especially clearly when they act as secondary authors. Nowhere is this more apparent than in Brennu-Njáls saga, which contains many characters who voice the author's claim to interpret the past. Even Hrútr Herjólfsson, through his remarkable perception of events and his conspicuous comments about them, acts as a secondary author by enabling the author to emphasise the importance of the disposition of characters. In Laxdœla saga and Þorgils saga ok Hafliða, authorial interest in characters' perception is matched by the thematising of learning, from the inception of knowledge as prophecy or advice to complete understanding by saga characters (chapter six). In Þorgils saga skarða, a character's inner development from an excessively ambitious and politically ruthless youth to a Christian leader killed by his kinsman allows the author to shape a political life into a lesson about leadership and the community's ability to moderate and contain the behaviour of extraordinary individuals. The portrayal draws on methods of characterisation that we can identify in Grettis saga Ásmundarson, Fóstbrœðra saga, and Orkneyinga saga.
A comparison of the characterisation of figures with intense political or military ambitions suggests that saga authors were interested in the community's ability to balance their strength and ability with a degree of social moderation. The discussion of these sagas shows that character study can be used to analyse how the saga authors added their own voice to the voices passed down to medieval Icelanders in traditional narratives. Authorial engagement with characters allowed inherited traditions about early Norway and Iceland and records of thirteenth-century events to be transformed into sophisticated historical works with highly creative elements. Through secondary authorship, saga authors took joint possession of narratives and contested the power of tradition in setting the interpretive framework of a saga.
Abstract:
Semantic space models of word meaning derived from co-occurrence statistics within a corpus of documents, such as the Hyperspace Analogous to Language (HAL) model, have been proposed in the past. While word similarity can be computed using these models, it is not clear how semantic spaces derived from different sets of documents can be compared. In this paper, we focus on this problem, and we revisit the proposal of using semantic subspace distance measurements [1]. In particular, we outline the research questions that still need to be addressed to investigate and validate these distance measures. Then, we describe our plans for future research.
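As a rough illustration of how a HAL-style semantic space is derived from co-occurrence statistics, the sketch below slides a window over a token list and gives nearer preceding words a larger weight. It assumes the commonly described weighting `window - distance + 1`; the function name `hal_matrix`, the window size and the toy sentence are hypothetical, and details of any particular HAL implementation may differ.

```python
from collections import defaultdict


def hal_matrix(tokens, window=5):
    """Build HAL-style co-occurrence weights from a list of tokens.

    The result maps (target, preceding_word) to an accumulated weight.
    A word directly before the target contributes `window`; a word
    `window` positions back contributes 1 (assumed weighting scheme).
    """
    weights = defaultdict(float)
    for i, target in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break  # reached the start of the text
            weights[(target, tokens[j])] += window - distance + 1
    return dict(weights)


# Toy corpus, for illustration only.
tokens = "the cat sat on the mat".split()
space = hal_matrix(tokens, window=2)
for (target, context), weight in sorted(space.items()):
    print(f"{target:>4} <- {context:<4} {weight}")
```

Each word's row of weights can then be treated as its vector in the semantic space; two corpora yield two such spaces, which is where the comparison problem discussed in the abstract arises.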
Abstract:
Semantic Space models, which provide a numerical representation of words’ meaning extracted from a corpus of documents, have been formalized in terms of Hermitian operators over real-valued Hilbert spaces by Bruza et al. [1]. The collapse of a word into a particular meaning has been investigated by applying the notion of quantum collapse of superpositional states [2]. While the semantic association between words in a Semantic Space can be computed by means of the Minkowski distance [3] or the cosine of the angle between the vector representation of each pair of words, a new procedure is needed in order to establish relations between two or more Semantic Spaces. We address the question: how can the distance between different Semantic Spaces be computed? By representing each Semantic Space as a subspace of a more general Hilbert space, the relationship between Semantic Spaces can be computed by means of the subspace distance. Such a distance needs to take into account the difference in the dimensions between subspaces. The availability of a distance for comparing different Semantic Subspaces would enable a deeper understanding of the geometry of Semantic Spaces, which could translate into better effectiveness in Information Retrieval tasks.
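The two word-level measures named above, the Minkowski distance and the cosine of the angle between word vectors, can be sketched as follows. This is a minimal illustration using plain Python lists; the function names and the example vectors are hypothetical, and the sketch covers only the comparison of words within one space.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def minkowski_distance(u, v, p=2):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


# Hypothetical word vectors, for illustration only.
word_a = [3.0, 4.0]
word_b = [0.0, 0.0]
word_c = [6.0, 8.0]
print(minkowski_distance(word_a, word_b, p=2))  # Euclidean distance
print(minkowski_distance(word_a, word_b, p=1))  # Manhattan distance
print(cosine_similarity(word_a, word_c))        # parallel vectors
```

Both measures presuppose that the two vectors live in the same space with the same dimensions, which is why they do not carry over directly to comparing whole Semantic Spaces.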
Abstract:
This paper presents our system to address the CogALex-IV 2014 shared task of identifying a single word most semantically related to a group of 5 words (queries). Our system uses an implementation of a neural language model and identifies the answer word by finding the most semantically similar word representation to the sum of the query representations. It is a fully unsupervised system which learns on around 20% of the UkWaC corpus. It correctly identifies 85 exact targets out of 2,000 queries, and 285 approximate targets in lists of 5 suggestions.
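The answer-selection step described above (sum the query representations, then pick the most similar word representation) can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: cosine similarity is assumed as the similarity measure, the tiny two-dimensional embeddings are made up, and the names `most_related` and `toy_embeddings` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_related(query_words, embeddings):
    """Return the vocabulary word closest to the sum of the query vectors.

    Query words themselves are excluded from the candidates.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in query_words:
        for k, value in enumerate(embeddings[word]):
            total[k] += value
    candidates = [w for w in embeddings if w not in query_words]
    return max(candidates, key=lambda w: cosine(embeddings[w], total))


# Made-up embeddings standing in for a trained neural language model.
toy_embeddings = {
    "wheel": [1.0, 0.1], "engine": [0.9, 0.2], "road": [0.8, 0.0],
    "tyre": [0.95, 0.1], "driver": [0.85, 0.25],
    "car": [1.0, 0.15], "banana": [0.0, 1.0],
}
queries = ["wheel", "engine", "road", "tyre", "driver"]
print(most_related(queries, toy_embeddings))
```

Returning the top five candidates instead of the single best one would correspond to the "lists of 5 suggestions" setting mentioned in the abstract.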
Abstract:
In this paper we propose a novel scheme for carrying out speaker diarization in an iterative manner. We aim to show that the information obtained through the first pass of speaker diarization can be reused to refine and improve the original diarization results. We call this technique speaker rediarization and demonstrate the practical application of our rediarization algorithm using a large archive of two-speaker telephone conversation recordings. We use the NIST 2008 SRE summed telephone corpora for evaluating our speaker rediarization system. This corpus contains recurring speaker identities across independent recording sessions that need to be linked across the entire corpus. We show that our speaker rediarization scheme can take advantage of inter-session speaker information, linked in the initial diarization pass, to achieve a 30% relative improvement over the original diarization error rate (DER) after only two iterations of rediarization.
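A relative improvement in DER is measured against the baseline error rate rather than in absolute percentage points. The sketch below shows the arithmetic only; the function name and the DER values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_der, new_der):
    """Fraction by which the error rate dropped, relative to the baseline."""
    return (baseline_der - new_der) / baseline_der


# Hypothetical error rates: a drop from 20% DER to 14% DER is
# 6 points absolute but a 30% relative improvement.
rel = relative_improvement(baseline_der=20.0, new_der=14.0)
print(f"{rel:.0%} relative improvement")
```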
Abstract:
In this paper we present a novel scheme for improving speaker diarization by making use of repeating speakers across multiple recordings within a large corpus. We call this technique speaker re-diarization and demonstrate that it is possible to reuse the initial speaker-linked diarization outputs to boost diarization accuracy within individual recordings. We first propose and evaluate two novel re-diarization techniques. We demonstrate their complementary characteristics and fuse the two techniques to successfully conduct speaker re-diarization across the SAIVT-BNEWS corpus of Australian broadcast data. This corpus contains recurring speakers in various independent recordings that need to be linked across the dataset. We show that our speaker re-diarization approach can provide a relative improvement of 23% in diarization error rate (DER), over the original diarization results, as well as improve the estimated number of speakers and the cluster purity and coverage metrics.
Abstract:
We present a novel method for improving hierarchical speaker clustering in the tasks of speaker diarization and speaker linking. In hierarchical clustering, a tree can be formed that demonstrates various levels of clustering. We propose a ratio that expresses the impact of each cluster on the formation of this tree and use this to rescale cluster scores. This provides score normalisation based on the impact of each cluster. We use a state-of-the-art speaker diarization and linking system across the SAIVT-BNEWS corpus to show that our proposed impact ratio can provide a relative improvement of 16% in diarization error rate (DER).