22 resultados para Computational linguistics
Resumo:
It is important to help researchers find valuable papers from a large literature collection. To this end, many graph-based ranking algorithms have been proposed. However, most of these algorithms suffer from the problem of ranking bias. Ranking bias hurts the usefulness of a ranking algorithm because it returns a ranking list with an undesirable time distribution. This paper is a focused study on how to alleviate ranking bias by leveraging the heterogeneous network structure of the literature collection. We propose a new graph-based ranking algorithm, MutualRank, that integrates mutual reinforcement relationships among networks of papers, researchers, and venues to achieve a more synthetic, accurate, and less-biased ranking than previous methods. MutualRank provides a unified model that involves both intra- and inter-network information for ranking papers, researchers, and venues simultaneously. We use the ACL Anthology Network as the benchmark data set and construct the gold standard from computer linguistics course websites of well-known universities and two well-known textbooks. The experimental results show that MutualRank greatly outperforms the state-of-the-art competitors, including PageRank, HITS, CoRank, Future Rank, and P-Rank, in ranking papers in both improving ranking effectiveness and alleviating ranking bias. Rankings of researchers and venues by MutualRank are also quite reasonable.
Resumo:
The use of ontologies as representations of knowledge is widespread but their construction, until recently, has been entirely manual. We argue in this paper for the use of text corpora and automated natural language processing methods for the construction of ontologies. We delineate the challenges and present criteria for the selection of appropriate methods. We distinguish three ma jor steps in ontology building: associating terms, constructing hierarchies and labelling relations. A number of methods are presented for these purposes but we conclude that the issue of data-sparsity still is a ma jor challenge. We argue for the use of resources external tot he domain specific corpus.
Resumo:
Yorick Wilks is a central figure in the fields of Natural Language Processing and Artificial Intelligence. His influence extends to many areas and includes contributions to Machines Translation, word sense disambiguation, dialogue modeling and Information Extraction. This book celebrates the work of Yorick Wilks in the form of a selection of his papers which are intended to reflect the range and depth of his work. The volume accompanies a Festschrift which celebrates his contribution to the fields of Computational Linguistics and Artificial Intelligence. The papers include early work carried out at Cambridge University, descriptions of groundbreaking work on Machine Translation and Preference Semantics as well as more recent works on belief modeling and computational semantics. The selected papers reflect Yorick’s contribution to both practical and theoretical aspects of automatic language processing.
Resumo:
Information technology has increased both the speed and medium of communication between nations. It has brought the world closer, but it has also created new challenges for translation — how we think about it, how we carry it out and how we teach it. Translation and Information Technology has brought together experts in computational linguistics, machine translation, translation education, and translation studies to discuss how these new technologies work, the effect of electronic tools, such as the internet, bilingual corpora, and computer software, on translator education and the practice of translation, as well as the conceptual gaps raised by the interface of human and machine.
Resumo:
This is the first published edition of John Sinclair, Susan Jones and Robert Daley's research on collocation undertaken in 1970. The unpublished report was circulated amongst a small group of academics and was enormously influential, sparking a growth of interest in collocation amongst researchers in linguistics. Collocation was first viewed as important in computational linguistics in the work of Harold Palmer in Japan. Later M.A.K. Halliday and John Sinclair published on collocation in the 1960s. English Collocation Studies is a report on empirical research into collocation, devised by Halliday with Sinclair acting as the Principal Investigator and editor of the resultant OSTI report. The present edition contains an introduction by Professor Wolfgang Teubert based on his interview with John Sinclair. The introduction assesses the extent to which the findings of the original research have developed in the intervening years, and how some of the techniques mentioned in the report were implemented in the COBUILD project at Birmingham University in the 1980s.
Resumo:
We propose a hybrid generative/discriminative framework for semantic parsing which combines the hidden vector state (HVS) model and the hidden Markov support vector machines (HM-SVMs). The HVS model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. The HM-SVMs combine the advantages of the hidden Markov models and the support vector machines. By employing a modified K-means clustering method, a small set of most representative sentences can be automatically selected from an un-annotated corpus. These sentences together with their abstract annotations are used to train an HVS model which could be subsequently applied on the whole corpus to generate semantic parsing results. The most confident semantic parsing results are selected to generate a fully-annotated corpus which is used to train the HM-SVMs. The proposed framework has been tested on the DARPA Communicator Data. Experimental results show that an improvement over the baseline HVS parser has been observed using the hybrid framework. When compared with the HM-SVMs trained from the fully-annotated corpus, the hybrid framework gave a comparable performance with only a small set of lightly annotated sentences. © 2008. Licensed under the Creative Commons.
Resumo:
This paper presents a comparative study of three closely related Bayesian models for unsupervised document level sentiment classification, namely, the latent sentiment model (LSM), the joint sentiment-topic (JST) model, and the Reverse-JST model. Extensive experiments have been conducted on two corpora, the movie review dataset and the multi-domain sentiment dataset. It has been found that while all the three models achieve either better or comparable performance on these two corpora when compared to the existing unsupervised sentiment classification approaches, both JST and Reverse-JST are able to extract sentiment-oriented topics. In addition, Reverse-JST always performs worse than JST suggesting that the JST model is more appropriate for joint sentiment topic detection.
Resumo:
Joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required by JST model learning is domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors through modifying the topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, the in-domain supervised classifiers learned from augmented feature representation achieve the state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criteria for cross-domain sentiment classification, our proposed approach performs either better or comparably compared to previous approaches. Nevertheless, our approach is much simpler and does not require difficult parameter tuning.
Resumo:
The concept of plagiarism is not uncommonly associated with the concept of intellectual property, both for historical and legal reasons: the approach to the ownership of ‘moral’, nonmaterial goods has evolved to the right to individual property, and consequently a need was raised to establish a legal framework to cope with the infringement of those rights. The solution to plagiarism therefore falls most often under two categories: ethical and legal. On the ethical side, education and intercultural studies have addressed plagiarism critically, not only as a means to improve academic ethics policies (PlagiarismAdvice.org, 2008), but mainly to demonstrate that if anything the concept of plagiarism is far from being universal (Howard & Robillard, 2008). Even if differently, Howard (1995) and Scollon (1994, 1995) argued, and Angèlil-Carter (2000) and Pecorari (2008) later emphasised that the concept of plagiarism cannot be studied on the grounds that one definition is clearly understandable by everyone. Scollon (1994, 1995), for example, claimed that authorship attribution is particularly a problem in non-native writing in English, and so did Pecorari (2008) in her comprehensive analysis of academic plagiarism. If among higher education students plagiarism is often a problem of literacy, with prior, conflicting social discourses that may interfere with academic discourse, as Angèlil-Carter (2000) demonstrates, we then have to aver that a distinction should be made between intentional and inadvertent plagiarism: plagiarism should be prosecuted when intentional, but if it is part of the learning process and results from the plagiarist’s unfamiliarity with the text or topic it should be considered ‘positive plagiarism’ (Howard, 1995: 796) and hence not an offense. Determining the intention behind the instances of plagiarism therefore determines the nature of the disciplinary action adopted. Unfortunately, in order to demonstrate the intention to deceive and charge students with accusations of plagiarism, teachers necessarily have to position themselves as ‘plagiarism police’, although it has been argued otherwise (Robillard, 2008). Practice demonstrates that in their daily activities teachers will find themselves being required a command of investigative skills and tools that they most often lack. We thus claim that the ‘intention to deceive’ cannot inevitably be dissociated from plagiarism as a legal issue, even if Garner (2009) asserts that generally plagiarism is immoral but not illegal, and Goldstein (2003) makes the same severance. However, these claims, and the claim that only cases of copyright infringement tend to go to court, have recently been challenged, mainly by forensic linguists, who have been actively involved in cases of plagiarism. Turell (2008), for instance, demonstrated that plagiarism is often connoted with an illegal appropriation of ideas. Previously, she (Turell, 2004) had demonstrated by comparison of four translations of Shakespeare’s Julius Caesar to Spanish that the use of linguistic evidence is able to demonstrate instances of plagiarism. This challenge is also reinforced by practice in international organisations, such as the IEEE, to whom plagiarism potentially has ‘severe ethical and legal consequences’ (IEEE, 2006: 57). What plagiarism definitions used by publishers and organisations have in common – and which the academia usually lacks – is their focus on the legal nature. We speculate that this is due to the relation they intentionally establish with copyright laws, whereas in education the focus tends to shift from the legal to the ethical aspects. However, the number of plagiarism cases taken to court is very small, and jurisprudence is still being developed on the topic. In countries within the Civil Law tradition, Turell (2008) claims, (forensic) linguists are seldom called upon as expert witnesses in cases of plagiarism, either because plagiarists are rarely taken to court or because there is little tradition of accepting linguistic evidence. In spite of the investigative and evidential potential of forensic linguistics to demonstrate the plagiarist’s intention or otherwise, this potential is restricted by the ability to identify a text as being suspect of plagiarism. In an era with such a massive textual production, ‘policing’ plagiarism thus becomes an extraordinarily difficult task without the assistance of plagiarism detection systems. Although plagiarism detection has attracted the attention of computer engineers and software developers for years, a lot of research is still needed. Given the investigative nature of academic plagiarism, plagiarism detection has of necessity to consider not only concepts of education and computational linguistics, but also forensic linguistics. Especially, if intended to counter claims of being a ‘simplistic response’ (Robillard & Howard, 2008). In this paper, we use a corpus of essays written by university students who were accused of plagiarism, to demonstrate that a forensic linguistic analysis of improper paraphrasing in suspect texts has the potential to identify and provide evidence of intention. A linguistic analysis of the corpus texts shows that the plagiarist acts on the paradigmatic axis to replace relevant lexical items with a related word from the same semantic field, i.e. a synonym, a subordinate, a superordinate, etc. In other words, relevant lexical items were replaced with related, but not identical, ones. Additionally, the analysis demonstrates that the word order is often changed intentionally to disguise the borrowing. On the other hand, the linguistic analysis of linking and explanatory verbs (i.e. referencing verbs) and prepositions shows that these have the potential to discriminate instances of ‘patchwriting’ and instances of plagiarism. This research demonstrates that the referencing verbs are borrowed from the original in an attempt to construct the new text cohesively when the plagiarism is inadvertent, and that the plagiarist has made an effort to prevent the reader from identifying the text as plagiarism, when it is intentional. In some of these cases, the referencing elements prove being able to identify direct quotations and thus ‘betray’ and denounce plagiarism. Finally, we demonstrate that a forensic linguistic analysis of these verbs is critical to allow detection software to identify them as proper paraphrasing and not – mistakenly and simplistically – as plagiarism.
Resumo:
The semantic web (SW) vision is one in which rich, ontology-based semantic markup will become widely available. The availability of semantic markup on the web opens the way to novel, sophisticated forms of question answering. AquaLog is a portable question-answering system which takes queries expressed in natural language (NL) and an ontology as input, and returns answers drawn from one or more knowledge bases (KB). AquaLog presents an elegant solution in which different strategies are combined together in a novel way. AquaLog novel ontology-based relation similarity service makes sense of user queries.
Resumo:
We introduce ReDites, a system for realtime event detection, tracking, monitoring and visualisation. It is designed to assist Information Analysts in understanding and exploring complex events as they unfold in the world. Events are automatically detected from the Twitter stream. Then those that are categorised as being security-relevant are tracked, geolocated, summarised and visualised for the end-user. Furthermore, the system tracks changes in emotions over events, signalling possible flashpoints or abatement. We demonstrate the capabilities of ReDites using an extended use case from the September 2013 Westgate shooting incident. Through an evaluation of system latencies, we also show that enriched events are made available for users to explore within seconds of that event occurring.
Resumo:
With the proliferation of social media sites, social streams have proven to contain the most up-to-date information on current events. Therefore, it is crucial to extract events from the social streams such as tweets. However, it is not straightforward to adapt the existing event extraction systems since texts in social media are fragmented and noisy. In this paper we propose a simple and yet effective Bayesian model, called Latent Event Model (LEM), to extract structured representation of events from social media. LEM is fully unsupervised and does not require annotated data for training. We evaluate LEM on a Twitter corpus. Experimental results show that the proposed model achieves 83% in F-measure, and outperforms the state-of-the-art baseline by over 7%.© 2014 Association for Computational Linguistics.
Resumo:
Latent topics derived by topic models such as Latent Dirichlet Allocation (LDA) are the result of hidden thematic structures which provide further insights into the data. The automatic labelling of such topics derived from social media poses however new challenges since topics may characterise novel events happening in the real world. Existing automatic topic labelling approaches which depend on external knowledge sources become less applicable here since relevant articles/concepts of the extracted topics may not exist in external sources. In this paper we propose to address the problem of automatic labelling of latent topics learned from Twitter as a summarisation problem. We introduce a framework which apply summarisation algorithms to generate topic labels. These algorithms are independent of external sources and only rely on the identification of dominant terms in documents related to the latent topic. We compare the efficiency of existing state of the art summarisation algorithms. Our results suggest that summarisation algorithms generate better topic labels which capture event-related context compared to the top-n terms returned by LDA. © 2014 Association for Computational Linguistics.
Resumo:
In recent years, there has been an increas-ing interest in learning a distributed rep-resentation of word sense. Traditional context clustering based models usually require careful tuning of model parame-ters, and typically perform worse on infre-quent word senses. This paper presents a novel approach which addresses these lim-itations by first initializing the word sense embeddings through learning sentence-level embeddings from WordNet glosses using a convolutional neural networks. The initialized word sense embeddings are used by a context clustering based model to generate the distributed representations of word senses. Our learned represen-tations outperform the publicly available embeddings on 2 out of 4 metrics in the word similarity task, and 6 out of 13 sub tasks in the analogical reasoning task.
Resumo:
School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist and whether they are characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-of-thought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments show positive and promising results to the questions raised above © 2013 Association for Computational Linguistics. © 2013 Association for Computational Linguistics.