22 results for 380202 Computational Linguistics
Abstract:
Uncertainty text detection is important to many social-media-based applications, since more and more users rely on social media platforms (e.g., Twitter, Facebook) as an information source from which to produce or derive interpretations. However, existing uncertainty cues are ineffective in the social media context because of its specific characteristics. In this paper, we propose a variant annotation scheme for uncertainty identification and construct the first uncertainty corpus based on tweets. We then conduct experiments on the constructed tweet corpus to study the effectiveness of different types of features for uncertainty text identification. © 2013 Association for Computational Linguistics.
Abstract:
Storyline detection from news articles aims to summarize the events described under a given news topic and to reveal how those events evolve over time. It is a difficult task because it requires first detecting events from news articles published in different time periods and then constructing storylines by linking events into coherent news stories. Moreover, each storyline has different hierarchical structures which are dependent across epochs. Existing approaches often ignore this dependency of hierarchical structures in storyline generation. In this paper, we propose an unsupervised Bayesian model, called the dynamic storyline detection model, to extract structured representations and evolution patterns of storylines. The proposed model is evaluated on a large-scale news corpus. Experimental results show that it outperforms several baseline approaches.
Abstract:
Persuasive communication is the process of shaping, reinforcing and changing others' responses. In political debates, speakers express their views on the debated topics by choosing both the content of their discourse and the argumentation process. In this work we study the use of semantic frames for modelling argumentation in speakers' discourse. We investigate a speaker's argumentation style and its effect in influencing an audience to support their candidature. We model the influence index of each candidate based on their relative standings in the polls released prior to the debate, and present a system which ranks speakers in terms of their relative influence using a combination of content and persuasive argumentation features. Our results show that although content alone is predictive of a speaker's influence rank, persuasive argumentation also affects such indices.
Abstract:
Event extraction from texts aims to detect structured information such as what has happened, to whom, where and when. Event extraction and visualization are typically treated as two separate tasks. In this paper, we propose a novel approach based on probabilistic modelling to jointly extract and visualize events from tweets, so that both tasks benefit from each other. We model each event as a joint distribution over named entities, a date, a location and event-related keywords. Moreover, both tweets and event instances are associated with coordinates in the visualization space. The manifold assumption, that the intrinsic geometry of tweets is a low-rank, non-linear manifold within the high-dimensional space, is incorporated into the learning framework through a regularization term. Experimental results show that the proposed approach can effectively deal with both event extraction and visualization, and performs remarkably better than both the state-of-the-art event extraction method and a pipeline approach for event extraction and visualization.
Abstract:
Conventional topic models are ineffective for topic extraction from microblog messages, since the lack of structure and context among the posts results in poor message-level word co-occurrence patterns. In this work, we organize microblog posts as conversation trees based on reposting and replying relations, which enrich context information to alleviate data sparseness. Our model generates words according to topic dependencies derived from the conversation structures. Specifically, we differentiate between leader messages, which initiate key aspects of previously focused topics or shift the focus to different topics, and follower messages, which do not introduce any new information but simply echo topics from the messages that they repost or reply to. Our model captures the different extents to which leader and follower messages may contain the key topical words, and thus further enhances the quality of the induced topics. The results of thorough experiments demonstrate the effectiveness of our proposed model.
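The grouping of posts into conversation trees described above can be illustrated with a minimal sketch. It assumes each post is given as a hypothetical `(post_id, parent_id)` pair, where `parent_id` is the post being reposted or replied to, or `None` for an original post. The function name and the nested-dict output are illustrative choices, not the paper's data structures, and the sketch covers only the tree-building step, not the topic model.

```python
from collections import defaultdict


def build_conversation_trees(posts):
    """Group posts into conversation trees.

    `posts` is a list of (post_id, parent_id) pairs (a hypothetical input
    format). Returns a dict mapping each root post id to a nested dict of
    its replies and reposts.
    """
    known_ids = {post_id for post_id, _ in posts}
    children = defaultdict(list)
    roots = []
    for post_id, parent_id in posts:
        # A post whose parent is absent from the data starts its own tree.
        if parent_id is None or parent_id not in known_ids:
            roots.append(post_id)
        else:
            children[parent_id].append(post_id)

    def subtree(node):
        # Recursively collect the replies/reposts below `node`.
        return {child: subtree(child) for child in children[node]}

    return {root: subtree(root) for root in roots}
```

For example, passing `[("a", None), ("b", "a"), ("c", "b")]` yields a single tree rooted at `"a"` with `"b"` as its child and `"c"` as its grandchild. The sketch assumes the parent links contain no cycles.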
Abstract:
We study the problem of detecting sentences describing adverse drug reactions (ADRs) and frame the problem as binary classification. We investigate different neural network (NN) architectures for ADR classification. In particular, we propose two new neural network models: the Convolutional Recurrent Neural Network (CRNN), which concatenates convolutional neural networks with recurrent neural networks, and the Convolutional Neural Network with Attention (CNNA), which adds attention weights to convolutional neural networks. We evaluate various NN architectures on a Twitter dataset containing informal language and on an Adverse Drug Effects (ADE) dataset constructed by sampling from MEDLINE case reports. Experimental results show that, on both datasets, all the NN architectures considerably outperform traditional maximum entropy classifiers trained on n-grams with different weighting strategies. On the Twitter dataset, all the NN architectures perform similarly. On the ADE dataset, however, the plain CNN performs better than the more complex CNN variants. Nevertheless, CNNA allows the visualisation of attention weights of words when making classification decisions and hence is more appropriate for the extraction of word subsequences describing ADRs.
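To illustrate why per-word attention weights can be visualised, the following is a minimal sketch of dot-product attention pooling over per-word feature vectors. It is not the CNNA architecture from the paper: the `query` vector, the function names and the plain-list representation are all assumptions made for illustration, and in a trained model the features and query would be learned.

```python
import math


def softmax(scores):
    """Convert raw scores to weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(word_vectors, query):
    """Return (weights, pooled) for a sentence.

    `word_vectors` holds one feature vector per word; `query` is a
    hypothetical attention vector of the same dimension. Each weight
    indicates how much the corresponding word contributes to the pooled
    sentence representation.
    """
    scores = [sum(w * q for w, q in zip(vec, query)) for vec in word_vectors]
    weights = softmax(scores)
    pooled = [
        sum(weight * vec[d] for weight, vec in zip(weights, word_vectors))
        for d in range(len(query))
    ]
    return weights, pooled
```

The returned `weights` list aligns one-to-one with the input words, so the highest-weighted words can be highlighted as candidate spans, which is the kind of inspection the abstract attributes to CNNA.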
Abstract:
This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Written texts were obtained from two Portuguese universities and from a Portuguese newspaper. These data are analysed linguistically to identify instances of verbatim, morpho-syntactical, lexical and discursive overlap. Survey data were obtained from two higher education institutions in Portugal and another two in the United Kingdom. These data are analysed using a 2 by 2 between-groups univariate analysis of variance (ANOVA) to reveal cross-cultural divergences in the perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism or, conversely, to rejecting punishment. The research adopts a critical approach to plagiarism detection. On the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources, and, on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive and, if so, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection, and identifies strategies that these systems fail to detect. Specifically, a method is proposed to detect translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues to the plagiarist’s intentionality.
Furthermore, the findings show that translingual plagiarism can be detected using the proposed method, and that plagiarism detection software can be improved using existing computer tools.
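As a rough illustration of how verbatim overlap between two texts can be quantified computationally, the sketch below computes word n-gram containment, a common baseline measure. It is not the method proposed in the study: the function names, the whitespace tokenisation and the default trigram size are assumptions, and a measure of this kind captures only verbatim reuse, not the morpho-syntactical, discursive or translingual strategies the abstract discusses.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspect, source, n=3):
    """Fraction of the suspect text's n-grams that also occur in the source.

    Returns 0.0 when the suspect text is too short to contain any n-gram.
    """
    suspect_grams = word_ngrams(suspect, n)
    if not suspect_grams:
        return 0.0
    return len(suspect_grams & word_ngrams(source, n)) / len(suspect_grams)
```

A value near 1 suggests that most of the suspect passage appears verbatim in the source, while paraphrased or translated passages score low, which is one reason purely string-based systems can miss those strategies.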