7 resultados para Short-text clustering

em Aston University Research Archive


Relevância:

80.00% 80.00%

Publicador:

Resumo:

As microblog services such as Twitter become a fast and convenient communication approach, identification of trendy topics in microblog services has great academic and business value. However detecting trendy topics is very challenging due to huge number of users and short-text posts in microblog diffusion networks. In this paper we introduce a trendy topics detection system under computation and communication resource constraints. In stark contrast to retrieving and processing the whole microblog contents, we develop an idea of selecting a small set of microblog users and processing their posts to achieve an overall acceptable trendy topic coverage, without exceeding resource budget for detection. We formulate the selection operation of these subset users as mixed-integer optimization problems, and develop heuristic algorithms to compute their approximate solutions. The proposed system is evaluated with real-time test data retrieved from Sina Weibo, the dominant microblog service provider in China. It's shown that by monitoring 500 out of 1.6 million microblog users and tracking their microposts (about 15,000 daily) with our system, nearly 65% trendy topics can be detected, while on average 5 hours earlier before they appear in Sina Weibo official trends.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Short text messages a.k.a Microposts (e.g. Tweets) have proven to be an effective channel for revealing information about trends and events, ranging from those related to Disaster (e.g. hurricane Sandy) to those related to Violence (e.g. Egyptian revolution). Being informed about such events as they occur could be extremely important to authorities and emergency professionals by allowing such parties to immediately respond. In this work we study the problem of topic classification (TC) of Microposts, which aims to automatically classify short messages based on the subject(s) discussed in them. The accurate TC of Microposts however is a challenging task since the limited number of tokens in a post often implies a lack of sufficient contextual information. In order to provide contextual information to Microposts, we present and evaluate several graph structures surrounding concepts present in linked knowledge sources (KSs). Traditional TC techniques enrich the content of Microposts with features extracted only from the Microposts content. In contrast our approach relies on the generation of different weighted semantic meta-graphs extracted from linked KSs. We introduce a new semantic graph, called category meta-graph. This novel meta-graph provides a more fine grained categorisation of concepts providing a set of novel semantic features. Our findings show that such category meta-graph features effectively improve the performance of a topic classifier of Microposts. Furthermore our goal is also to understand which semantic feature contributes to the performance of a topic classifier. For this reason we propose an approach for automatic estimation of accuracy loss of a topic classifier on new, unseen Microposts. We introduce and evaluate novel topic similarity measures, which capture the similarity between the KS documents and Microposts at a conceptual level, considering the enriched representation of these documents. Extensive evaluation in the context of Emergency Response (ER) and Violence Detection (VD) revealed that our approach outperforms previous approaches using single KS without linked data and Twitter data only up to 31.4% in terms of F1 measure. Our main findings indicate that the new category graph contains useful information for TC and achieves comparable results to previously used semantic graphs. Furthermore our results also indicate that the accuracy of a topic classifier can be accurately predicted using the enhanced text representation, outperforming previous approaches considering content-based similarity measures. © 2014 Elsevier B.V. All rights reserved.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Topic classification (TC) of short text messages offers an effective and fast way to reveal events happening around the world ranging from those related to Disaster (e.g. Sandy hurricane) to those related to Violence (e.g. Egypt revolution). Previous approaches to TC have mostly focused on exploiting individual knowledge sources (KS) (e.g. DBpedia or Freebase) without considering the graph structures that surround concepts present in KSs when detecting the topics of Tweets. In this paper we introduce a novel approach for harnessing such graph structures from multiple linked KSs, by: (i) building a conceptual representation of the KSs, (ii) leveraging contextual information about concepts by exploiting semantic concept graphs, and (iii) providing a principled way for the combination of KSs. Experiments evaluating our TC classifier in the context of Violence detection (VD) and Emergency Responses (ER) show promising results that significantly outperform various baseline models including an approach using a single KS without linked data and an approach using only Tweets. Copyright 2013 ACM.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The search by many investigators for a solution to the reading problems encountered by individuals with no central vision has been long and, to date, not very fruitful. Most textual manipulations, including font size, have led to only modest gains in reading speed. Previous work on spatial integrative properties of peripheral retina suggests that 'visual crowding' may be a major factor contributing to inefficient reading. Crowding refers to the fact that juxtaposed targets viewed eccentrically may be difficult to identify. The purpose of this study was to assess the combined effects of line spacing and word spacing on the ability of individuals with age-related macular degeneration (ARMD) to read short passages of text that were printed with either high (87.5%) or low contrast (17.5%) letters. Low contrast text was used to avoid potential ceiling effects and to mimic a possible reduction in letter contrast with light scatter from media opacities. For both low and high contrast text, the fastest reading speeds we measured were for passages of text with double line and double word spacing. In comparison with standard single spacing, double word/line spacing increased reading speed by approximately 26% with high contrast text (p < 0.001), and by 46% with low contrast text (p < 0.001). In addition, double line/word spacing more than halved the number of reading errors obtained with single spaced text. We compare our results with previous reading studies on ARMD patients, and conclude that crowding is detrimental to reading and that its effects can be reduced with enhanced text spacing. Spacing is particularly important when the contrast of the text is reduced, as may occur with intraocular light scatter or poor viewing conditions. We recommend that macular disease patients should employ double line spacing and double-character word spacing to maximize their reading efficiency. © 2013 Blackmore-Wright et al.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Ontology construction for any domain is a labour intensive and complex process. Any methodology that can reduce the cost and increase efficiency has the potential to make a major impact in the life sciences. This paper describes an experiment in ontology construction from text for the Animal Behaviour domain. Our objective was to see how much could be done in a simple and rapid manner using a corpus of journal papers. We used a sequence of text processing steps, and describe the different choices made to clean the input, to derive a set of terms and to structure those terms in a hierarchy. We were able in a very short space of time to construct a 17000 term ontology with a high percentage of suitable terms. We describe some of the challenges, especially that of focusing the ontology appropriately given a starting point of a heterogeneous corpus.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In recent years, there has been an increas-ing interest in learning a distributed rep-resentation of word sense. Traditional context clustering based models usually require careful tuning of model parame-ters, and typically perform worse on infre-quent word senses. This paper presents a novel approach which addresses these lim-itations by first initializing the word sense embeddings through learning sentence-level embeddings from WordNet glosses using a convolutional neural networks. The initialized word sense embeddings are used by a context clustering based model to generate the distributed representations of word senses. Our learned represen-tations outperform the publicly available embeddings on 2 out of 4 metrics in the word similarity task, and 6 out of 13 sub tasks in the analogical reasoning task.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Building an interest model is the key to realize personalized text recommendation. Previous interest models neglect the fact that a user may have multiple angles of interests. Different angles of interest provide different requests and criteria for text recommendation. This paper proposes an interest model that consists of two kinds of angles: persistence and pattern, which can be combined to form complex angles. The model uses a new method to represent the long-term interest and the short-term interest, and distinguishes the interest on object and the interest on the link structure of objects. Experiments with news-scale text data show that the interest on object and the interest on link structure have real requirements, and it is effective to recommend texts according to the angles.