968 results for Short-text clustering


Relevance: 100.00%

Abstract:

Important words, which usually appear in the Title, Subject and Keywords fields, can briefly reflect the main topic of a document. In recent years it has become common practice to exploit the semantic topics of documents and use important words to cluster them, especially for short texts such as news articles. This paper proposes a novel method that extracts important words from the Subject and Keywords fields of articles and then partitions documents using only those words. Because important words usually occur with low frequency and the resulting term matrix is small, a normalization method is also proposed so that the limited information can be fully exploited and more accurate results achieved. Experiments validate the effectiveness of the method.
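The idea of clustering on a small set of important words with normalized weights can be sketched as follows (a minimal illustration; the paper's exact extraction and normalization steps are not specified here, and `important_word_vectors` is a hypothetical helper):

```python
from collections import Counter

def important_word_vectors(docs, important):
    """Represent each document only by its important words
    (illustrative sketch; not the paper's exact method)."""
    vecs = []
    for words in docs:
        counts = Counter(w for w in words if w in important)
        total = sum(counts.values())
        # Normalize the sparse, low-frequency counts to relative
        # frequencies (illustrative; the paper's normalization may differ).
        vecs.append({w: c / total for w, c in counts.items()} if total else {})
    return vecs

docs = [["news", "cluster", "cluster"], ["unrelated"]]
print(important_word_vectors(docs, {"news", "cluster"}))
```

Documents whose words fall entirely outside the important-word set simply get an empty vector, which makes the low-frequency problem the paper describes visible.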

Relevance: 100.00%

Abstract:

Streams of short text, such as news titles, enable us to learn effectively and efficiently about real-world events that occur anywhere and at any time. Short text messages are accompanied by timestamps and generally describe events in only a few words, which differentiates them from longer text documents such as web pages, news stories, blogs, technical papers and books. For example, few words repeat within the same news title, so term frequency (TF) is not as informative in a short text corpus as it is in a longer one. Analysis of short text therefore faces new challenges. Moreover, detecting and tracking events through short text analysis requires reliably identifying events from consistent topic clusters; however, existing methods such as Latent Dirichlet Allocation (LDA) generate different topic results for the same corpus across executions. In this paper, we provide a Finding Topic Clusters using Co-occurring Terms (FTCCT) algorithm to automatically generate topics from a short text corpus, and develop an Event Evolution Mining (EEM) algorithm to discover hot events and their evolutions (i.e., how the popularity of events changes over time). In FTCCT, a term (i.e., a single word or a multi-word phrase) belongs to only one topic in a corpus. Experiments on news titles from 157 countries over 4 months (July to October 2013) demonstrate that our FTCCT-based method (combining FTCCT and EEM) achieves far higher quality in event content and description words than an LDA-based method (combining LDA and EEM) for analyzing streams of short text. Our method also visualizes the evolution of hot events. The discovered worldwide event evolutions reveal some interesting correlations among events; for example, successive extreme weather phenomena occurred in different locations: a typhoon in Hong Kong and the Philippines followed a hurricane and storm flood in Mexico in September 2013. © 2014 Springer Science+Business Media New York.
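The one-term-one-topic idea behind FTCCT can be illustrated with a much-simplified sketch that partitions terms into disjoint clusters by co-occurrence (this is not the published algorithm; `cooccurrence_topics` and the `min_co` threshold are illustrative assumptions):

```python
from itertools import combinations
from collections import Counter

def cooccurrence_topics(titles, min_co=2):
    """Group terms into disjoint topic clusters by repeated co-occurrence
    (union-find sketch; each term ends up in exactly one cluster)."""
    co = Counter()
    for words in titles:
        for a, b in combinations(sorted(set(words)), 2):
            co[(a, b)] += 1
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    # Merge terms whose co-occurrence count reaches the threshold.
    for (a, b), n in co.items():
        if n >= min_co:
            parent[find(a)] = find(b)
    groups = {}
    for words in titles:
        for w in set(words):
            groups.setdefault(find(w), set()).add(w)
    return sorted(map(sorted, groups.values()))
```

Because merging is deterministic, repeated runs over the same corpus produce the same clusters, in contrast to the run-to-run variation of LDA noted above.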

Relevance: 100.00%

Abstract:

This project is a step forward in the study of text mining, where enhanced text representation with semantic information plays a significant role. It develops effective methods for entity-oriented retrieval, semantic relation identification and text clustering utilizing semantically annotated data. These methods are based on an enriched text representation generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several state-of-the-art benchmark methods on real-life datasets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles clustering documents with multiple feature spaces.

Relevance: 100.00%

Abstract:

This paper presents an integration of a novel document vector representation technique with a novel Growing Self-Organizing Process. In the new approach, each document is represented as a low-dimensional vector composed of the indices and weights derived from the document's keywords.

An index-based similarity calculation method is employed on this low-dimensional feature space, and the growing self-organizing process is modified to comply with the new feature representation model.

Initial experiments show that this integration outperforms state-of-the-art Self-Organizing Map based text clustering techniques in efficiency while preserving the same level of accuracy.
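The index-based representation and similarity described above might be sketched as follows (weights here are simple relative keyword frequencies, an assumption; the paper's weighting scheme may differ):

```python
from collections import Counter

def doc_vector(keywords, vocab_index):
    """Low-dimensional sparse representation: {index: weight} pairs for
    the document's keywords only (illustrative weighting)."""
    counts = Counter(k for k in keywords if k in vocab_index)
    total = sum(counts.values())
    return {vocab_index[k]: c / total for k, c in counts.items()} if total else {}

def index_similarity(u, v):
    # Only indices present in both sparse vectors contribute, so the
    # comparison cost scales with keyword count, not vocabulary size.
    return sum(w * v[i] for i, w in u.items() if i in v)

vi = {"news": 0, "sport": 1, "cluster": 2}
u = doc_vector(["news", "news", "cluster"], vi)
v = doc_vector(["news"], vi)
print(index_similarity(u, v))
```

Keeping only keyword indices is what makes the representation low-dimensional relative to a full vocabulary-sized vector.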

Relevance: 100.00%

Abstract:

Text clustering can be considered a four-step process consisting of feature extraction, text representation, document clustering and cluster interpretation. Most text clustering models treat text as an unordered collection of words; however, the semantics of text would be better captured if word sequences were taken into account.

In this paper we propose a sequence-based text clustering model in which four novel sequence-based components are introduced, one at each of the four steps of the text clustering process.

Experiments conducted on the Reuters dataset and the Sydney Morning Herald (SMH) news archives demonstrate the advantage of the proposed sequence-based model in capturing context with semantics, accuracy and speed, compared to clustering documents based on single words and on n-gram models.
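The core advantage of sequence-aware features over unordered bags of words can be seen in a small sketch (generic contiguous n-grams, not the paper's four components):

```python
def ngrams(tokens, n=2):
    """Sequence-preserving features: contiguous n-grams (a generic
    sketch of sequence-based representation, not the paper's model)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "river bank flooded".split()
b = "bank river flooded".split()
# Same bag of words, different bigram sets: word order carries meaning
# that an unordered representation discards.
print(set(a) == set(b), set(ngrams(a)) == set(ngrams(b)))
```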

Relevance: 100.00%

Abstract:

The Dirichlet process mixture (DPM) model, a typical Bayesian nonparametric model, can infer the number of clusters automatically, which gives it an advantage in data clustering. This paper investigates the influence of pairwise constraints in the DPM model. Pairwise constraints come in two types, must-link (ML) and cannot-link (CL), and indicate the relationship between two data points. We propose two models that incorporate pairwise constraints: the constrained DPM (C-DPM) and the constrained DPM with selected constraints (SC-DPM). In C-DPM, the concept of a chunklet is introduced: ML constraints are compiled into chunklets, and CL constraints hold between chunklets. We derive Gibbs sampling for the C-DPM based on chunklets. We further propose a principled approach to selecting the most useful constraints, which are incorporated into the SC-DPM. We evaluate the proposed models on three real datasets: the 20 Newsgroups dataset, the NUS-WIDE image dataset and a Facebook comments dataset that we collected ourselves. Our SC-DPM achieves superior clustering performance and can potentially be used for short-text clustering.
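Compiling ML constraints into chunklets amounts to taking the transitive closure of the must-link relation; a union-find sketch of that step (the Gibbs sampler itself is omitted):

```python
def compile_chunklets(points, must_links):
    """Compile must-link pairs into chunklets via transitive closure:
    if a~b and b~c, then {a, b, c} form one chunklet. Union-find sketch
    of the C-DPM preprocessing step described in the abstract."""
    parent = {p: p for p in points}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in must_links:
        parent[find(a)] = find(b)
    chunklets = {}
    for p in points:
        chunklets.setdefault(find(p), []).append(p)
    return sorted(chunklets.values())

print(compile_chunklets([1, 2, 3, 4, 5], [(1, 2), (2, 3)]))
```

During sampling, each chunklet is then treated as an atomic unit, and CL constraints are enforced between chunklets rather than between individual points.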

Relevance: 100.00%

Abstract:

We propose a cluster ensemble method that maps corpus documents into the semantic space embedded in Wikipedia and groups them using multiple types of feature space. A heterogeneous cluster ensemble is constructed from multiple types of relations, i.e., document-term, document-concept and document-category. A final clustering solution is obtained by exploiting associations between document pairs and the hubness of documents. Empirical analysis on various real datasets reveals that the proposed method outperforms state-of-the-art text clustering approaches.
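One common way to combine clusterings from several feature spaces is a co-association matrix counting how often each document pair lands in the same cluster; a generic sketch (the paper's hubness-based consensus step is not reproduced here):

```python
def coassociation(labelings):
    """Fraction of base clusterings (e.g. from term, concept and
    category feature spaces) that place each document pair together.
    Generic ensemble sketch, not the paper's exact method."""
    n = len(labelings[0])
    k = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / k
             for j in range(n)] for i in range(n)]

term_labels = [0, 0, 1, 1]      # clustering in document-term space
concept_labels = [0, 0, 0, 1]   # clustering in document-concept space
print(coassociation([term_labels, concept_labels]))
```

The resulting matrix can be fed to any similarity-based clusterer to produce the final consensus partition.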

Relevance: 90.00%

Abstract:

Email overload is a growing problem: people face increasing difficulty processing the large number of emails they receive daily. The problem has become serious enough to affect the normal use of email as a knowledge management tool. It is recognized that categorizing emails into meaningful groups can greatly reduce the cognitive load of processing them, making categorization an effective way to manage email overload. However, most current approaches still require significant human input when categorizing emails. In this paper we develop an automatic email clustering system, underpinned by a new nonparametric text clustering algorithm. The system does not require any predefined input parameters and can automatically generate meaningful email clusters. Experiments show that our algorithm outperforms existing text clustering algorithms in both computational time and clustering quality, measured by several gauges.

Relevance: 90.00%

Abstract:

We propose in this paper a novel sparse subspace clustering method that regularizes the sparse subspace representation by exploiting structural sharing between tasks and data points via group sparse coding. We derive simple, provably convergent and computationally efficient algorithms for solving the proposed group formulations. We demonstrate the advantage of the framework on three challenging benchmark datasets, ranging from medical record data to image and text clustering, and show that it consistently outperforms rival methods.
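Group sparse coding typically relies on a proximal step that shrinks a whole group of coefficients toward zero together; a minimal sketch of that operator (an assumption for illustration, not the authors' full subspace clustering algorithm):

```python
import math

def group_shrink(v, lam):
    """Proximal operator of the group-lasso penalty lam * ||v||_2:
    scale the whole group by max(0, 1 - lam/||v||), so either the
    entire group survives (shrunk) or it is zeroed out together."""
    norm = math.sqrt(sum(x * x for x in v))
    scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
    return [scale * x for x in v]

print(group_shrink([3.0, 4.0], 1.0))   # group norm 5 exceeds lam: shrunk
print(group_shrink([0.2, 0.1], 1.0))   # group norm below lam: zeroed
```

This all-or-nothing behavior at the group level is what lets the method share structure across tasks and data points.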

Relevance: 80.00%

Abstract:

This study is conducted within the IS-Impact Research Track at Queensland University of Technology (QUT). The goal of the IS-Impact Track is "to develop the most widely employed model for benchmarking information systems in organizations for the joint benefit of both research and practice" (Gable et al., 2006). IS-Impact is defined as "a measure at a point in time, of the stream of net benefits from the IS, to date and anticipated, as perceived by all key-user-groups" (Gable, Sedera and Chan, 2008). Track efforts have yielded the bicameral IS-Impact measurement model; the "impact" half includes the Organizational-Impact and Individual-Impact dimensions, and the "quality" half includes the System-Quality and Information-Quality dimensions. The IS-Impact model is, by design, intended to be robust, simple and generalizable, yielding results that are comparable across time, stakeholders, and different systems and system contexts. The model and measurement approach employ perceptual measures and an instrument that is relevant to key stakeholder groups, thereby enabling the combination or comparison of stakeholder perspectives. Such a validated and widely accepted IS-Impact measurement model has both academic and practical value. It facilitates the systematic operationalization of a main dependent variable in research (IS-Impact), which can also serve as an important independent variable. For IS management practice it provides a means to benchmark and track the performance of information systems in use. The objective of this study is to develop a Mandarin-version IS-Impact model, encompassing a list of China-specific IS-Impact measures, to aid understanding of the IS-Impact phenomenon in a Chinese organizational context. The IS-Impact model provides much-needed theoretical guidance for this investigation of ES and ES impacts in a Chinese context.
The appropriateness and soundness of employing the IS-Impact model as a theoretical foundation are evident: the model originated from a sound theory of IS Success (1992), was developed through rigorous validation, and was derived in the context of Enterprise Systems. Based on the IS-Impact model, this study investigates a number of research questions (RQs). Firstly, the research investigates what essential impacts Chinese users and organizations have derived from ES [RQ1]. Secondly, we investigate which salient quality features of ES are perceived by Chinese users [RQ2]. Thirdly, we seek to answer whether the quality and impact measures are sufficient to assess ES success in general [RQ3]. Lastly, the study addresses whether the IS-Impact measurement model is appropriate for Chinese organizations evaluating their ES [RQ4]. An open-ended, qualitative identification survey was employed. A large body of short-text data was gathered from 144 Chinese users, and 633 valid IS-Impact statements were generated from the data set. A general inductive approach was applied in the qualitative data analysis. Rigorous qualitative coding resulted in 50 first-order categories under 6 second-order categories grounded in the context of Chinese organizations. The six second-order categories are: 1) System Quality; 2) Information Quality; 3) Individual Impacts; 4) Organizational Impacts; 5) User Quality and 6) IS Support Quality. The final finding of the study is a contextualized Mandarin-version IS-Impact measurement model comprising 38 measures organized into 4 dimensions: System Quality, Information Quality, Individual Impacts and Organizational Impacts.
The study also proposes two conceptual models to harmonize the IS-Impact model with the two emergent constructs, User Quality and IS Support Quality, by drawing on the prior IS effectiveness literature and on the Work System theory proposed by Alter (1999), respectively. The study is significant as the first effort to empirically and comprehensively investigate IS-Impact in China. Its contributions can be classified as theoretical and practical. From the theoretical perspective, through qualitative evidence, the study tests and consolidates the IS-Impact measurement model in terms of robustness, completeness and generalizability. The unconventional research design exhibits the creativity of the study: the theoretical model does not act as a top-down a priori framework seeking evidence of its own credibility; rather, the study allows a competing model to emerge from bottom-up, open-coding analysis. The study is also an example of extending and localizing a pre-existing theory, developed in a Western context, when that theory is introduced into a different context. From the practical perspective, this is the first time prominent research findings in the field of IS Success have been introduced to Chinese academia and practitioners. The study provides a guideline for Chinese organizations to assess their Enterprise Systems and to leverage future IT investment. As a research effort in the ITPS track, this study provides the research team with an alternative operationalization of the dependent variable. Future research can take the contextualized Mandarin-version IS-Impact framework as a theoretical a priori model and further test its validity quantitatively and empirically.

Relevance: 80.00%

Abstract:

Purpose – The work presented in this paper aims to provide an approach to classifying web logs by the personal properties of users. Design/methodology/approach – The authors describe an iterative system that begins with a small set of manually labeled terms, which are used to label queries from the log. A set of background knowledge related to these labeled queries is acquired by combining web search results for the queries. This background set is used to obtain many terms related to the classification task. The system then ranks the related terms, choosing those that best fit the personal properties of the users; these terms are then used to begin the next iteration. Findings – The authors identify the difficulties of classifying web logs by approaching the problem from a machine learning perspective. By applying the developed approach, the authors show that many queries in a large query log can be classified. Research limitations/implications – Evaluating this type of classification work is difficult, as the true personal properties of web users are unknown. Evaluation of the classification results by comparing classified queries to well-known age-related sites is a direction currently being explored. Practical implications – This research is background work that can be incorporated into search engines or other web-based applications to help marketing companies and advertisers. Originality/value – This research enhances the current state of knowledge in short-text classification and query log learning. Keywords: Classification schemes, Computer networks, Information retrieval, Man-machine systems, User interfaces
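The iterative labeling loop described above can be sketched roughly as follows (a hypothetical helper with simple co-occurrence scoring standing in for the authors' ranking method and background-knowledge step):

```python
def bootstrap_labels(queries, seed_terms, candidate_terms, top_k=2):
    """One iteration of a seed-term bootstrap: label queries containing
    any seed term, then promote the candidate terms that co-occur most
    with labeled queries. Illustrative sketch, not the paper's system."""
    labeled = [q for q in queries if any(t in q for t in seed_terms)]
    # Rank candidate terms by how often they appear in labeled queries
    # (the paper acquires candidates from web search results instead).
    scores = {t: sum(t in q for q in labeled) for t in candidate_terms}
    new_terms = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return labeled, set(seed_terms) | set(new_terms)

queries = [["game", "minecraft"], ["mortgage", "rate"], ["game", "roblox"]]
labeled, terms = bootstrap_labels(queries, {"game"}, ["minecraft", "roblox", "rate"])
print(labeled, terms)
```

Repeating the call with the expanded term set labels progressively more of the log, which is the essence of the iteration the abstract describes.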

Relevance: 80.00%

Abstract:

Students' learning can be negatively affected when their control of reading comprehension is not appropriate. This research analyzes how a group of high-school students evaluate their reading comprehension of manipulated scientific texts. An analysis tool was designed to determine the students' degree of comprehension control when reading a short scientific text containing an added contradiction. The results reveal that students in 1st and 3rd ESO do not properly self-evaluate their reading comprehension. Different behavior was observed in 1st Bachillerato, where appropriate evaluation and regulation appear to be more frequent. Moreover, no significant differences were found regarding type of text, year or gender. Finally, as identified by previous research, the correlation between students' comprehension control and their school marks is weak and inversely proportional to the students' age.

Relevance: 80.00%

Abstract:

Master's dissertation presented to the Instituto de Contabilidade e Administração do Porto for the degree of Master in Digital Marketing, under the supervision of Sandrina Teixeira and Anabela Ribeiro.