Semi-supervised document clustering via loci


Author(s): Sutanto, Taufik; Nayak, Richi
Contributor(s)

Wang, Jianyong

Cellary, Wojciech

Wang, Dingding

Wang, Hua

Chen, Shu-Ching

Li, Tao

Zhang, Yanchun

Date(s)

01/11/2015

Abstract

Document clustering is one of the prominent methods for mining important information from the vast amount of data available on the web. However, document clustering generally suffers from the curse of dimensionality. Fortunately, in high-dimensional space, data points tend to be more concentrated in some areas of clusters. We take advantage of this phenomenon by introducing a novel concept of dynamic cluster representation called loci. Clusters' loci are efficiently calculated using documents' ranking scores generated by a search engine. We propose a fast loci-based semi-supervised document clustering algorithm that uses clusters' loci instead of conventional centroids for assigning documents to clusters. Empirical analysis on real-world datasets shows that the proposed method produces cluster solutions with promising quality and is substantially faster than several benchmarked centroid-based semi-supervised document clustering methods.
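
The abstract does not spell out how loci are computed, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that a cluster's locus is the set of the k documents ranked highest against a few labelled seed documents, and it uses cosine similarity over term-frequency vectors as a stand-in for a search engine's ranking score. The names `tf_vector`, `score`, `cluster_via_loci`, and the parameter `k` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Unit-length term-frequency vector stored as a dict (term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}


def score(query_vec, doc_vec):
    """Cosine similarity; an assumed stand-in for a search-engine ranking score."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


def cluster_via_loci(docs, seeds, k=2):
    """Assign each document to a cluster using loci instead of centroids.

    docs  : list of document strings
    seeds : {label: [indices of labelled documents]} (the supervision)
    k     : assumed locus size (number of top-ranked documents kept)
    """
    vecs = [tf_vector(d) for d in docs]

    # Assumed locus definition: the k documents ranked highest against the seeds.
    loci = {}
    for label, seed_ids in seeds.items():
        ranked = sorted(
            range(len(docs)),
            key=lambda i: sum(score(vecs[s], vecs[i]) for s in seed_ids),
            reverse=True,
        )
        loci[label] = ranked[:k]

    # Assign each document to the cluster whose locus members score it highest.
    labels = []
    for v in vecs:
        best = max(loci, key=lambda lab: sum(score(vecs[j], v) for j in loci[lab]))
        labels.append(best)
    return labels


docs = [
    "football match goal team",
    "team wins football league",
    "stock market shares fall",
    "market shares and bonds rise",
    "goal scored in football match",
    "bonds and stock prices",
]
seeds = {"sport": [0], "finance": [2]}
labels = cluster_via_loci(docs, seeds, k=2)
print(labels)
```

In this sketch a document is compared only with the k locus members of each cluster, never with a centroid averaged over all cluster members. This is one plausible reading of why a loci-based assignment could be cheaper than a centroid-based one.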

Format

application/pdf

Identifier

http://eprints.qut.edu.au/89750/

Publisher

Springer International Publishing

Relation

http://eprints.qut.edu.au/89750/1/WISE2015052.pdf

DOI:10.1007/978-3-319-26187-4_16

Sutanto, Taufik & Nayak, Richi (2015) Semi-supervised document clustering via loci. Lecture Notes in Computer Science, 9419, pp. 208-215.

Rights

Copyright 2015 Springer International Publishing Switzerland

The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-26187-4_16

Source

School of Electrical Engineering & Computer Science; Institute for Creative Industries and Innovation; Science & Engineering Faculty

Keywords

#080109 Pattern Recognition and Data Mining #Loci #Ranking #Semi-supervised clustering

Type

Journal Article