Accelerating Large Scale Centroid-based Clustering with Locality Sensitive Hashing


Autoria(s): McConville, Ryan; Cao, Xin; Liu, Weiru; Miller, Paul
Data(s)

01/05/2016

Identificador

http://pure.qub.ac.uk/portal/en/publications/accelerating-large-scale-centroidbased-clustering-with-locality-sensitive-hashing(6465140d-241e-49c5-b7d3-2d9581608d98).html

http://dx.doi.org/10.1109/ICDE.2016.7498278

http://pure.qub.ac.uk/ws/files/54939727/ICDE16_camera_version.pdf

http://icde2016.fi/

Idioma(s)

eng

Publicador

Institute of Electrical and Electronics Engineers (IEEE)

Direitos

info:eu-repo/semantics/openAccess

Fonte

McConville , R , Cao , X , Liu , W & Miller , P 2016 , Accelerating Large Scale Centroid-based Clustering with Locality Sensitive Hashing . in Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE) . Institute of Electrical and Electronics Engineers (IEEE) , pp. 649-660 , 32nd IEEE International Conference on Data Engineering , Helsinki , Finland , 16-20 May . DOI: 10.1109/ICDE.2016.7498278

Tipo

contributionToPeriodical

Resumo

Most traditional data mining algorithms struggle to cope with the sheer scale of data efficiently. In this paper, we propose a general framework to accelerate existing clustering algorithms to cluster large-scale datasets which contain large numbers of attributes, items, and clusters. Our framework makes use of locality sensitive hashing (LSH) to significantly reduce the cluster search space. We also theoretically prove that our framework has a guaranteed error bound in terms of the clustering quality. This framework can be applied to a set of centroid-based clustering algorithms that assign an object to the most similar cluster, and we adopt the popular K-Modes categorical clustering algorithm to present how the framework can be applied. We validated our framework with five synthetic datasets and a real world Yahoo! Answers dataset. The experimental results demonstrate that our framework is able to speed up the existing clustering algorithm between factors of 2 and 6, while maintaining comparable cluster purity.

Formato

application/pdf