Clustering and labeling a web scale document collection using Wikipedia clusters


Author(s): Nayak, Richi; Mills, Rachel; De-Vries, Christopher; Geva, Shlomo
Contributor(s)

Yi, Zeng

Kotoulas, Spyros

Huang, Zhisheng

Date(s)

2014

Abstract

Clustering is an important technique for organising and categorising web-scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nearly impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison with a ground-truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution rests on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small-scale web. We found that the web-scale clustering and labeling process could be performed on a single desktop computer in under a couple of days for a Wikipedia clustering solution containing about 1000 clusters. Solutions with finer-granularity clusters, such as 10,000 or 50,000 clusters, take longer to execute. The results were evaluated using a set of external data.
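
The two-stage idea in the abstract (cluster a small reference collection, then map a much larger collection onto the resulting clusters) can be illustrated with a toy sketch. The abstract does not specify the document representation, the clustering algorithm, or how labels are derived, so everything below is an assumption made for illustration: bag-of-words vectors, cosine similarity, a small k-means-style loop with fixed seeds, and cluster ids standing in for labels. All function names and the sample texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed stand-in representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_cluster(vector, centroids):
    """Index of the centroid most similar to the vector."""
    return max(range(len(centroids)), key=lambda c: cosine(vector, centroids[c]))


def kmeans(vectors, seeds, iterations=5):
    """Tiny k-means-style loop; seeds are indices of the initial centroids."""
    centroids = [Counter(vectors[i]) for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [nearest_cluster(v, centroids) for v in vectors]
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                # Summed counts suffice because cosine ignores vector length.
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = total
    return centroids, assignment


# Stage 1: cluster the small reference collection (toy "Wikipedia" articles).
wiki_articles = [
    "football match goal team league",
    "guitar music band song album",
    "team league season goal player",
    "song album singer music concert",
]
wiki_vectors = [vectorize(a) for a in wiki_articles]
centroids, assignment = kmeans(wiki_vectors, seeds=[0, 1])

# Stage 2: map the large collection (toy "web pages") onto those clusters.
web_pages = [
    "the league announced a new player for the team",
    "new album and concert dates from the band",
]
mapped = [nearest_cluster(vectorize(p), centroids) for p in web_pages]

print("reference assignment:", assignment)
print("web page clusters:", mapped)
```

In this sketch the expensive clustering step touches only the reference collection, while each page of the larger collection needs just one comparison per centroid, which is one plausible reading of why a coarser solution (fewer clusters) would run faster than a finer one.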

Identifier

http://eprints.qut.edu.au/80070/

Publisher

ACM New York, NY, USA

Relation

DOI:10.1145/2663792.2663803

Nayak, Richi, Mills, Rachel, De-Vries, Christopher, & Geva, Shlomo (2014) Clustering and labeling a web scale document collection using Wikipedia clusters. In Yi, Zeng, Kotoulas, Spyros, & Huang, Zhisheng (Eds.) Web-KR '14 Proceedings of the 5th International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning, ACM New York, NY, USA, Shanghai, China, pp. 23-30.

Source

School of Electrical Engineering & Computer Science; Science & Engineering Faculty

Keywords #Document clustering #Big data #Wikipedia #ClueWeb #Document signature

Type

Conference Paper