A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters


Author(s): Silva, Joaquim; Mexia, Joao; Coelho, Carlos A.; Lopes, Gabriel
Date(s)

23/02/2014

2004

Abstract

2000 Mathematics Subject Classification: 62H30

This paper describes a statistics-based methodology for unsupervised document clustering and for extracting topics from the resulting clusters. Multiword lexical units (MWUs) of any length are automatically extracted from corpora using the LiPXtractor extractor, a language-independent, statistics-based tool. The MWUs are taken as base features to describe documents. These features are transformed and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. The Model Based Clustering Analysis software is then used to determine the best number of clusters. Precision and Recall for document-cluster assignment are above 90%. The most important MWUs are extracted from each cluster and taken as the cluster's topics. Results on the classification of new documents are only briefly mentioned.
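
The step from MWU features to a document similarity matrix can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: the MWU list is written by hand rather than extracted by a tool, the features are raw occurrence counts with no transformation, and cosine similarity stands in for whatever similarity measure the paper actually uses. The names `mwu_counts`, `cosine`, `similarity_matrix`, `MWUS`, and `DOCS` are hypothetical. The PCA-based feature reduction and the model-based clustering steps are not shown.

```python
import math
from collections import Counter

def mwu_counts(text, mwus):
    """Count occurrences of each given multiword unit in a lowercased text."""
    lowered = text.lower()
    counts = Counter()
    for mwu in mwus:
        n = lowered.count(mwu)  # plain substring count (an assumption)
        if n > 0:
            counts[mwu] = n
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no known MWU is similar to nothing
    return dot / (norm_a * norm_b)

def similarity_matrix(docs, mwus):
    """Build the symmetric document-by-document similarity matrix."""
    vectors = [mwu_counts(d, mwus) for d in docs]
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Hypothetical toy data: hand-picked MWUs and three short documents.
MWUS = ["common agricultural policy", "member states", "human rights"]
DOCS = [
    "The common agricultural policy applies to all member states.",
    "Member states reviewed the common agricultural policy.",
    "The report concerns human rights.",
]
SIM = similarity_matrix(DOCS, MWUS)
```

In this toy data the first two documents share both of their MWUs and the third shares none with them, so a clustering step applied to `SIM` would be expected to separate the third document from the other two.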

Identifier

Pliska Studia Mathematica Bulgarica, Vol. 16, No. 1 (2004), pp. 207-228

0204-9805

http://hdl.handle.net/10525/2322

Language(s)

en

Publisher

Institute of Mathematics and Informatics, Bulgarian Academy of Sciences

Keywords #Cluster Analysis #Applied Statistics #Document Clustering #Text Mining #Topics Extraction
Type

Article