A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters
Date(s) |
23/02/2014
23/02/2014
2004
|
---|---|
Abstract |
2000 Mathematics Subject Classification: 62H30. This paper describes a statistics-based methodology for unsupervised document clustering and cluster topic extraction. For this purpose, multiword lexical units (MWUs) of any length are automatically extracted from corpora using LiPXtractor, a language-independent, statistics-based tool. The MWUs are taken as base features to describe documents. These features are transformed and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. Then, using the Model Based Clustering Analysis software, the best number of clusters can be obtained. Precision and recall for document-cluster assignment are above 90%. The most important MWUs are extracted from each cluster and taken as that cluster's topics. Results on the classification of new documents are only briefly mentioned. |
Identifier |
Pliska Studia Mathematica Bulgarica, Vol. 16, No. 1 (2004), pp. 207-228; ISSN 0204-9805 |
Language(s) |
en |
Publisher |
Institute of Mathematics and Informatics, Bulgarian Academy of Sciences |
Keywords | #Cluster Analysis #Applied Statistics #Document Clustering #Text Mining #Topics Extraction |
Type |
Article |
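
The abstract mentions two steps that a small example can make concrete: building a document similarity matrix from MWU features, and scoring document-cluster assignments with precision and recall. The sketch below is illustrative only and is not the paper's implementation. The abstract does not say how the features are transformed or how clusters are matched to reference classes, so cosine similarity over raw MWU counts and a majority-label mapping are assumptions made here, and the toy documents, labels, and function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse MWU-count vectors (dict-like)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(docs):
    """Pairwise document similarity matrix as a list of lists."""
    return [[cosine(a, b) for b in docs] for a in docs]


def precision_recall(assigned, gold):
    """Per cluster: (majority gold label, precision, recall).

    Assumption: each cluster is matched to its most frequent gold label.
    """
    gold_sizes = Counter(gold)
    scores = {}
    for cluster in set(assigned):
        labels = Counter(g for c, g in zip(assigned, gold) if c == cluster)
        label, hits = labels.most_common(1)[0]
        precision = hits / sum(labels.values())  # share of the cluster that is correct
        recall = hits / gold_sizes[label]        # share of the class that was recovered
        scores[cluster] = (label, precision, recall)
    return scores


# Hypothetical toy data: MWU counts per document (not taken from the paper).
docs = [
    Counter({"common agricultural policy": 3, "member states": 1}),
    Counter({"common agricultural policy": 2, "milk quotas": 1}),
    Counter({"interest rates": 4, "member states": 1}),
]
sim = similarity_matrix(docs)

# Hypothetical cluster assignments and reference labels for five documents.
scores = precision_recall(
    assigned=[0, 0, 1, 1, 1],
    gold=["agri", "agri", "econ", "econ", "agri"],
)
```

In this toy run the first two documents share an MWU and come out as similar, while the second and third share none and get a similarity of zero. The feature reduction and model-based clustering steps named in the abstract are not reproduced here.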