Digital Library

**Author(s):** Silva, Joaquim; Mexia, Joao; Coelho, Carlos A.; Lopes, Gabriel
Date(s)	23/02/2014 23/02/2014 2004
Abstract	2000 Mathematics Subject Classification: 62H30. This paper describes a statistics-based methodology for unsupervised document clustering and cluster topic extraction. For this purpose, multiword lexical units (MWUs) of any length are automatically extracted from corpora using the LiPXtractor extractor, a language-independent, statistics-based tool. The MWUs are taken as base features to describe documents. These features are transformed and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. Then, using the Model Based Clustering Analysis software, the best number of clusters is obtained. Precision and Recall for document-cluster assignment are above 90%. The most important MWUs are extracted from each cluster and taken as the cluster's topics. Results on the classification of new documents are only briefly mentioned.
Identifier	Pliska Studia Mathematica Bulgarica, Vol. 16, No. 1 (2004), pp. 207-228; ISSN 0204-9805; http://hdl.handle.net/10525/2322
Language(s)	en
Publisher	Institute of Mathematics and Informatics, Bulgarian Academy of Sciences
Keywords	#Cluster Analysis #Applied Statistics #Document Clustering #Text Mining #Topics Extraction
Type	Article
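
The abstract reports Precision and Recall for document-cluster assignment without saying how they are computed. The sketch below shows one common way to obtain such figures; it is not taken from the paper. It assumes that each cluster is matched to its majority reference class, and the function name `cluster_precision_recall` and the toy document labels are hypothetical.

```python
from collections import Counter


def cluster_precision_recall(assigned, reference):
    """Per-cluster precision and recall.

    assigned:  dict mapping document id -> cluster label
    reference: dict mapping document id -> true class label
    Assumption (not from the paper): each cluster is matched to the
    reference class that holds the majority of its documents.
    """
    class_sizes = Counter(reference.values())

    # Collect the true class of every document in each cluster.
    clusters = {}
    for doc, cluster in assigned.items():
        clusters.setdefault(cluster, []).append(reference[doc])

    scores = {}
    for cluster, labels in clusters.items():
        majority, hits = Counter(labels).most_common(1)[0]
        scores[cluster] = {
            "class": majority,
            # Share of the cluster that belongs to the matched class.
            "precision": hits / len(labels),
            # Share of the matched class that landed in this cluster.
            "recall": hits / class_sizes[majority],
        }
    return scores


# Hypothetical toy data: five documents, two clusters.
assigned = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1}
reference = {"d1": "sports", "d2": "sports", "d3": "law",
             "d4": "law", "d5": "law"}
scores = cluster_precision_recall(assigned, reference)
```

In this toy case cluster 0 is matched to "sports" (two of its three documents) and cluster 1 to "law" (both of its documents, out of three "law" documents overall).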
