Text classification using compression-based dissimilarity measures


Author(s): Coutinho, David Pereira; Figueiredo, Mário A. T.
Date(s)

03/05/2016

03/05/2016

01/08/2015

Abstract

Arguably, the most difficult task in text classification is choosing an appropriate set of features that allows machine learning algorithms to classify accurately. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g., nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, shows that they come close to, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler in the sense that they do not require any text pre-processing or feature engineering.
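
The pipeline described in the abstract (map each text to a vector of dissimilarities, then apply a classical classifier) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a naive cross-parsing phrase count in the spirit of the Ziv-Merhav method named in the keywords, uses the simplified quantity "phrases per symbol" rather than the full relative-entropy estimator, takes the training texts themselves as prototypes, and classifies with 1-nearest-neighbor. All function names and the toy data are hypothetical.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained by parsing z into longest substrings found in x.

    Naive search for clarity; a symbol of z that never occurs in x
    forms a phrase of length one.
    """
    count, i, n = 0, 0, len(z)
    while i < n:
        length = 1
        # Grow the candidate phrase while it still occurs somewhere in x.
        while i + length <= n and z[i:i + length] in x:
            length += 1
        # length - 1 is the longest match; always advance by at least one symbol.
        i += max(length - 1, 1)
        count += 1
    return count


def dissimilarity(z: str, x: str) -> float:
    """Simplified dissimilarity (an assumption): cross-parsing phrases per symbol of z."""
    if not z:
        return 0.0
    return cross_parse_count(z, x) / len(z)


def represent(text: str, prototypes: list[str]) -> list[float]:
    """Dissimilarity-based representation: one coordinate per prototype text."""
    return [dissimilarity(text, p) for p in prototypes]


def classify_1nn(text: str, train_texts: list[str], train_labels: list[str]) -> str:
    """1-nearest-neighbor in the dissimilarity space, training texts as prototypes."""
    query = represent(text, train_texts)
    train_vecs = [represent(t, train_texts) for t in train_texts]
    best = min(range(len(train_texts)),
               key=lambda i: math.dist(query, train_vecs[i]))
    return train_labels[best]


# Hypothetical toy data, for illustration only.
TEXTS = [
    "the movie was great, i loved it",
    "a wonderful and great film",
    "the movie was terrible, i hated it",
    "an awful and boring film",
]
LABELS = ["pos", "pos", "neg", "neg"]
```

Calling `classify_1nn("what a great film", TEXTS, LABELS)` returns the label of the nearest training text in the dissimilarity space. Note that no tokenization, stemming, or feature selection is involved: the raw character strings are the only input, which is the sense in which such methods avoid feature engineering.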

Identifier

COUTINHO, David Pereira; FIGUEIREDO, Mário A. T. - Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence. ISSN 0218-0014. Vol. 23. 2015

0218-0014

1793-6381

http://hdl.handle.net/10400.21/6144

10.1142/S0218001415530043

Language(s)

eng

Publisher

World Scientific Publishing Co. Pte. Ltd.

Relation

1553004

Rights

closedAccess

Keywords

#Text classification #Text similarity measures #Relative entropy #Ziv-Merhav method #Cross-parsing algorithm

Type

article