Digital Library

Text classification using compression-based dissimilarity measures

**Author(s):** Coutinho, David Pereira; Figueiredo, Mário A. T.
Date(s)	03/05/2016 03/05/2016 01/08/2015
Abstract	Arguably, the most difficult task in text classification is choosing an appropriate set of features that allows machine learning algorithms to classify accurately. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, shows that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.
Identifier	COUTINHO, David Pereira; FIGUEIREDO, Mário A. T. - Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence. ISSN 0218-0014. Vol. 23. 2015 0218-0014 1793-6381 http://hdl.handle.net/10400.21/6144 10.1142/S0218001415530043
Language(s)	eng
Publisher	World Scientific Publications CO PTE LTD
Relation	1553004
Rights	closedAccess
Keywords	#Text classification #Text similarity measures #Relative entropy #Ziv-Merhav method #Cross-parsing algorithm
Type	article
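
The dissimilarity-based representation described in the abstract can be illustrated with a minimal sketch. It assumes a simplified cross-parsing phrase count, in the spirit of the Ziv-Merhav method listed in the keywords, as the dissimilarity; it uses the training texts themselves as prototypes and classifies with a 1-nearest-neighbor rule. All function names are hypothetical, and the paper's actual estimator, prototype selection, and classifier settings may differ.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained by greedily parsing z into the longest
    substrings that also occur in x (simplified cross-parsing; an
    unmatched symbol forms a phrase of length 1)."""
    count, i = 0, 0
    while i < len(z):
        length = 0
        # Extend the current phrase while it still occurs somewhere in x.
        while i + length < len(z) and z[i:i + length + 1] in x:
            length += 1
        i += max(length, 1)
        count += 1
    return count


def dissimilarity(z: str, x: str) -> float:
    """Phrases per symbol: small when z is well 'explained' by x.
    This normalization is an assumption, not the paper's estimator."""
    return cross_parse_count(z, x) / len(z) if z else 0.0


def embed(text: str, prototypes: list[str]) -> list[float]:
    """Map a text to its vector of dissimilarities to each prototype."""
    return [dissimilarity(text, p) for p in prototypes]


def classify(query: str, train_texts: list[str], train_labels: list[str]) -> str:
    """1-nearest-neighbor in the dissimilarity space (Euclidean distance)."""
    train_vecs = [embed(t, train_texts) for t in train_texts]
    query_vec = embed(query, train_texts)
    best = min(range(len(train_vecs)),
               key=lambda k: math.dist(query_vec, train_vecs[k]))
    return train_labels[best]
```

No tokenization, stemming, or feature design appears anywhere in the sketch: raw strings go in, and any classical classifier could replace the nearest-neighbor step once the texts are embedded. The substring search used here is quadratic in the worst case, so it is only suitable for short illustrative inputs.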
