Using Kullback-Leibler Distance for Text Categorization


Author(s): Bigi, Brigitte
Contributor(s)

ADELE (LIG Laboratoire d'Informatique de Grenoble) ; Université Pierre Mendès France - Grenoble 2 (UPMF) - Université Joseph Fourier - Grenoble 1 (UJF) - Institut National Polytechnique de Grenoble (INPG) - Centre National de la Recherche Scientifique (CNRS)

Date(s)

2003

Abstract

International audience

A system that performs text categorization aims to assign appropriate categories from a predefined classification scheme to incoming documents. These assignments can be used for various purposes, such as filtering or retrieval. This paper introduces a new, effective model for text categorization on large corpora (roughly one million documents). Text categorization is performed using the Kullback-Leibler distance between the probability distribution of the document to classify and the probability distribution of each category. Using the same representation of categories, experiments show a significant improvement when this method is used: the KL-based method achieves substantial improvements over the tf-idf method.
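The classification rule described in the abstract can be sketched as follows: estimate a unigram probability distribution for each category from its training documents, estimate one for the incoming document, and assign the category whose distribution is closest under the KL divergence. This is a minimal illustration, not the paper's implementation; in particular, the additive (epsilon) smoothing used here to avoid zero probabilities and the asymmetric direction D(document || category) are assumptions, as the abstract does not specify the paper's smoothing or weighting scheme.

```python
import math
from collections import Counter

def distribution(tokens, vocab, epsilon=1e-6):
    """Smoothed unigram distribution over a fixed vocabulary.
    Additive smoothing is an assumption, not the paper's scheme."""
    counts = Counter(tokens)
    total = len(tokens) + epsilon * len(vocab)
    return {w: (counts[w] + epsilon) / total for w in vocab}

def kl_divergence(p, q):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

def categorize(doc_tokens, category_docs):
    """Assign the category whose distribution minimizes the KL
    divergence from the document's distribution."""
    # Build a shared vocabulary so all distributions are comparable.
    vocab = set(doc_tokens)
    for docs in category_docs.values():
        for d in docs:
            vocab.update(d)
    p_doc = distribution(doc_tokens, vocab)
    best, best_kl = None, float("inf")
    for cat, docs in category_docs.items():
        cat_tokens = [t for d in docs for t in d]
        q_cat = distribution(cat_tokens, vocab)
        kl = kl_divergence(p_doc, q_cat)
        if kl < best_kl:
            best, best_kl = cat, kl
    return best
```

For example, a document sharing most of its vocabulary with the "sports" training documents is assigned to that category, since its distribution diverges least from the sports category model.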

Identifier

hal-01392500

https://hal.archives-ouvertes.fr/hal-01392500

DOI : 10.1007/3-540-36618-0_22

Language(s)

en

Publisher

HAL CCSD

Springer Berlin Heidelberg

Relation

info:eu-repo/semantics/altIdentifier/doi/10.1007/3-540-36618-0_22

Source

Advances in Information Retrieval


Advances in Information Retrieval, vol. 2633, Springer Berlin Heidelberg, pp. 305-319, 2003. DOI: 10.1007/3-540-36618-0_22

Keywords #Text categorization #Kullback-Leibler Divergence #[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] #[SHS.INFO] Humanities and Social Sciences/Library and information sciences
Type

info:eu-repo/semantics/bookPart

Book section