Text segmentation using a cache memory


Author(s): Bigi, Brigitte; De Mori, Renato
Contributor(s)

Laboratoire Informatique d'Avignon (LIA) ; Université d'Avignon et des Pays de Vaucluse (UAPV) - Centre d'Enseignement et de Recherche en Informatique - CERI

Date(s)

2002

Abstract

International audience

This paper describes the application of an information-theoretic approach to document segmentation. Several segmentation criteria are proposed, based either on topic shift detection or on blindly comparing the contents of cache memories in which keywords are temporarily stored as a document is analyzed. Experiments with a large corpus of articles from the French newspaper Le Monde show tangible advantages when different models are combined with a suitable strategy. Experimental results show that different strategies for topic shift detection have to be used depending on whether high recall or high precision is sought. Furthermore, methods based on topic-independent distributions provide candidates complementary to those obtained with topic-dependent distributions, leading to an increase in recall with a minor loss in precision.

Identifier

hal-01392346

https://hal.archives-ouvertes.fr/hal-01392346

Language(s)

en

Publisher

HAL CCSD

ACTA Press

Source

ISSN: 1480-1752

Control and Intelligent Systems

Control and Intelligent Systems, ACTA Press, 2002, 30 (3), pp.93-100

Keywords

#Topic segmentation #[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL] #[SHS.INFO] Humanities and Social Sciences/Library and information sciences

Type

info:eu-repo/semantics/article

Journal articles