Effectiveness of document representation for classification


Author(s): Chen, D.; Li, X.; Dong, Z. Y.; Chen, X.
Contributor(s)

Tjoa, A. M.

Trujillo, J.

Date(s)

01/01/2005

Abstract

Conventionally, document classification research focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observations, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features that are used in a representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By examining documents in this way, we can clearly identify the contributions made by different features toward document classification. Experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.

Identifier

http://espace.library.uq.edu.au/view/UQ:103088

Language(s)

eng

Publisher

Springer

Keywords #E1 #280103 Information Storage, Retrieval and Management #700103 Information processing services

Type

Conference Paper