Enhanced n-gram extraction using relevance feature discovery


Author(s): Albathan, Mubarak; Li, Yuefeng; Algarni, Abdulmohsen
Contributor(s)

Cranefield, Stephen

Nayak, Abhaya

Date(s)

2013

Abstract

Guaranteeing the quality of extracted features that describe relevant knowledge to users or topics is a challenge because of the large number of extracted features. Most popular term-based feature selection methods extract noisy features, that is, features irrelevant to the user's needs. One popular alternative is to extract phrases or n-grams to describe the relevant knowledge. However, extracted n-grams and phrases usually contain a lot of noise as well. This paper proposes a method for reducing the noise in n-grams. The method first extracts more specific features (terms) to remove noisy features. It then uses an extended random set to accurately weight n-grams based on their distribution in the documents and the distribution of their terms within n-grams. The proposed approach not only reduces the number of extracted n-grams but also improves performance. Experimental results on the Reuters Corpus Volume 1 (RCV1) data collection and TREC topics show that the proposed method significantly outperforms state-of-the-art methods underpinned by Okapi BM25, tf*idf and Rocchio.
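
The two-step idea in the abstract (select specific terms first, then weight the n-grams that survive) can be illustrated with a minimal Python sketch. It is not the paper's method: the extended random set weighting is not specified in this record, so the score used here (document frequency times the mean weight of the n-gram's terms) is a hypothetical stand-in, and the names `ngrams`, `score_ngrams` and `term_weights` are assumptions made for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score_ngrams(docs, term_weights, n=2):
    """Score n-grams that consist only of previously selected terms.

    Hypothetical scoring, NOT the paper's extended random set:
    (number of documents containing the n-gram) * (mean weight of its terms).
    """
    doc_freq = Counter()
    for tokens in docs:
        # set() so that each document counts an n-gram at most once
        for gram in set(ngrams(tokens, n)):
            # Step 1: discard n-grams containing any unselected (noisy) term
            if all(term in term_weights for term in gram):
                doc_freq[gram] += 1
    # Step 2: combine the n-gram's document distribution with its term weights
    return {
        gram: df * sum(term_weights[term] for term in gram) / n
        for gram, df in doc_freq.items()
    }


# Toy data: two tokenised documents and assumed weights for the selected terms
docs = [
    "oil price rise hits market".split(),
    "oil price fall calms market".split(),
]
term_weights = {"oil": 0.9, "price": 0.8, "market": 0.5}
scores = score_ngrams(docs, term_weights)
```

With this toy input only the bigram `("oil", "price")` survives the filter, because every other bigram contains at least one term that was not selected. That is the sense in which term selection reduces the number of extracted n-grams.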

Identifier

http://eprints.qut.edu.au/67089/

Publisher

Springer

Relation

http://link.springer.com/chapter/10.1007%2F978-3-319-03680-9_46

DOI:10.1007/978-3-319-03680-9_46

Albathan, Mubarak, Li, Yuefeng, & Algarni, Abdulmohsen (2013) Enhanced n-gram extraction using relevance feature discovery. In Cranefield, Stephen & Nayak, Abhaya (Eds.) Proceedings of the 26th Australasian Joint Conference : AI2013 Advances in Artificial Intelligence, Springer, Dunedin, New Zealand, pp. 453-465.

Rights

Copyright 2013 Springer International Publishing Switzerland

Source

School of Electrical Engineering & Computer Science; Science & Engineering Faculty

Keywords #Feature selection #N-gram #Terms weight #Relevance feedback
Type

Conference Paper