Compositional data analysis (CoDA) approaches to distance in information retrieval


Author(s): Thomas, P.; Lovell, D. R.
Date(s)

2014

Abstract

Many techniques in information retrieval produce counts from a sample, and it is common to analyse these counts as proportions of the whole; term frequencies are a familiar example. Proportions carry only relative information and are not free to vary independently of one another: for the proportion of one term to increase, one or more others must decrease. These constraints are hallmarks of compositional data. While there has long been discussion in other fields of how such data should be analysed, to our knowledge, Compositional Data Analysis (CoDA) has not been considered in IR. In this work we explore compositional data in IR through the lens of distance measures, and demonstrate that common measures, naïve to compositions, have some undesirable properties which can be avoided with composition-aware measures. As a practical example, these measures are shown to improve clustering. Copyright 2014 ACM.
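
To make the contrast in the abstract concrete, the following is a minimal sketch of one composition-aware measure, Aitchison's distance (named in the keywords of this record), next to a plain Euclidean distance on proportions. It is an illustration only, not the authors' implementation: the function names (closure, clr, aitchison_distance, euclidean_distance) are hypothetical, and the sketch assumes every count is strictly positive, since the logarithm is undefined at zero and zero handling is a separate problem not covered here.

```python
import math


def closure(counts):
    """Rescale positive counts so they sum to 1 (a composition)."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(composition):
    """Centred log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of x and y.

    Assumption: all entries of x and y are strictly positive.
    """
    cx, cy = clr(closure(x)), clr(closure(y))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


def euclidean_distance(x, y):
    """Composition-naive baseline: Euclidean distance on raw proportions."""
    px, py = closure(x), closure(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(px, py)))


# Hypothetical term counts for two documents over the same three terms.
doc_a = [1, 2, 3]
doc_b = [4, 5, 6]
print(aitchison_distance(doc_a, doc_b))
print(euclidean_distance(doc_a, doc_b))
```

Because the clr transform works on ratios between parts, the Aitchison distance depends only on relative information: multiplying all counts of a document by a constant leaves it unchanged.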

Identifier

http://eprints.qut.edu.au/79872/

Publisher

Association for Computing Machinery

Relation

DOI:10.1145/2600428.2609492

Thomas, P. & Lovell, D. R. (2014) Compositional data analysis (CoDA) approaches to distance in information retrieval. In SIGIR '14 Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, Association for Computing Machinery, Gold Coast, Qld., pp. 991-994.

Rights

ACM

Source

School of Electrical Engineering & Computer Science; Science & Engineering Faculty

Keywords

#Aitchison's distance #Compositions #Distance #Ratio #Similarity #Chemical analysis #Compositional data #Compositional data analysis #Relative information #Through the lens #Information retrieval

Type

Conference Paper