Compositional data analysis (CoDA) approaches to distance in information retrieval
Date(s) | 2014 |
---|---|
Abstract | Many techniques in information retrieval produce counts from a sample, and it is common to analyse these counts as proportions of the whole; term frequencies are a familiar example. Proportions carry only relative information and are not free to vary independently of one another: for the proportion of one term to increase, one or more others must decrease. These constraints are hallmarks of compositional data. While there has long been discussion in other fields of how such data should be analysed, to our knowledge, Compositional Data Analysis (CoDA) has not been considered in IR. In this work we explore compositional data in IR through the lens of distance measures, and demonstrate that common measures, naïve to compositions, have some undesirable properties which can be avoided with composition-aware measures. As a practical example, these measures are shown to improve clustering. Copyright 2014 ACM. |
Identifier | |
Publisher | Association for Computing Machinery |
Relation | DOI:10.1145/2600428.2609492 Thomas, P. & Lovell, D. R. (2014) Compositional data analysis (CoDA) approaches to distance in information retrieval. In SIGIR '14 Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, Association for Computing Machinery, Gold Coast, Qld., pp. 991-994. |
Rights | ACM |
Source | School of Electrical Engineering & Computer Science; Science & Engineering Faculty |
Keywords | #Aitchison's distance #Compositions #Distance #Ratio #Similarity #Chemical analysis #Compositional data #Compositional data analysis #Relative information #Through the lens #Information retrieval |
Type | Conference Paper |
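
The abstract contrasts common distance measures with composition-aware ones, and the keywords name Aitchison's distance. As an illustration only, the following is a minimal sketch of that measure using its usual textbook definition (Euclidean distance between centred log-ratio transforms). It is not taken from the paper: the function names and the toy term counts are hypothetical, and the sketch assumes every part is strictly positive, so zero counts would need some replacement strategy first.

```python
import math

def clr(parts):
    """Centred log-ratio transform; assumes all parts are strictly positive."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of x and y."""
    cx, cy = clr(x), clr(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))

def euclidean_distance(x, y):
    """Plain Euclidean distance on the raw values, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical term counts for two documents over a three-term vocabulary.
doc_a = [10, 20, 70]
doc_b = [30, 30, 40]
# The same relative composition as doc_a, but from a document five times longer.
doc_a_long = [c * 5 for c in doc_a]

d_short = aitchison_distance(doc_a, doc_b)
d_long = aitchison_distance(doc_a_long, doc_b)
e_short = euclidean_distance(doc_a, doc_b)
e_long = euclidean_distance(doc_a_long, doc_b)
```

Because the clr transform depends only on ratios between parts, `d_short` and `d_long` agree up to floating-point error, whereas the raw-count Euclidean distances `e_short` and `e_long` differ with document length. This is one way to see what "carrying only relative information" means for a distance.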