The Learning Component of Dynamic Allocation Indices


Author(s): Gittins, J.; Wang, Y-G.
Date(s)

1992

Abstract

For a multiarmed bandit problem with exponential discounting, the optimal allocation rule is defined by a dynamic allocation index, defined for each arm on its state space. The index for an arm equals the expected immediate reward from the arm, plus an upward adjustment reflecting any uncertainty about the prospects of obtaining rewards from the arm and the possibility of resolving those uncertainties by selecting that arm. The learning component of the index is therefore defined to be the difference between the index and the expected immediate reward. For two arms with the same expected immediate reward, the learning component should be larger for the arm whose reward rate is more uncertain. This is shown to be true for arms based on independent samples from a fixed distribution with an unknown parameter in the Bernoulli and normal cases, and similar results are obtained in other cases.
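The relationship described in the abstract can be illustrated numerically. Below is a minimal sketch, not taken from the paper: it approximates the Gittins index of a Bernoulli arm with a Beta posterior via a truncated dynamic programme in Whittle's retirement formulation, then reads off the learning component as the index minus the posterior mean. The discount factor, truncation horizon, and all function names are illustrative choices of this sketch.

```python
from functools import lru_cache

BETA = 0.9     # discount factor (illustrative choice)
HORIZON = 60   # truncation depth; BETA**HORIZON is negligible

@lru_cache(maxsize=None)
def excess_value(a, b, lam, depth=0):
    """Truncated value of continuing to pull a Bernoulli arm with a
    Beta(a, b) posterior, measured relative to retiring on a fixed
    payoff lam per period (Whittle's retirement formulation)."""
    if depth >= HORIZON:
        return 0.0
    p = a / (a + b)  # posterior mean success probability
    continue_value = (p - lam
                      + BETA * (p * excess_value(a + 1, b, lam, depth + 1)
                                + (1 - p) * excess_value(a, b + 1, lam, depth + 1)))
    return max(0.0, continue_value)  # retiring immediately is always available

def gittins_index(a, b, tol=1e-4):
    """Bisect for the retirement rate at which pulling and retiring
    are indifferent; that rate approximates the Gittins index."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess_value(a, b, mid) > 0.0:
            lo = mid   # continuing still beats retiring: index is higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Learning component = index minus expected immediate reward.
# Both arms below have posterior mean 0.5, but the Beta(1, 1) arm is far
# more uncertain, so its learning component should be larger.
wide = gittins_index(1, 1) - 0.5
narrow = gittins_index(10, 10) - 0.5
```

Running this with the parameters above gives a clearly positive learning component for both arms, with the wider (more uncertain) posterior receiving the larger adjustment, in line with the result the abstract states for the Bernoulli case.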

Identifier

http://eprints.qut.edu.au/90633/

Publisher

Institute of Mathematical Statistics

Relation

DOI:10.1214/aos/1176348788

Gittins, J. & Wang, Y-G. (1992) The Learning Component of Dynamic Allocation Indices. The Annals of Statistics, 20(3), pp. 1625-1636.

Rights

Copyright Institute of Mathematical Statistics

Source

Science & Engineering Faculty

Keywords #dynamic allocation index #gittins index #multiarmed bandit #target processes
Type

Journal Article