Article ID: iaor19911688
Country: Switzerland
Volume: 28
Start Page Number: 297
End Page Number: 312
Publication Date: Apr 1991
Journal: Annals of Operations Research
Authors: Yakowitz S., Lowe W.
A bandit is a finite collection of random variables. Bandit problems are Markov decision problems in which, at each decision time, the decision maker selects a random variable (referred to as a bandit ‘arm’) and observes an outcome. The selection is based on the observation history. The objective is to choose arms sequentially so as to minimize the growth rate (with decision time) of the number of suboptimal selections. The appellation ‘bandit’ refers to mechanical gambling machines, and the tradition stems from the question of allocating competing treatments to a sequence of patients having the same disease. The present motivation is ‘machine learning’, in which a game-playing or assembly-line-adjusting computer is faced with a sequence of statistically similar decision problems and, as a resource, has access to an expanding database relevant to these problems. The setting for the present study is nonparametric and infinite-horizon. The central aim is to present a methodology which postulates either finite moments or, alternatively, bounded bandit arms. Under these circumstances, the strategies proposed are shown to be asymptotically optimal and to converge at guaranteed rates. In the bounded-arm case, the rate is optimal. The authors extend the theory to the case in which the bandit population is infinite, and share some computational experience.
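The abstract stops short of the schemes themselves; as a rough illustration of the nonparametric setting it describes, the sketch below plays a finite bandit using only empirical arm means, pulling greedily except on a sparse forced-exploration schedule so that every arm keeps being observed while suboptimal pulls remain rare. The square-time forcing schedule and the Beta-distributed arms are illustrative assumptions, not the strategies or rates analyzed by the authors.

```python
import math
import random

def run_bandit(arms, horizon):
    """Play a finite bandit for `horizon` steps using sample means only
    (nonparametric: no distributional model of the arms is assumed).
    Greedy on the empirical means, with forced exploration at perfect-square
    times so each arm is sampled infinitely often.  Sketch only; the paper's
    own schedules are tuned to its guaranteed convergence rates."""
    k = len(arms)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1                          # initialize: sample each arm once
        elif math.isqrt(t) ** 2 == t:
            i = (math.isqrt(t) - 1) % k        # forced exploration (hypothetical schedule)
        else:
            i = max(range(k), key=lambda j: sums[j] / counts[j])  # greedy on sample means
        x = arms[i]()                          # observe an outcome of the chosen arm
        counts[i] += 1
        sums[i] += x
    return counts, [s / c for s, c in zip(sums, counts)]

# Example with two bounded arms; pulls should concentrate on arm 1 (mean 0.6).
arms = [lambda: random.betavariate(2, 3),      # mean 0.4
        lambda: random.betavariate(3, 2)]      # mean 0.6
counts, means = run_bandit(arms, horizon=10_000)
print(counts, [round(m, 3) for m in means])
```

With this square-time schedule the forced pulls grow like the square root of the horizon, which suffices for the sketch but is coarser than the logarithmic-order growth of suboptimal selections that optimal bounded-arm strategies achieve.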