Article ID: | iaor200968815 |
Country: | United States |
Volume: | 34 |
Issue: | 2 |
Start Page Number: | 287 |
End Page Number: | 302 |
Publication Date: | May 2009 |
Journal: | Mathematics of Operations Research |
Authors: | Krishnamurthy Vikram, Wahlberg Bo |
Keywords: | bandit problems |
This paper considers multiarmed bandit problems involving partially observed Markov decision processes (POMDPs). We show how the Gittins index for the optimal scheduling policy can be computed by a value iteration algorithm on each process separately, thereby considerably reducing the computational cost. A suboptimal value iteration algorithm based on Lovejoy's approximation is presented. We then show that when the transition probability matrices are totally positive of order 2 (TP2) and the observation probabilities are monotone likelihood ratio (MLR) ordered, the Gittins index is MLR increasing in the information state. Algorithms that exploit this structure are then presented.
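The two structural conditions named in the abstract are both checkable by elementary inequalities. The following is a minimal illustrative sketch (not taken from the paper, and not its algorithms): a matrix is TP2 when every 2x2 minor is nonnegative, and a probability vector p dominates q in the MLR order when the likelihood ratio p[i]/q[i] is nondecreasing in i. The function names `is_tp2` and `mlr_geq` are hypothetical.

```python
# Illustrative sketch of the structural conditions cited in the abstract:
# TP2 transition matrices and MLR ordering of probability vectors.
# Function names here are hypothetical, not from the paper.

def is_tp2(P, tol=1e-12):
    """True if every 2x2 minor of P is nonnegative, i.e.
    P[i1][j1]*P[i2][j2] >= P[i1][j2]*P[i2][j1] for i1 < i2, j1 < j2."""
    n, m = len(P), len(P[0])
    return all(
        P[i1][j1] * P[i2][j2] >= P[i1][j2] * P[i2][j1] - tol
        for i1 in range(n) for i2 in range(i1 + 1, n)
        for j1 in range(m) for j2 in range(j1 + 1, m)
    )

def mlr_geq(p, q, tol=1e-12):
    """True if p dominates q in the MLR order: the ratio p[i]/q[i]
    is nondecreasing, equivalently p[i]*q[j] <= p[j]*q[i] for i < j."""
    n = len(p)
    return all(
        p[i] * q[j] <= p[j] * q[i] + tol
        for i in range(n) for j in range(i + 1, n)
    )
```

For instance, a tridiagonal stochastic matrix such as [[0.7, 0.3, 0], [0.2, 0.5, 0.3], [0, 0.3, 0.7]] is TP2, whereas [[0.5, 0.5], [0.9, 0.1]] is not (its single 2x2 minor is negative); and [0.1, 0.3, 0.6] MLR-dominates [0.5, 0.3, 0.2] because the ratio 0.2, 1, 3 is increasing. The abstract's monotonicity result says that under these conditions the Gittins index respects this partial order on information states.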