| Field | Value |
| --- | --- |
| Article ID | iaor200968815 |
| Country | United States |
| Volume | 34 |
| Issue | 2 |
| Start Page Number | 287 |
| End Page Number | 302 |
| Publication Date | May 2009 |
| Journal | Mathematics of Operations Research |
| Authors | Vikram Krishnamurthy, Bo Wahlberg |
| Keywords | bandit problems |
This paper considers multiarmed bandit problems involving partially observed Markov decision processes (POMDPs). We show how the Gittins index for the optimal scheduling policy can be computed by a value iteration algorithm on each process, thereby considerably reducing the computational cost. A suboptimal value iteration algorithm based on Lovejoy's approximation is presented. We then show that for transition probability matrices that are totally positive of order 2 (TP2) and observation probabilities ordered by the monotone likelihood ratio (MLR), the Gittins index is MLR increasing in the information state. Algorithms that exploit this structure are then presented.
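To illustrate the two central ingredients of the abstract, the sketch below computes Gittins indices by value iteration and tests the TP2 property. It is not the paper's POMDP algorithm: as an assumption, it treats the simpler fully observed finite-state case, using the standard Katehakis–Veinott restart-in-state formulation, in which the Gittins index of state `i` equals `(1 - beta)` times the optimal value at `i` of an MDP that may either continue or restart in `i`. The TP2 check verifies that every 2x2 minor of the transition matrix is nonnegative.

```python
import numpy as np


def gittins_restart(P, r, beta=0.9, tol=1e-10, max_iter=10000):
    """Gittins indices of a finite-state Markov reward process.

    Uses the restart-in-state formulation solved by value iteration
    (a fully observed stand-in for the paper's per-process POMDP
    value iteration). P is the transition matrix, r the reward
    vector, beta the discount factor.
    """
    n = len(r)
    idx = np.zeros(n)
    for i in range(n):
        V = np.zeros(n)
        for _ in range(max_iter):
            cont = r + beta * P @ V      # value of continuing from each state
            restart = cont[i]            # value of restarting in state i
            V_new = np.maximum(cont, restart)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        idx[i] = (1 - beta) * V[i]       # Katehakis-Veinott identity
    return idx


def is_tp2(P):
    """Check total positivity of order 2: every 2x2 minor of P is >= 0."""
    n, m = P.shape
    return all(P[i, k] * P[j, l] >= P[i, l] * P[j, k]
               for i in range(n) for j in range(i + 1, n)
               for k in range(m) for l in range(k + 1, m))
```

As a sanity check, for an absorbing chain (`P` the identity) each arm pays its own reward forever, so the Gittins index of state `i` reduces to `r[i]` exactly.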