Article ID: | iaor200968815 |
Country: | United States |
Volume: | 34 |
Issue: | 2 |
Start Page Number: | 287 |
End Page Number: | 302 |
Publication Date: | May 2009 |
Journal: | Mathematics of Operations Research |
Authors: | Krishnamurthy Vikram, Wahlberg Bo |
Keywords: | bandit problems |
This paper considers multiarmed bandit problems involving partially observed Markov decision processes (POMDPs). We show how the Gittins index for the optimal scheduling policy can be computed by a value iteration algorithm on each process separately, thereby considerably reducing the computational cost. A suboptimal value iteration algorithm based on Lovejoy's approximation is presented. We then show that when the transition probability matrices are totally positive of order 2 (TP2) and the observation probabilities are monotone likelihood ratio (MLR) ordered, the Gittins index is MLR increasing in the information state. Algorithms that exploit this structure are then presented.
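The two structural conditions named in the abstract are both checkable by elementary inequalities. The following is a minimal illustrative sketch (not taken from the paper, and not its algorithms): a matrix is TP2 when every 2x2 minor is nonnegative, and a probability vector p dominates q in the MLR order when the likelihood ratio p[i]/q[i] is nondecreasing in i. The function names `is_tp2` and `mlr_geq` are hypothetical.

```python
# Illustrative sketch of the structural conditions cited in the abstract:
# TP2 transition matrices and MLR ordering of probability vectors.
# Function names here are hypothetical, not from the paper.

def is_tp2(P, tol=1e-12):
    """True if every 2x2 minor of P is nonnegative, i.e.
    P[i1][j1]*P[i2][j2] >= P[i1][j2]*P[i2][j1] for i1 < i2, j1 < j2."""
    n, m = len(P), len(P[0])
    return all(
        P[i1][j1] * P[i2][j2] >= P[i1][j2] * P[i2][j1] - tol
        for i1 in range(n) for i2 in range(i1 + 1, n)
        for j1 in range(m) for j2 in range(j1 + 1, m)
    )

def mlr_geq(p, q, tol=1e-12):
    """True if p dominates q in the MLR order: the ratio p[i]/q[i]
    is nondecreasing, equivalently p[i]*q[j] <= p[j]*q[i] for i < j."""
    n = len(p)
    return all(
        p[i] * q[j] <= p[j] * q[i] + tol
        for i in range(n) for j in range(i + 1, n)
    )
```

For instance, a tridiagonal stochastic matrix such as [[0.7, 0.3, 0], [0.2, 0.5, 0.3], [0, 0.3, 0.7]] is TP2, whereas [[0.5, 0.5], [0.9, 0.1]] is not (its single 2x2 minor is negative); and [0.1, 0.3, 0.6] MLR-dominates [0.5, 0.3, 0.2] because the ratio 0.2, 1, 3 is increasing. The abstract's monotonicity result says that under these conditions the Gittins index respects this partial order on information states.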