Sample mean based index policies with O(log n) regret for the multi-armed bandit problem

Article ID: iaor1997407
Country: United Kingdom
Volume: 27
Issue: 4
Start Page Number: 1054
End Page Number: 1078
Publication Date: Dec 1995
Journal: Advances in Applied Probability
Authors: Agrawal, R.
Keywords: control processes
Abstract:

The paper considers a non-Bayesian infinite-horizon version of the multi-armed bandit problem, with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on this problem, Lai and Robbins obtained an O(log n) lower bound on the regret, with a constant that depends on the Kullback-Leibler number. They also constructed policies for some specific families of probability distributions (including exponential families) that achieve the lower bound. This paper constructs index policies that depend on the rewards from each arm only through their sample mean. These policies are computationally much simpler and are applicable much more generally. They achieve O(log n) regret with a constant that is also based on the Kullback-Leibler number. This constant turns out to be optimal for one-parameter exponential families; in general, however, it is derived from the optimal one via a ‘contraction’ principle. The present results rely entirely on a few key lemmas from the theory of large deviations.
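The abstract does not reproduce the index functions themselves, but the core idea (play the arm maximizing its sample mean plus a padding term that grows roughly like the square root of log n over the number of plays of that arm) can be sketched in a few lines. Below is a minimal Python illustration in this spirit, assuming rewards bounded in [0, 1] and using the familiar sqrt(2 log n / T_j) padding term; the paper's actual index functions are more general, and the function name and the two-armed Bernoulli example here are hypothetical, not taken from the paper.

```python
import math
import random

def sample_mean_index_policy(arms, horizon):
    """Sketch of a sample-mean-based index policy (UCB style).

    arms: list of callables, each returning a reward in [0, 1].
    horizon: total number of plays n.
    Returns the sequence of chosen arm indices.
    """
    k = len(arms)
    counts = [0] * k      # T_j: number of times arm j has been played
    means = [0.0] * k     # sample mean of rewards from arm j
    choices = []

    for n in range(1, horizon + 1):
        if n <= k:
            j = n - 1     # play each arm once to initialize the indices
        else:
            # Index = sample mean + padding ~ sqrt(log n / T_j); the padding
            # shrinks as an arm is sampled, so suboptimal arms are played
            # only O(log n) times.
            j = max(range(k),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(n) / counts[i]))
        r = arms[j]()
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]   # incremental sample-mean update
        choices.append(j)
    return choices

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical example: two Bernoulli arms with means 0.5 and 0.6.
    bandit = [lambda p=p: float(random.random() < p) for p in (0.5, 0.6)]
    plays = sample_mean_index_policy(bandit, horizon=10_000)
    print("fraction of plays on the better arm:", plays.count(1) / len(plays))
```

Note that the policy touches each arm's reward history only through its sample mean and play count, which is precisely the computational simplification the paper emphasizes over the Lai-Robbins constructions.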
