Article ID: iaor20073850
Country: United States
Volume: 53
Issue: 1
Start Page Number: 126
End Page Number: 139
Publication Date: Jan 2005
Journal: Operations Research
Authors: Fu Michael C., Marcus Steven I., Chang Hyeong Soo, Hu Jiaqiao
Keywords: programming: dynamic, control processes
Based on recent results for multiarmed bandit problems, we propose an adaptive sampling algorithm that approximates the optimal value of a finite-horizon Markov decision process (MDP) with finite state and action spaces. The algorithm adaptively chooses which action to sample as the sampling process proceeds and generates an asymptotically unbiased estimator, whose bias is bounded by a quantity that converges to zero at rate (ln N)/N, where N is the total number of samples used per sampled state in each stage.
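
The abstract describes the estimator only at a high level. Below is a minimal, hypothetical Python sketch of a bandit-based adaptive multistage sampling estimator of the kind the abstract describes: a per-state sampling budget is allocated across actions with a UCB-style index, and each sampled transition is evaluated by recursing one stage deeper. The names (`ams_estimate`, `sample_next_state`, `reward`), the terminal value of zero, and the specific confidence bonus are illustrative assumptions, not the authors' exact specification.

```python
import math
import random

def ams_estimate(state, stage, horizon, actions, sample_next_state, reward, n_samples):
    """Sketch of an adaptive multistage sampling value estimate (hypothetical interface).

    Estimates the optimal value of `state` at `stage` for a finite-horizon MDP by
    spreading the per-state budget `n_samples` (assumed >= len(actions)) over the
    actions with a UCB-style rule, recursing one stage deeper per sampled transition.
    """
    if stage == horizon:
        return 0.0  # terminal value assumed zero in this sketch

    counts = {a: 0 for a in actions}    # times each action has been sampled
    totals = {a: 0.0 for a in actions}  # accumulated sampled Q-value estimates

    def sample_action(a):
        # Sample one transition under action a and evaluate it recursively.
        next_s = sample_next_state(state, a)
        q = reward(state, a, next_s) + ams_estimate(
            next_s, stage + 1, horizon, actions, sample_next_state, reward, n_samples)
        counts[a] += 1
        totals[a] += q

    # Initialization: sample every action once so each UCB index is defined.
    for a in actions:
        sample_action(a)

    # Adaptive phase: spend the remaining budget on the action with the largest
    # upper confidence bound; the exploration bonus shrinks as an action's count grows.
    for i in range(len(actions), n_samples):
        def ucb(a):
            return totals[a] / counts[a] + math.sqrt(2.0 * math.log(i) / counts[a])
        sample_action(max(actions, key=ucb))

    # Value estimate: count-weighted average of the per-action sample means,
    # which equals the sum of all sampled returns divided by the budget.
    return sum(totals[a] for a in actions) / n_samples

# Illustrative usage on a toy two-state, two-action MDP (not from the paper):
if __name__ == "__main__":
    acts = [0, 1]
    def sample_next_state(s, a):
        return random.choice([0, 1]) if a == 0 else s
    def reward(s, a, s2):
        return 1.0 if (a == 1 and s == 1) else 0.5
    print(ams_estimate(state=1, stage=0, horizon=3, actions=acts,
                       sample_next_state=sample_next_state, reward=reward,
                       n_samples=8))
```

Because every sampled transition triggers a full recursive estimate one stage deeper, the running time of such a sketch grows with the budget raised to the horizon length and is independent of the size of the state space, which is the trade-off that motivates this style of sampling-based approximation.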