Article ID: | iaor2009616 |
Country: | Netherlands |
Volume: | 178 |
Issue: | 3 |
Start Page Number: | 808 |
End Page Number: | 818 |
Publication Date: | May 2007 |
Journal: | European Journal of Operational Research |
Authors: | Singh Sumeetpal S., Tadić Vladislav B., Doucet Arnaud |
Keywords: | stochastic processes |
Solving a semi-Markov decision process (SMDP) using value or policy iteration requires precise knowledge of the probabilistic model and suffers from the curse of dimensionality. To overcome these limitations, we present a reinforcement learning approach in which the SMDP performance criterion is optimised with respect to a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it using stochastic approximation.
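
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes a hypothetical single-state SMDP with two actions, exponentially distributed sojourn times, a logistic policy with one parameter `theta`, and the long-run average reward per unit time as the performance criterion. All names and numerical values (`ACTIONS`, `run_policy_gradient`, the rewards, and the mean sojourn times) are invented for illustration. Each decision yields a score-function gradient estimate that is used immediately in a stochastic approximation step, while the reward rate `rho` is estimated online from the same trajectory.

```python
import math
import random

# Hypothetical toy model (an assumption, not taken from the paper):
# action -> (reward per decision, mean sojourn time).
# Action 1 pays more per decision but takes longer, so its reward
# rate (1.5 / 3.0 = 0.5) is lower than that of action 0 (1.0 / 1.0).
ACTIONS = {0: (1.0, 1.0), 1: (1.5, 3.0)}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_policy_gradient(num_decisions=20000, step=0.05, seed=0):
    """Online score-function policy gradient for a single-state SMDP.

    The policy picks action 1 with probability sigmoid(theta).
    Returns the final theta and the running reward-rate estimate rho.
    """
    rng = random.Random(seed)
    theta = 0.0
    total_reward = 0.0
    total_time = 0.0
    rho = 0.0  # running estimate of average reward per unit time

    for _ in range(num_decisions):
        p1 = sigmoid(theta)
        action = 1 if rng.random() < p1 else 0
        reward, mean_tau = ACTIONS[action]
        tau = rng.expovariate(1.0 / mean_tau)  # random sojourn time

        # d/dtheta of log pi(action) for a Bernoulli-logistic policy.
        score = action - p1

        # Stochastic approximation step: score * (r - rho * tau) is a
        # noisy estimate of the reward-rate gradient, up to a positive
        # factor, using the rho estimated from earlier decisions.
        theta += step * score * (reward - rho * tau)

        # Update the reward-rate estimate from the same trajectory.
        total_reward += reward
        total_time += tau
        rho = total_reward / total_time

    return theta, rho


if __name__ == "__main__":
    theta, rho = run_policy_gradient()
    print("P(action 1) =", sigmoid(theta), "reward rate =", rho)
```

In this toy model the update should drive the probability of the slower action towards zero, because the criterion is reward per unit time rather than reward per decision. A multi-state SMDP would additionally need some form of eligibility trace or regenerative-cycle bookkeeping, which this sketch omits.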