| Article ID: | iaor2009616 |
| Country: | Netherlands |
| Volume: | 178 |
| Issue: | 3 |
| Start Page Number: | 808 |
| End Page Number: | 818 |
| Publication Date: | May 2007 |
| Journal: | European Journal of Operational Research |
| Authors: | Singh Sumeetpal S., Tadić Vladislav B., Doucet Arnaud |
| Keywords: | stochastic processes |
Solving a semi-Markov decision process (SMDP) using value or policy iteration requires precise knowledge of the probabilistic model and suffers from the curse of dimensionality. To overcome these limitations, we present a reinforcement learning approach in which one optimises the SMDP performance criterion with respect to a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it using stochastic approximation.
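
To make the idea in the abstract concrete, the following is a minimal sketch of online gradient ascent over a parameterised policy for a toy SMDP. It is not the authors' algorithm: the two-state model, the softmax policy, the names `REWARD`, `MEAN_SOJOURN`, `softmax` and `train`, and the step sizes are all assumptions made for illustration. The criterion is taken to be the long-run reward rate E[r] / E[tau], and the next state is assumed to be independent of the action, so a one-step score-function (likelihood-ratio) term is enough and no eligibility trace is kept.

```python
import math
import random

# Hypothetical toy SMDP, invented for illustration (not taken from the paper).
# REWARD[s][a]: lump reward earned in state s under action a.
# MEAN_SOJOURN[s][a]: mean of the exponential holding time until the next decision.
REWARD = [[1.0, 3.0], [2.0, 1.0]]
MEAN_SOJOURN = [[1.0, 1.0], [2.0, 0.25]]


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(num_steps=20000, step_theta=0.01, step_rho=0.01, seed=0):
    """Online score-function gradient ascent on the reward rate E[r] / E[tau]."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # one preference per (state, action)
    rho = 0.0                          # running estimate of the reward rate
    state = 0
    for _ in range(num_steps):
        probs = softmax(theta[state])
        action = 0 if rng.random() < probs[0] else 1
        reward = REWARD[state][action]
        sojourn = rng.expovariate(1.0 / MEAN_SOJOURN[state][action])

        # Time-adjusted reward; its mean is zero when rho equals the true rate.
        signal = reward - rho * sojourn

        # Policy step: d/d theta[s][b] of log pi(a|s) is 1[a == b] - pi(b|s).
        for b in range(2):
            indicator = 1.0 if b == action else 0.0
            theta[state][b] += step_theta * signal * (indicator - probs[b])

        # Stochastic approximation step for the rate estimate.
        rho += step_rho * signal

        # Assumption: the next state is uniform and does not depend on the action.
        state = rng.randrange(2)
    return theta, rho
```

Calling `theta, rho = train()` and then `softmax(theta[1])` shows which action the learned policy favours in state 1. In this toy model the action with the lower lump reward but much shorter sojourn time yields the higher reward rate, which is the kind of trade-off that distinguishes an SMDP criterion from a per-step one.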