Article ID: | iaor2002910 |
Country: | United States |
Volume: | 38 |
Issue: | 1 |
Start Page Number: | 94 |
End Page Number: | 123 |
Publication Date: | Dec 1999 |
Journal: | SIAM Journal on Control and Optimization |
Authors: | Borkar V.S., Konda V.R. |
Keywords: | learning |
Algorithms for learning the optimal policy of a Markov decision process (MDP) from simulated transitions are formulated and analyzed. These are variants of the well-known ‘actor-critic’ (or ‘adaptive critic’) algorithm from the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis relies on two-time-scale stochastic approximation.
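
To illustrate the general idea of an actor-critic scheme run on two time scales, the following is a minimal sketch and not the algorithm analyzed in the paper. Everything in it is an assumption made for illustration: the two-state toy simulator `step`, the discount factor, the softmax (Gibbs) policy parametrization, and the step-size exponents. The critic (state-value estimates) uses the larger step size and so moves on the faster time scale; the actor (action preferences) uses a step size that vanishes faster and so moves on the slower one.

```python
import math
import random

# All constants below are illustrative assumptions, not taken from the paper.
N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.5


def step(state, action, rng):
    """Hypothetical simulator: action 1 pays reward 1, action 0 pays 0.

    The next state is drawn uniformly, independent of state and action.
    """
    reward = 1.0 if action == 1 else 0.0
    return reward, rng.randrange(N_STATES)


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(n_steps=20000, seed=0):
    rng = random.Random(seed)
    values = [0.0] * N_STATES                              # critic: V(s)
    prefs = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # actor: preferences
    state = 0
    for n in range(n_steps):
        probs = softmax(prefs[state])
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        reward, next_state = step(state, action, rng)

        # Two time scales: actor_step / critic_step -> 0 as n grows.
        critic_step = (n + 10) ** -0.6
        actor_step = (n + 10) ** -0.9

        # Temporal-difference error computed from one simulated transition.
        td_error = reward + GAMMA * values[next_state] - values[state]

        # Fast update: critic tracks the value of the current policy.
        values[state] += critic_step * td_error

        # Slow update: actor moves preferences along the softmax score,
        # weighted by the TD error.
        for a in range(N_ACTIONS):
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[state][a] += actor_step * td_error * grad

        state = next_state
    return prefs, values
```

Calling `train_actor_critic()` returns the learned preferences and value estimates. In this toy problem action 1 is optimal in every state, so the preference for action 1 should end up larger than that for action 0 in both states.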