| Article ID: | iaor2002910 |
| Country: | United States |
| Volume: | 38 |
| Issue: | 1 |
| Start Page Number: | 94 |
| End Page Number: | 123 |
| Publication Date: | Dec 1999 |
| Journal: | SIAM Journal on Control and Optimization |
| Authors: | Borkar V.S., Konda V.R. |
| Keywords: | learning |
Algorithms for learning the optimal policy of a Markov decision process (MDP) from simulated transitions are formulated and analyzed. These are variants of the well-known ‘actor-critic’ (or ‘adaptive critic’) algorithm from the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis rests on two-time-scale stochastic approximation.
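The two-time-scale idea can be illustrated with a minimal tabular sketch: the critic (value estimates) is updated on a fast stepsize schedule and the actor (policy parameters) on a slower one, so that the actor sees a nearly converged critic. This is not the authors' exact scheme; the toy MDP, stepsize exponents, and softmax parameterization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (2 states, 2 actions), assumed for illustration:
# action 0 always yields reward 1, action 1 yields 0; next state is uniform.
n_states, n_actions, gamma = 2, 2, 0.9
R = np.array([[1.0, 0.0],
              [1.0, 0.0]])          # R[s, a]

def softmax(z):
    z = z - z.max()                  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

V = np.zeros(n_states)               # critic: state-value estimates
theta = np.zeros((n_states, n_actions))  # actor: policy parameters

s = 0
for n in range(1, 20001):
    beta = 1.0 / n ** 0.6            # fast stepsize (critic)
    alpha = 1.0 / n                  # slow stepsize (actor): alpha/beta -> 0
    pi = softmax(theta[s])
    a = rng.choice(n_actions, p=pi)  # simulate a transition
    r = R[s, a]
    s_next = int(rng.integers(n_states))
    delta = r + gamma * V[s_next] - V[s]  # TD(0) error
    V[s] += beta * delta             # critic update (fast time scale)
    grad = -pi.copy()
    grad[a] += 1.0                   # gradient of log pi(a|s) w.r.t. theta[s]
    theta[s] += alpha * delta * grad # actor update (slow time scale)
    s = s_next
```

After training, the policy in each state should place most of its probability on the rewarding action 0, with the critic's values approximating the discounted return under the improving policy.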