| Article ID: | iaor1999869 |
| Country: | United States |
| Volume: | 6 |
| Issue: | 6 |
| Start Page Number: | 1185 |
| End Page Number: | 1201 |
| Publication Date: | Nov 1994 |
| Journal: | Neural Computation |
| Authors: | Singh S.P., Jaakkola T., Jordan M.I. |
| Keywords: | programming: dynamic, artificial intelligence |
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton and the Q-learning algorithm of Watkins, can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
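The abstract names Watkins' Q-learning as one member of the class of DP-based learning algorithms covered by the convergence theorem. As a concrete illustration only, the sketch below shows tabular Q-learning in Python; the environment interface (`reset`, `step`, `actions`) is a hypothetical Gym-style assumption, not part of the paper, and the per-state-action step size 1/(1+n) is just one choice satisfying the usual stochastic-approximation conditions (step sizes summing to infinity with finite sum of squares) under which such algorithms can be shown to converge.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning (Watkins) with an epsilon-greedy behavior policy.

    Assumes a hypothetical Gym-style environment:
      env.reset() -> state
      env.step(action) -> (next_state, reward, done)
      env.actions -> list of discrete actions
    """
    Q = defaultdict(float)      # Q[(state, action)], initialized to 0
    visits = defaultdict(int)   # visit counts, used for decaying step sizes

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection over the current Q estimates.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Step size 1/(1+n) per (state, action) pair: one schedule
            # consistent with the stochastic-approximation conditions.
            alpha = 1.0 / (1.0 + visits[(state, action)])
            visits[(state, action)] += 1

            # Q-learning update: move Q(s, a) toward the one-step
            # bootstrapped target r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    return Q
```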