Article ID: iaor2003707
Country: Germany
Volume: 54
Issue: 1
Start Page Number: 63
End Page Number: 99
Publication Date: Jan 2001
Journal: Mathematical Methods of Operations Research (Heidelberg)
Authors: Cavazos-Cadena R.
This note concerns discrete-time Markov decision processes with a denumerable state space. A control policy is evaluated by the long-run expected average reward criterion, and the main feature of the model is that the reward function and the transition law depend on an unknown parameter. Besides standard continuity-compactness restrictions, it is supposed that the controller can use the observed history to generate a consistent estimation scheme, and that the system's transition-reward structure satisfies an adaptive version of the Lyapunov function condition. Within this context, a special implementation of the nonstationary value iteration method is studied, and it is shown that this technique produces convergent approximations to the solution of the optimality equation, a result that is used to construct an optimal adaptive policy.
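For context, the following is a minimal sketch of the standard objects the abstract refers to; the notation ($r$, $p$, $\theta$, $g$, $h$, $V_n$) is assumed for illustration and is not taken from the paper itself.

Long-run expected average reward of a policy $\pi$ from initial state $x$ when the true parameter value is $\theta$:
\[
  J(\pi, x, \theta)
  = \liminf_{n \to \infty} \frac{1}{n}\,
    \mathbb{E}^{\pi}_{x,\theta}\!\left[ \sum_{t=0}^{n-1} r(X_t, A_t, \theta) \right].
\]

Average-reward optimality equation for a fixed parameter value $\theta$, with optimal gain $g(\theta)$ and relative value function $h(\cdot,\theta)$:
\[
  g(\theta) + h(x, \theta)
  = \sup_{a \in A(x)} \left[ r(x, a, \theta)
    + \sum_{y \in S} p(y \mid x, a, \theta)\, h(y, \theta) \right],
  \qquad x \in S.
\]

Nonstationary value iteration, in which the unknown parameter is replaced at each stage by an estimate $\hat{\theta}_n$ computed from the observed history:
\[
  V_{n+1}(x)
  = \sup_{a \in A(x)} \left[ r(x, a, \hat{\theta}_n)
    + \sum_{y \in S} p(y \mid x, a, \hat{\theta}_n)\, V_n(y) \right],
  \qquad V_0 \equiv 0.
\]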