Article ID: | iaor19921465 |
Country: | Germany |
Volume: | 35 |
Start Page Number: | 491 |
End Page Number: | 503 |
Publication Date: | Jan 1991 |
Journal: | Mathematical Methods of Operations Research (Heidelberg) |
Authors: | Hübner G., Schäl M. |
The paper is concerned with a discounted Markov decision process with an unknown parameter that is estimated anew at each stage. Algorithms are proposed which are intermediate between, and include, the ‘classical’ (but time-consuming) principle of estimation and control and the simpler nonstationary value iteration, which converges more slowly. These algorithms perform a single policy improvement step after each estimation, and the policy thus obtained is then evaluated either completely (policy iteration) or incompletely (policy-value iteration). It is shown that, in particular, both of these methods lead to asymptotically discount-optimal policies. In addition, these results are generalized to cases where systematic errors do not vanish as the number of stages increases.
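The abstract describes the scheme concretely enough for a schematic illustration. Below is a minimal Python sketch, assuming a finite discounted MDP whose transition law depends on an unknown parameter θ; the model `P`, the rewards `r`, the estimate sequence, and names such as `n_eval_sweeps` are illustrative assumptions, not taken from the paper. The single improvement step followed by a few backup sweeps corresponds to the incomplete (policy-value-iteration) variant; complete evaluation would instead solve the linear system v = (I − βP_π)⁻¹ r_π.

```python
import numpy as np

def policy_value_iteration(P, r, beta, theta_estimates, n_eval_sweeps=3):
    """Sketch of the scheme in the abstract: at each stage, re-estimate
    the unknown parameter, perform one policy improvement step, then
    evaluate the new policy incompletely via a few fixed-point sweeps.

    P               : callable theta -> array (A, S, S) of transition matrices
    r               : array (A, S) of one-stage rewards
    beta            : discount factor in (0, 1)
    theta_estimates : sequence of parameter estimates, one per stage
    """
    A, S = r.shape
    v = np.zeros(S)                        # current value estimate
    for theta_hat in theta_estimates:      # parameter estimated anew each stage
        P_hat = P(theta_hat)               # model under the current estimate
        # One policy improvement step: greedy policy w.r.t. v.
        q = r + beta * (P_hat @ v)         # Q-values, shape (A, S)
        pi = q.argmax(axis=0)              # improved policy, shape (S,)
        # Incomplete evaluation of pi: a few sweeps instead of an exact solve.
        r_pi = r[pi, np.arange(S)]         # rewards under pi, shape (S,)
        P_pi = P_hat[pi, np.arange(S)]     # transitions under pi, shape (S, S)
        for _ in range(n_eval_sweeps):
            v = r_pi + beta * P_pi @ v
    return v, pi

if __name__ == "__main__":
    # Hypothetical 2-state, 2-action example: action 0 stays put,
    # action 1 switches state with unknown probability theta.
    rng = np.random.default_rng(0)
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    def P(theta):
        return np.array([
            [[1.0, 0.0], [0.0, 1.0]],
            [[1 - theta, theta], [theta, 1 - theta]],
        ])
    true_theta = 0.8
    # Estimates converging to the true parameter with vanishing noise.
    ests = [np.clip(true_theta + rng.normal(0, 1.0 / (k + 1)), 0.0, 1.0)
            for k in range(50)]
    v, pi = policy_value_iteration(P, r, beta=0.9, theta_estimates=ests)
    print("value estimate:", v, "policy:", pi)
```

In this sketch, vanishing estimation noise plays the role of the paper's consistent estimator; the generalization mentioned in the abstract would correspond to noise whose systematic component does not vanish with the stage count.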