Article ID: iaor20061366
Country: United States
Volume: 30
Issue: 3
Start Page Number: 545
End Page Number: 561
Publication Date: Aug 2005
Journal: Mathematics of Operations Research
Authors: Tsitsiklis John N., Mannor Shie
We consider the empirical state–action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.
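As an informal illustration of the object the abstract studies, the sketch below simulates a small, entirely hypothetical 2-state, 2-action MDP (not taken from the paper) under a fixed randomized stationary policy and computes the empirical state–action frequency vector, i.e., the fraction of time each state–action pair is visited. Under a stationary policy the induced chain's stationary distribution is one element of the limit polytope the abstract describes, so the empirical frequencies should settle near it for large horizons.

```python
import random
from collections import Counter

# Hypothetical 2-state, 2-action MDP (illustrative only, not from the paper).
# P[s][a] is the list of next-state probabilities after taking action a in state s.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}

# A fixed randomized stationary policy: policy[s][a] = probability of action a in state s.
policy = {0: [0.5, 0.5], 1: [0.7, 0.3]}

def empirical_frequencies(T, seed=0):
    """Simulate T steps from state 0 and return the empirical
    state-action frequency vector as a dict {(s, a): fraction of time}."""
    rng = random.Random(seed)
    counts = Counter()
    s = 0
    for _ in range(T):
        a = rng.choices([0, 1], weights=policy[s])[0]
        counts[(s, a)] += 1
        s = rng.choices([0, 1], weights=P[s][a])[0]
    return {sa: c / T for sa, c in counts.items()}

freqs = empirical_frequencies(100_000)
```

The entries of `freqs` sum to 1 by construction, and for this policy the marginal frequency of state 0 should be close to the stationary probability of state 0 in the induced Markov chain (about 0.458 for the numbers above), consistent with the strong convergence to the polytope asserted in the abstract.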