| Field | Value |
| --- | --- |
| Article ID: | iaor20061366 |
| Country: | United States |
| Volume: | 30 |
| Issue: | 3 |
| Start Page Number: | 545 |
| End Page Number: | 561 |
| Publication Date: | Aug 2005 |
| Journal: | Mathematics of Operations Research |
| Authors: | Tsitsiklis John N., Mannor Shie |
We consider the empirical state–action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.
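To make the object of study concrete, the following is a minimal sketch (not the authors' construction) of how an empirical state–action frequency vector arises: a toy two-state, two-action MDP is simulated under a fixed randomized stationary policy, and the fraction of time each state–action pair is visited is tallied. All transition probabilities and policy parameters here are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter

def empirical_frequencies(n_steps, seed=0):
    """Simulate a toy 2-state, 2-action MDP under a fixed randomized
    stationary policy and return the empirical state-action frequency
    vector (visit fraction of each state-action pair)."""
    rng = random.Random(seed)
    # Hypothetical transition kernel: P[(s, a)] is the probability of
    # moving to state 1 after taking action a in state s.
    P = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.6}
    # Hypothetical stationary policy: probability of action 1 in each state.
    policy = {0: 0.5, 1: 0.2}
    counts = Counter()
    s = 0
    for _ in range(n_steps):
        a = 1 if rng.random() < policy[s] else 0
        counts[(s, a)] += 1
        s = 1 if rng.random() < P[(s, a)] else 0
    return {sa: c / n_steps for sa, c in counts.items()}

freq = empirical_frequencies(100_000)
```

Under a fixed stationary policy the chain is ergodic here, so these fractions converge to the stationary state–action distribution; the paper's contribution concerns which limit points are achievable under *general* (possibly non-stationary, history-dependent) policies, and how fast the empirical vector concentrates near the limiting polytope.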