Report on an investigation carried out under Contract Nonr-2771 (04), Project NR 388-062, between the University of Oregon and the Office of Naval Research, U.S. Department of the Navy.
In a simulator-defined MDP, the Markovian dynamics and rewards are provided in the form of a simulator from which samples can be drawn. This paper studies MDP planning algorithms that attempt to minimize the number of simulator calls before terminating and outputting a policy that is approximately optimal with high...
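To make the setting concrete, the following is a minimal sketch of what a simulator-defined MDP and its call budget might look like. It is not the paper's algorithm. The names `CountingSimulator`, `toy_step`, and `estimate_mean_reward`, the two-state toy dynamics, and the fixed budget of 200 samples per action are all assumptions introduced for illustration.

```python
import random


class CountingSimulator:
    """Wraps a sampling function and counts how often it is called.

    Hypothetical helper: the planner only sees sample(), never the
    transition probabilities or the reward function themselves.
    """

    def __init__(self, step_fn, seed=0):
        self._step_fn = step_fn
        self._rng = random.Random(seed)
        self.calls = 0  # number of simulator calls made so far

    def sample(self, state, action):
        self.calls += 1
        return self._step_fn(state, action, self._rng)


def toy_step(state, action, rng):
    """Assumed toy dynamics: action 1 reaches state 1 with probability 0.8,
    action 0 with probability 0.2; landing in state 1 yields reward 1."""
    p = 0.8 if action == 1 else 0.2
    next_state = 1 if rng.random() < p else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward


def estimate_mean_reward(sim, state, action, n):
    """Monte Carlo estimate of the expected one-step reward from n samples."""
    total = 0.0
    for _ in range(n):
        _, reward = sim.sample(state, action)
        total += reward
    return total / n


sim = CountingSimulator(toy_step, seed=0)
q0 = estimate_mean_reward(sim, state=0, action=0, n=200)
q1 = estimate_mean_reward(sim, state=0, action=1, n=200)
greedy_action = 1 if q1 > q0 else 0
print(f"estimates: {q0:.2f}, {q1:.2f}; simulator calls used: {sim.calls}")
```

Here the sample count per action is fixed in advance. A planner of the kind the abstract describes would instead try to stop as early as the required accuracy and confidence allow, so that `sim.calls` stays small.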