| CPC G06F 17/17 (2013.01) [G06Q 10/06393 (2013.01); G06Q 30/0631 (2013.01); G06Q 50/04 (2013.01)] | 25 Claims |

|
1. A computer-implemented method for value function estimation, comprising:
(a) obtaining offline data D and a policy π, the offline data D including a set of tuples of a state, an action, a reward, and a resulting state;
(b) setting an initial value of an estimated value function $Q_0^{\pi}$ to zero and an initial value of a time step h to 1;
(c) computing, for each of candidate models $\{M_k\}_{k=1}^{K}$ of an environment, a bootstrapping estimator $\delta_k$ that estimates a value function $Q_h^{\pi}$ at time step h based on the offline data D and an estimated value function $Q_{h-1}^{\pi}$ at time step h−1 to obtain a candidate of an estimated value function $Q_{k,h}$ for each of the candidate models $\{M_k\}_{k=1}^{K}$, where k denotes the index of the models and K denotes the number of the models;
(d) computing, for each candidate of the estimated value function $Q_{k,h}$, a model selection criterion $C(Q_{k,h}; D, \pi, Q_{h-1}^{\pi})$, where the model selection criterion $C(Q_{k,h}; D, \pi, Q_{h-1}^{\pi})$ is a function to quantify a negative quality of the candidate of the estimated value function $Q_{k,h}$ based on the offline data D, a policy π to evaluate, and the estimated value function $Q_{h-1}^{\pi}$ at time step h−1;
(e) selecting a candidate of an estimated value function $Q_{k^*,h}$ as an estimated value function $Q_h^{\pi}$ at time step h, where $Q_{k^*,h}$ has a minimum value of the criterion $C(Q_{k^*,h}; D, \pi, Q_{h-1}^{\pi})$ among each candidate of the estimated value function $Q_{k,h}$;
(f) repeating steps (c) to (e) while incrementing the time step h until the time step h reaches a final time step H; and
(g) outputting an estimated value function $Q_H^{\pi}$ at the time step H.
|
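
Steps (a) to (g) of claim 1 can be illustrated with a minimal Python sketch. The claim does not fix the form of the candidate models, the bootstrapping estimator, or the model selection criterion, so everything concrete here is an assumption made for illustration: a tabular setting with integer states and actions, each candidate model given as a transition tensor `P` and a reward table `R`, a one-step model-based backup as the bootstrapping estimator, and the mean squared Bellman residual on the offline data as the criterion. The function name `estimate_value_function` and its arguments are hypothetical.

```python
import numpy as np

def estimate_value_function(data, policy, models, horizon):
    """Illustrative sketch of steps (a)-(g); all modeling choices are assumptions.

    data:    list of (state, action, reward, next_state) tuples (offline data D)
    policy:  array [S, A] with policy[s, a] = pi(a | s)
    models:  list of (P, R) candidates, P of shape [S, A, S], R of shape [S, A]
    horizon: final time step H
    """
    n_states, n_actions = policy.shape
    s, a, r, s_next = (np.array(col) for col in zip(*data))

    q_prev = np.zeros((n_states, n_actions))   # step (b): Q_0 = 0, h starts at 1
    chosen = []                                # selected model index k* per step
    for h in range(1, horizon + 1):
        # Expected value of the previous estimate under the policy, per state.
        v_prev = (policy * q_prev).sum(axis=1)

        # Step (c): assumed bootstrapping estimator, one backup per candidate model.
        candidates = [R + P @ v_prev for P, R in models]

        # Step (d): assumed criterion, mean squared Bellman residual on D
        # (lower is better, i.e. it quantifies negative quality).
        targets = r + v_prev[s_next]
        scores = [np.mean((q[s, a] - targets) ** 2) for q in candidates]

        # Step (e): keep the candidate with the minimum criterion value.
        k_star = int(np.argmin(scores))
        q_prev = candidates[k_star]
        chosen.append(k_star)                  # step (f): continue with the next h

    return q_prev, chosen                      # step (g): Q_H and the selections
```

In this sketch the model index is re-selected at every time step, so different candidate models may be chosen at different steps; returning `chosen` alongside the final estimate is a convenience for inspection and is not part of the claim.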