US 12,265,592 B2
Model aggregation for fitted Q-evaluation
Kohei Miyaguchi, Tokyo (JP)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Dec. 9, 2021, as Appl. No. 17/546,629.
Prior Publication US 2023/0185877 A1, Jun. 15, 2023
Int. Cl. G06F 17/17 (2006.01); G06Q 10/0639 (2023.01); G06Q 30/0601 (2023.01); G06Q 50/04 (2012.01)
CPC G06F 17/17 (2013.01) [G06Q 10/06393 (2013.01); G06Q 30/0631 (2013.01); G06Q 50/04 (2013.01)]
25 Claims
 
1. A computer-implemented method for value function estimation, comprising:
(a) obtaining offline data D and a policy π, the offline data D including a set of tuples of a state, an action, a reward, and a resulting state;
(b) setting an initial value of an estimated value function Q_0^π to zero and an initial value of a time step h to 1;
(c) computing, for each of candidate models {M_k}_{k=1}^K of an environment, a bootstrapping estimator δ_k that estimates a value function Q_h^π at time step h based on the offline data D and an estimated value function Q_{h−1}^π at time step h−1 to obtain a candidate of an estimated value function Q_{k,h} for each of the candidate models {M_k}_{k=1}^K, where k denotes the index of the models and K denotes the number of the models;
(d) computing, for each candidate of the estimated value function Q_{k,h}, a model selection criterion C(Q_{k,h}; D, π, Q_{h−1}^π), where the model selection criterion C(Q_{k,h}; D, π, Q_{h−1}^π) is a function to quantify a negative quality of the candidate of the estimated value function Q_{k,h} based on the offline data D, a policy π to evaluate, and the estimated value function Q_{h−1}^π at time step h−1;
(e) selecting a candidate of an estimated value function Q_{k*,h} as an estimated value function Q_h^π at time step h, where Q_{k*,h} has a minimum value of the criterion C(Q_{k*,h}; D, Q_{h−1}^π) among each candidate of the estimated value function Q_{k,h};
(f) repeating steps (c) to (f) with incrementing time step h until time step h reaches an end of a time step H; and
(g) outputting an estimated value function Q_H^π at the time step H.
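
The loop recited in steps (a) to (g) can be illustrated with a minimal tabular sketch. It is not the patented implementation: the claim does not fix the form of the candidate models, the bootstrapping estimator, or the criterion C, so everything concrete below is an assumption made for illustration. The two candidates (fit_tabular, fit_constant), the one-step Bellman regression targets, the in-sample squared-error criterion, the discount factor gamma, and all function names are hypothetical.

```python
import numpy as np

def bellman_targets(data, pi, q_prev, gamma):
    # Assumed bootstrapped target: y = r + gamma * sum_a' pi(a'|s') * Q_{h-1}(s', a')
    return np.array([r + gamma * pi[s2] @ q_prev[s2] for (_, _, r, s2) in data])

def fit_tabular(data, targets, n_states, n_actions):
    # Hypothetical candidate 1: mean target per (state, action) pair.
    total = np.zeros((n_states, n_actions))
    count = np.zeros((n_states, n_actions))
    for (s, a, _, _), y in zip(data, targets):
        total[s, a] += y
        count[s, a] += 1
    return total / np.maximum(count, 1)  # unseen pairs stay at zero

def fit_constant(data, targets, n_states, n_actions):
    # Hypothetical candidate 2: a single global mean for every (state, action).
    return np.full((n_states, n_actions), targets.mean())

def criterion(q, data, targets):
    # Assumed criterion C: in-sample mean squared error against the targets
    # (lower is better, i.e. it quantifies "negative quality").
    preds = np.array([q[s, a] for (s, a, _, _) in data])
    return float(np.mean((preds - targets) ** 2))

def aggregate_fqe(data, pi, candidates, n_states, n_actions, horizon, gamma=1.0):
    q = np.zeros((n_states, n_actions))           # step (b): Q_0 = 0, h = 1
    chosen = []
    for h in range(1, horizon + 1):               # step (f): iterate until h = H
        targets = bellman_targets(data, pi, q, gamma)
        qs = [fit(data, targets, n_states, n_actions) for fit in candidates]  # step (c)
        scores = [criterion(qk, data, targets) for qk in qs]                  # step (d)
        k_star = int(np.argmin(scores))           # step (e): minimum criterion
        q = qs[k_star]
        chosen.append(k_star)
    return q, chosen                              # step (g): Q_H

# Step (a): toy offline data of (state, action, reward, next state) and a uniform policy.
data = [(0, 0, 1.0, 1), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 2.0, 0)]
pi = np.full((2, 2), 0.5)
q_H, chosen = aggregate_fqe(data, pi, [fit_tabular, fit_constant], 2, 2, horizon=2)
```

Because this sketch scores candidates on the same samples used to fit them, it tends to favor the most flexible candidate; a held-out split or another criterion could be substituted in criterion without changing the surrounding loop.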