US 12,265,592 B2
Model aggregation for fitted Q-evaluation
Kohei Miyaguchi, Tokyo (JP)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Dec. 9, 2021, as Appl. No. 17/546,629.
Prior Publication US 2023/0185877 A1, Jun. 15, 2023
Int. Cl. G06F 17/17 (2006.01); G06Q 10/0639 (2023.01); G06Q 30/0601 (2023.01); G06Q 50/04 (2012.01)

CPC G06F 17/17 (2013.01) [G06Q 10/06393 (2013.01); G06Q 30/0631 (2013.01); G06Q 50/04 (2013.01)]

25 Claims

1. A computer-implemented method for value function estimation, comprising:

(a) obtaining offline data D and a policy π, the offline data D including a set of tuples of a state, an action, a reward, and a resulting state;

(b) setting an initial value of an estimated value function Q₀^π to zero and an initial value of a time step h to 1;

(c) computing, for each of candidate models {M_k}_k=1^K of an environment, a bootstrapping estimator δ_k that estimates a value function Q_h^π at time step h based on the offline data D and an estimated value function Q_h-1^π at time step h−1 to obtain a candidate of an estimated value function Q_k,h for each of the candidate models {M_k}_k=1^K, where k denotes the index of the models and K denotes the number of the models;

(d) computing, for each candidate of the estimated value function Q_k,h, a model selection criterion C(Q_k,h; D, π, Q_h-1^π), where the model selection criterion C(Q_k,h; D, π, Q_h-1^π) is a function to quantify a negative quality of the candidate of the estimated value function Q_k,h based on the offline data D, a policy π to evaluate, and the estimated value function Q_h-1^π at time step h−1;

(e) selecting a candidate of an estimated value function Q_k*,h as an estimated value function Q_h^π at time step h, where Q_k*,h has a minimum value of the criterion C(Q_k*,h; D, Q_h-1^π) among each candidate of the estimated value function Q_k,h;

(f) repeating steps (c) to (f) with incrementing time step h until time step h reaches an end of a time step H; and

(g) outputting an estimated value function Q_H^π at the time step H.
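
The loop of steps (a) to (g) can be illustrated with a minimal tabular sketch. It is not the patented implementation, and every concrete choice is an assumption made only for illustration: each candidate model M_k is represented by a hypothetical state-grouping array, the bootstrapping estimator is a per-(group, action) mean of the Bellman targets, the criterion C is the squared error against those targets on a held-out half of D, and a discount factor gamma is added as a parameter. The names fit_candidate, criterion, and aggregated_fqe are hypothetical.

```python
import numpy as np


def fit_candidate(groups, s, a, y, n_actions):
    """Assumed bootstrapping estimator: mean target per (state group, action)."""
    n_groups = int(groups.max()) + 1
    sums = np.zeros((n_groups, n_actions))
    counts = np.zeros((n_groups, n_actions))
    np.add.at(sums, (groups[s], a), y)
    np.add.at(counts, (groups[s], a), 1.0)
    # Unvisited (group, action) pairs keep the value 0.
    q_group = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return q_group[groups]  # expand to shape (n_states, n_actions)


def criterion(q_cand, s, a, y):
    """Assumed criterion C: mean squared error against the Bellman targets."""
    return float(np.mean((q_cand[s, a] - y) ** 2))


def aggregated_fqe(s, a, r, s_next, pi, candidates, horizon, gamma=1.0):
    """Sketch of steps (a)-(g) for tabular states and actions.

    s, a, r, s_next: arrays holding the offline tuples D.
    pi: array of shape (n_states, n_actions) with action probabilities.
    candidates: list of integer arrays mapping each state to a group.
    """
    n_states, n_actions = pi.shape
    q = np.zeros((n_states, n_actions))          # step (b): Q_0 = 0
    half = len(s) // 2                           # assumed split of D
    fit, val = slice(0, half), slice(half, None)
    chosen = []
    for h in range(1, horizon + 1):              # step (f)
        # Bellman targets built from Q_{h-1} and the policy pi.
        y = r + gamma * (pi[s_next] * q[s_next]).sum(axis=1)
        # Step (c): one candidate Q_{k,h} per candidate model.
        cands = [fit_candidate(g, s[fit], a[fit], y[fit], n_actions)
                 for g in candidates]
        # Step (d): score each candidate; lower is better.
        scores = [criterion(qc, s[val], a[val], y[val]) for qc in cands]
        # Step (e): keep the candidate with the minimum criterion value.
        k_star = int(np.argmin(scores))
        q = cands[k_star]
        chosen.append(k_star)
    return q, chosen                             # step (g): Q_H


# Toy usage: 4 states, 2 actions, reward equal to the state index.
s = np.tile(np.repeat(np.arange(4), 2), 2)
a = np.tile(np.array([0, 1]), 8)
r = s.astype(float)
s_next = (s + 1) % 4
pi = np.full((4, 2), 0.5)
candidates = [np.zeros(4, dtype=int), np.arange(4)]  # coarse vs. fine grouping
q_H, chosen = aggregated_fqe(s, a, r, s_next, pi, candidates, horizon=2)
```

Scoring on a held-out half of D is one way to keep the most flexible candidate from winning by default; the claim itself only requires that the criterion depend on D, π, and Q_h-1^π.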