US 12,217,137 B1
Meta-Q learning
Rasool Fakoor, San Jose, CA (US); Alexander Johannes Smola, Sunnyvale, CA (US); Stefano Soatto, Pasadena, CA (US); and Pratik Anil Chaudhari, Pasadena, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 30, 2020, as Appl. No. 17/039,447.
Int. Cl. G06N 20/00 (2019.01)
CPC G06N 20/00 (2019.01) 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving a set of training tasks;
initializing a replay buffer and parameters for a policy;
for each training task in the set of training tasks:
    obtaining training data for the training task by interacting with an environment;
    updating a context variable based on the training data;
    storing the training data in the replay buffer; and
    updating the parameters for the policy based on the training data to create a meta-trained policy;
performing adaptation on the meta-trained policy using task data and a subset of the training data to generate an adapted meta-trained policy, wherein the subset of the training data is identified using the context variable and a propensity score, wherein the propensity score indicates a similarity between the task data and at least one training task from the set of training tasks.
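The claimed method can be illustrated with a minimal sketch. All names below (`meta_train`, `adapt`, `propensity_score`, the toy `interact` and `update_policy` functions, and the dictionary-based policy and data records) are hypothetical illustrations, not from the patent; the propensity score here is a toy task-match similarity, whereas the claim only requires a score indicating similarity between the task data and at least one training task.

```python
def propensity_score(task_data, sample):
    """Toy similarity: high when the stored sample came from a task matching the new task data."""
    return 1.0 if sample["task"] == task_data["task"] else 0.1

def interact(task):
    """Toy, deterministic environment interaction producing training data for one task."""
    return [{"task": task, "reward": float(i)} for i in range(3)]

def update_policy(policy, data):
    """Toy parameter update: nudge a scalar parameter by the mean reward."""
    rewards = [d["reward"] for d in data]
    return {"theta": policy["theta"] + sum(rewards) / max(len(rewards), 1)}

def meta_train(training_tasks):
    replay_buffer = []       # initialize a replay buffer
    policy = {"theta": 0.0}  # initialize parameters for a policy
    context = {}             # context variable summarizing each task's training data
    for task in training_tasks:
        data = interact(task)                                        # obtain training data
        context[task] = sum(d["reward"] for d in data) / len(data)   # update context variable
        replay_buffer.extend(data)                                   # store data in the buffer
        policy = update_policy(policy, data)                         # meta-train the policy
    return policy, replay_buffer, context

def adapt(policy, task_data, replay_buffer, context, threshold=0.5):
    # Identify a subset of the training data using the context variable and a
    # propensity score, then fine-tune on the new task data plus that subset.
    subset = [d for d in replay_buffer
              if d["task"] in context and propensity_score(task_data, d) > threshold]
    return update_policy(policy, task_data["samples"] + subset)

policy, buffer, context = meta_train(["A", "B"])
new_task = {"task": "A", "samples": [{"task": "A", "reward": 2.0}]}
adapted = adapt(policy, new_task, buffer, context)
```

Here the adapted policy reuses only replay-buffer samples whose propensity score exceeds a threshold, which mirrors the claim's use of the score to pick training data similar to the new task.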