| CPC G06N 20/00 (2019.01) | 20 Claims |

|
1. A computer-implemented method comprising:
receiving a set of training tasks;
initializing a replay buffer and parameters for a policy;
for each training task in the set of training tasks:
    obtaining training data for the training task by interacting with an environment;
    updating a context variable based on the training data;
    storing the training data in the replay buffer; and
    updating the parameters for the policy based on the training data to create a meta-trained policy;
performing adaptation on the meta-trained policy using task data and a subset of the training data to generate an adapted meta-trained policy, wherein the subset of the training data is identified using the context variable and a propensity score, wherein the propensity score indicates a similarity between the task data and at least one training task from the set of training tasks.
|
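The steps recited in claim 1 can be illustrated with a minimal, hypothetical sketch. Nothing below is taken from the patent's specification. The one-parameter policy (`gain`), the toy environment in `rollout`, the choice of "best-rewarded action" as the context variable, the Gaussian-kernel propensity score, and the 0.5 selection threshold are all assumptions made only to show how the pieces might fit together.

```python
import math

# Fixed exploration offsets keep this toy example deterministic (assumption).
OFFSETS = [i * 0.5 for i in range(-6, 7)]


def rollout(goal, gain):
    """Toy stand-in for interacting with an environment (hypothetical).

    Returns (action, reward) pairs; reward is highest when action equals goal.
    """
    return [(gain + d, -((gain + d) - goal) ** 2) for d in OFFSETS]


def encode_context(transitions):
    """Toy context variable: the action that earned the highest reward."""
    return max(transitions, key=lambda t: t[1])[0]


def meta_train(goals, lr=0.1):
    """Loop over training tasks in the order recited in the claim."""
    gain = 0.0            # policy parameters (a single scalar here)
    replay_buffer = []    # entries are (task_id, action, reward)
    contexts = {}         # one context variable per training task
    for task_id, goal in enumerate(goals):
        data = rollout(goal, gain)                   # obtain training data
        contexts[task_id] = encode_context(data)     # update context variable
        replay_buffer.extend((task_id, a, r) for a, r in data)  # store data
        gain += lr * (contexts[task_id] - gain)      # update policy parameters
    return gain, contexts, replay_buffer


def propensity(task_context, train_context):
    """Assumed similarity score in (0, 1]; 1 means identical contexts."""
    return math.exp(-(task_context - train_context) ** 2)


def select_subset(replay_buffer, scores, threshold=0.5):
    """Keep buffered transitions whose source task is similar enough."""
    return [item for item in replay_buffer if scores[item[0]] >= threshold]


def adapt(gain, task_data, subset, lr=0.5):
    """Move the policy toward good actions from the task data and the subset."""
    by_task = {}
    for task_id, action, reward in subset:
        by_task.setdefault(task_id, []).append((action, reward))
    targets = [encode_context(task_data)]
    targets += [encode_context(ts) for ts in by_task.values()]
    return gain + lr * (sum(targets) / len(targets) - gain)


if __name__ == "__main__":
    meta_gain, contexts, buffer = meta_train([-2.0, -1.0, 1.0, 2.0])
    task_data = rollout(1.7, meta_gain)  # data from a new, unseen task
    task_context = encode_context(task_data)
    scores = {t: propensity(task_context, c) for t, c in contexts.items()}
    subset = select_subset(buffer, scores)
    print(meta_gain, scores, adapt(meta_gain, task_data, subset))
```

In this sketch the propensity score is compared against a fixed threshold, so only transitions from training tasks whose context resembles the new task's context are reused during adaptation. A real system would likely use a learned context encoder and a learned or estimated propensity model in place of these stand-ins.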