CPC G06N 3/08 (2013.01) [G06N 3/044 (2023.01)] | 20 Claims |
1. A method performed by a system of one or more computers and for training a controller neural network having a plurality of controller parameters to generate output sequences by determining trained values of the controller parameters from initial values of the controller parameters, the method comprising:
maintaining data identifying a set of K output sequences that were previously generated by the controller neural network during the training and, for each output sequence in the set, a respective reward that measures a quality of the output sequence, wherein K is an integer greater than one;
in each of a plurality of iterations, performing:
determining a first update to the current values of the controller parameters using one or more output sequences selected from the set of K output sequences;
generating a batch of new output sequences using the controller neural network in accordance with the current values of the controller parameters;
obtaining a respective reward for each of the new output sequences;
determining, from the new output sequences and the output sequences in the maintained data, the K output sequences that have the highest rewards; and
modifying the maintained data to identify the determined K output sequences and the respective reward for each of the K output sequences.
|
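The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: a table of per-position logits stands in for the controller neural network, the reward is a toy match count against a hypothetical `TARGET`, and the first update is assumed to be a log-likelihood gradient step on sequences drawn from the maintained set. All names, sizes, and hyperparameters (`SEQ_LEN`, `VOCAB`, `K`, `likelihood_update`, `merge_top_k`, `train`) are illustrative assumptions, and duplicate sequences are not filtered out of the maintained set.

```python
import heapq
import math
import random

SEQ_LEN, VOCAB, K = 4, 3, 5      # hypothetical toy sizes; K must be greater than one
TARGET = (2, 0, 1, 2)            # hypothetical target used only by the toy reward


def reward(seq):
    """Toy reward (assumption): number of positions that match TARGET."""
    return sum(int(a == b) for a, b in zip(seq, TARGET))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(params, rng):
    """Generate one output sequence from the current parameter values."""
    return tuple(rng.choices(range(VOCAB), weights=softmax(row))[0] for row in params)


def likelihood_update(params, seqs, lr=0.5):
    """One small gradient step per selected sequence, raising its log-likelihood."""
    for seq in seqs:
        for row, tok in zip(params, seq):
            probs = softmax(row)  # computed before the row is modified
            for j in range(VOCAB):
                row[j] += lr * ((1.0 if j == tok else 0.0) - probs[j]) / len(seqs)


def merge_top_k(maintained, new_items, k=K):
    """Return the k highest-reward (reward, sequence) pairs, best first."""
    return heapq.nlargest(k, list(maintained) + list(new_items))


def train(iterations=200, batch_size=8, seed=0):
    rng = random.Random(seed)
    params = [[0.0] * VOCAB for _ in range(SEQ_LEN)]  # initial parameter values
    # Seed the maintained data with K sequences generated by the controller.
    maintained = [(reward(s), s) for s in (sample(params, rng) for _ in range(K))]
    for _ in range(iterations):
        # First update, using sequences selected from the maintained set of K.
        chosen = [s for _, s in rng.sample(maintained, 2)]
        likelihood_update(params, chosen)
        # Generate a batch of new sequences and obtain a reward for each.
        scored = [(reward(s), s) for s in (sample(params, rng) for _ in range(batch_size))]
        # Keep the K highest-reward sequences among old and new ones.
        maintained = merge_top_k(maintained, scored)
    return params, maintained
```

Calling `train()` returns the final logits and the maintained list of `(reward, sequence)` pairs. Because each merge keeps the K largest pairs from the union of old and new sequences, the best reward in the maintained data never decreases from one iteration to the next.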