US 12,229,675 B2
Adaptive optimization with improved convergence
Sashank Jakkam Reddi, Jersey City, NJ (US); Sanjiv Kumar, Jericho, NY (US); and Satyen Chandrakant Kale, New York, NY (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 14, 2022, as Appl. No. 18/081,403.
Application 18/081,403 is a continuation of application No. 16/130,058, filed on Sep. 13, 2018, granted, now 11,586,904.
Prior Publication US 2023/0113984 A1, Apr. 13, 2023
Int. Cl. G06N 3/08 (2023.01); G06F 17/16 (2006.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)
CPC G06N 3/08 (2013.01) [G06F 17/16 (2013.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for iteratively training a model over a plurality of iterations based on a set of training data, wherein the model is parameterized by a set of parameters, and wherein the method comprises:
in a first training iteration of the plurality of iterations:
determining, by one or more computing devices, a first candidate learning rate control value based on a first current gradient for a loss function for the model, a current subset of the set of training data, and current values for the set of parameters;
setting, by the one or more computing devices, a first current maximum observed learning rate control value to be equivalent to a greater of two values comprising the first candidate learning rate control value and a first maximum previously observed learning rate control value,
wherein the first maximum previously observed learning rate control value is a maximum value of all previous candidate learning rate control values determined during all previous iterations of the set of iterations that are previous relative to the first training iteration;
determining, by the one or more computing devices, a first current learning rate based at least in part on the first current maximum observed learning rate control value; and
training, by the one or more computing devices, the model via updating the current values for the set of parameters based on the determined first current learning rate, the first current gradient of the loss function for the model, and the current subset of the set of training data; and
in a second training iteration of the plurality of iterations:
determining, by the one or more computing devices, a second candidate learning rate control value based on a second current gradient for the loss function for the model, the current subset of the set of training data, and the current values for the set of parameters;
setting, by the one or more computing devices, a second current maximum observed learning rate control value to be equivalent to a greater of two values comprising the second candidate learning rate control value and a second maximum previously observed learning rate control value,
wherein the second maximum previously observed learning rate control value is a maximum value of all previous candidate learning rate control values determined during all previous iterations of the set of iterations that are previous relative to the second training iteration;
determining, by the one or more computing devices, a second current learning rate based at least in part on the second current maximum observed learning rate control value; and
training, by the one or more computing devices, the model via updating the current values for the set of parameters based on the determined second current learning rate, the second current gradient of the loss function for the model, and the current subset of the set of training data.
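Read as an optimizer update, the claimed steps follow an AMSGrad-style rule: a per-coordinate estimate derived from the current gradient serves as the candidate learning rate control value, its running maximum over all prior iterations serves as the current maximum observed control value, and the effective step size is computed from that maximum. The sketch below is one illustrative reading of the claim, not the patent's reference implementation; the interpretation of the control value as an exponential average of squared gradients, and the names lr, beta1, beta2, and eps, are assumptions.

import numpy as np

def amsgrad_step(params, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One training iteration of the claimed method, read as an AMSGrad-style update."""
    m, v, v_max = state
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate of the current gradient
    v = beta2 * v + (1.0 - beta2) * grad**2     # candidate learning rate control value
    v_max = np.maximum(v_max, v)                # greater of the candidate and the maximum previously observed value
    step = lr / (np.sqrt(v_max) + eps)          # current learning rate from the running maximum
    return params - step * m, (m, v, v_max)    # updated current values for the set of parameters

# Illustrative first and second training iterations over subsets of the
# training data, mirroring the two iterations recited in the claim.
params = np.zeros(3)
state = (np.zeros(3), np.zeros(3), np.zeros(3))
for batch_grad in (np.array([0.5, -1.0, 2.0]), np.array([1.5, 0.2, -0.3])):
    params, state = amsgrad_step(params, batch_grad, state)

Because v_max is nondecreasing in every coordinate, the per-coordinate learning rate never increases across iterations, which is the convergence-repair property that distinguishes this update from a plain Adam step.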