US 12,242,928 B1
Artificial intelligence system providing automated distributed training of machine learning models
Xianshun Chen, Seattle, WA (US); Kai Liu, Bothell, WA (US); Nikhil Anand Navali, Seattle, WA (US); and Archiman Dutta, Shoreline, WA (US)
Assigned to Amazon Technologies, Inc.
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 19, 2020, as Appl. No. 16/824,480.
Int. Cl. G06N 20/00 (2019.01); G06N 5/01 (2023.01)
CPC G06N 20/00 (2019.01) [G06N 5/01 (2023.01)] 19 Claims
OG exemplary drawing
 
1. A system, comprising:
one or more processors and corresponding memory of one or more computing devices;
wherein the memory of the one or more computing devices includes instructions that upon execution on or across the one or more processors cause the one or more computing devices to:
identify a collection of compute resources for a machine learning training task comprising training, in parallel across the collection of compute resources, one or more machine learning models;
generate a plurality of control descriptors, including a first control descriptor and a second control descriptor, wherein individual control descriptors of the control descriptors indicate at least a training algorithm for training a model of the one or more models and values of one or more hyper-parameters of the training algorithm, wherein the first control descriptor indicates a different training algorithm or a different hyperparameter value than indicated by the second control descriptor;
assign respective unique descriptor identifiers to individual control descriptors of the plurality of control descriptors;
assign, to individual records of a plurality of records of a training data set of the machine learning training task, a respective batch identifier selected from a plurality of batch identifiers, wherein the batch identifier of a first subset of the plurality of records differs from the batch identifier of a second subset of the plurality of records;
generate a plurality of tuples, wherein individual tuples of the tuples indicate at least a respective record of the plurality of records of the training data set and a respective control descriptor of the plurality of control descriptors comprising the first and second control descriptors, wherein the number of tuples in the plurality of tuples is equal to the product of (a) the number of records of the training data set and (b) the number of control descriptors, indicating at least a training algorithm for training a model of the one or more models and values of one or more hyper-parameters of the training algorithm, of the plurality of control descriptors; and
cause a plurality of batch training iterations to be performed to train the one or more models, wherein a particular batch training iteration comprises:
identify a subset of the plurality of tuples whose records of the training data set were assigned a batch identifier corresponding to the batch training iteration;
distribute, using at least the descriptor identifiers indicated in the subset, the subset of tuples among the plurality of compute resources of the collection such that the number of distinct control descriptors of the tuples distributed to an individual compute resource is no greater than a threshold; and
train, in parallel at different compute resources of the plurality of compute resources and in accordance with corresponding training algorithms indicated by respective control descriptors of respective tuples of the distributed tuples, a new version of the one or more machine learning models using the records in the distributed tuples as training data sets, wherein a first tuple of the distributed tuples indicates the first control descriptor and a second tuple of the distributed tuples indicates the second control descriptor indicating a different training algorithm or a different hyperparameter value than indicated by the first control descriptor, and wherein the first tuple and the second tuple indicate the same records assigned the batch identifier corresponding to the batch training iteration, such that a first compute resource to which the first tuple is distributed and a second compute resource to which the second tuple is distributed use the same records of the training data set for the training and use the different training algorithm or the different hyperparameter value for the training, during the batch training iteration, and wherein the training comprises modifying a previous version of the machine learning model which was trained in an earlier iteration; and
provide, after a final batch training iteration of the plurality of batch training iterations has been completed, an indication of respective results obtained using trained versions of the one or more machine learning models for each of the first control descriptor and the second control descriptor in the final batch training iteration.
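The claimed workflow — control descriptors pairing a training algorithm with hyperparameter values, batch identifiers over training records, a record-by-descriptor cross product of tuples, and distribution of each iteration's tuples subject to a per-resource cap on distinct descriptors — can be sketched in Python. This is a minimal illustrative sketch, not the patented implementation; all names (`ControlDescriptor`, `assign_batch_ids`, `generate_tuples`, `distribute`, the round-robin batching and greedy placement policies) are hypothetical choices not specified by the claim.

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class ControlDescriptor:
    # unique descriptor identifier plus the algorithm/hyperparameter choice
    descriptor_id: int
    algorithm: str
    hyperparameters: tuple  # e.g. (("lr", 0.1),)

def assign_batch_ids(records, num_batches):
    """Assign each training record a batch identifier (round-robin here)."""
    return [(i % num_batches, rec) for i, rec in enumerate(records)]

def generate_tuples(records_with_batches, descriptors):
    """Cross product: len(records) * len(descriptors) tuples."""
    return [(batch_id, rec, d)
            for (batch_id, rec), d in itertools.product(records_with_batches,
                                                        descriptors)]

def distribute(tuples_for_batch, resources, max_distinct_descriptors):
    """Greedily place tuples so each compute resource receives at most
    max_distinct_descriptors distinct control descriptors; tuples that
    cannot be placed are left unassigned in this sketch."""
    assignment = {r: [] for r in resources}
    for t in tuples_for_batch:
        _, _, desc = t
        for r in resources:
            seen = {d.descriptor_id for (_, _, d) in assignment[r]}
            if desc.descriptor_id in seen or len(seen) < max_distinct_descriptors:
                assignment[r].append(t)
                break
    return assignment

records = ["rec0", "rec1", "rec2", "rec3"]
descriptors = [
    ControlDescriptor(0, "sgd", (("lr", 0.1),)),
    ControlDescriptor(1, "sgd", (("lr", 0.01),)),
]
tuples = generate_tuples(assign_batch_ids(records, num_batches=2), descriptors)

# One batch training iteration: select the tuples whose records carry this
# iteration's batch identifier, then distribute them across resources.
batch0 = [t for t in tuples if t[0] == 0]
placement = distribute(batch0, resources=["r0", "r1"],
                       max_distinct_descriptors=1)
```

With a cap of one distinct descriptor per resource, the greedy placement sends the same batch-0 records to both resources, each paired with a different descriptor — mirroring the claim's requirement that two resources train on identical records under differing algorithms or hyperparameter values within an iteration.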