US 12,277,480 B1
In-flight scaling of machine learning training jobs
Edo Liberty, New York, NY (US); Thomas Albert Faulhaber, Jr., Seattle, WA (US); Zohar Karnin, Hoboken, NJ (US); Gowda Dayananda Anjaneyapura Range, Redmond, WA (US); Amir Sadoughi, New York, NY (US); Swaminathan Sivasubramanian, Sammamish, WA (US); Alexander Johannes Smola, Sunnyvale, CA (US); Stefano Stefani, Issaquah, WA (US); and Craig Wiley, Redmond, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 23, 2018, as Appl. No. 15/934,091.
Claims priority of provisional application 62/590,134, filed on Nov. 22, 2017.
Int. Cl. G06N 20/00 (2019.01); G06F 9/48 (2006.01)

CPC G06N 20/00 (2019.01) [G06F 9/4881 (2013.01)]

20 Claims

1. A computer-implemented method comprising:

executing code on a computing device including a processor to implement a model training system in a provider network;

executing code on another computing device including a processor to implement a training control system in the provider network;

executing a machine learning (ML) training job using a first one or more compute instances of the model training system, wherein each of the first one or more compute instances performs a plurality of iterations of a work routine for the ML training job, each iteration including obtaining an identifier of a work unit from a progress manager of the training control system, obtaining a current state associated with the ML training job from a parameter server of the training control system, executing logic to update the current state associated with the ML training job after completing the work unit, sending the updated current state associated with the ML training job to the parameter server of the training control system, and sending a message to the progress manager of the training control system indicating that the work unit is complete;

determining, by the progress manager of the training control system, that a first compute instance from among the first one or more compute instances of the model training system obtained a first identifier of a first work unit for the ML training job from the progress manager of the training control system at a first time and updated the current state of the first work unit for the ML training job at a second time;

deriving, by the progress manager of the training control system from the first time and the second time, a speed of the ML training job;

determining, by the progress manager of the training control system by comparing a predicted training time to a threshold and based at least in part on the deriving of the speed of the ML training job, that a progress of the ML training job is not satisfactory;

adding, by the model training system based at least in part on the determining that the progress of the ML training job is not satisfactory, a second one or more compute instances of the model training system to the ML training job while the first one or more compute instances of the model training system continue to execute portions of the ML training job, wherein the second one or more compute instances are of a different type than the first one or more compute instances, and wherein the different type is selected based on an execution characteristic of the ML training job that is determined to be problematic hindering performance of the ML training job;

providing, to the second one or more compute instances of the model training system, an identifier of the parameter server and an identifier of a set of data that the second one or more compute instances of the model training system is to process;

determining, by the progress manager of the training control system, that the progress of the ML training job is satisfactory; and

removing, by the model training system based at least in part on the determining that the progress of the ML training job is satisfactory, one or more compute instances of the second one or more compute instances of the model training system or of the first one or more compute instances of the model training system from the ML training job.
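Illustrative sketch (not part of the claims): the progress-manager steps recited in claim 1 could be approximated as shown below. The claim does not specify how the speed or the predicted training time is computed, so the formulas here are assumptions: speed is taken as completed work units per second across all instances, and predicted time as remaining work units divided by that speed. The names `ProgressManager`, `issue`, `state_updated`, `scaling_action`, and all parameters are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import Dict, List, Optional


@dataclass
class ProgressManager:
    """Hypothetical progress manager; names and formulas are assumptions."""
    total_work_units: int
    max_training_seconds: float  # the "threshold" of claim 1 (assumed unit: seconds)
    issued_at: Dict[str, float] = field(default_factory=dict)
    durations: List[float] = field(default_factory=list)

    def issue(self, work_unit_id: str, now: float) -> None:
        # "First time": an instance obtains a work unit identifier.
        self.issued_at[work_unit_id] = now

    def state_updated(self, work_unit_id: str, now: float) -> None:
        # "Second time": the instance has updated the current state.
        self.durations.append(now - self.issued_at.pop(work_unit_id))

    def speed(self, instance_count: int) -> Optional[float]:
        # Assumed definition: work units per second summed over all instances.
        if not self.durations:
            return None
        return instance_count / fmean(self.durations)

    def predicted_remaining_seconds(self, instance_count: int) -> Optional[float]:
        rate = self.speed(instance_count)
        if rate is None:
            return None
        remaining = self.total_work_units - len(self.durations)
        return remaining / rate

    def scaling_action(self, instance_count: int, baseline_count: int) -> str:
        # Compare the predicted training time to the threshold.
        predicted = self.predicted_remaining_seconds(instance_count)
        if predicted is None:
            return "hold"      # no measurements yet
        if predicted > self.max_training_seconds:
            return "add"       # progress not satisfactory
        if instance_count > baseline_count:
            return "remove"    # progress satisfactory, extra instances present
        return "hold"
```

For example, if one work unit takes 2 seconds on a single instance and 99 units remain, the predicted time is 198 seconds; against a 60-second threshold the sketch returns "add". With four instances the prediction falls to 49.5 seconds, and the sketch returns "remove" for the instances added beyond the baseline. Selecting a different instance type for the added capacity, as the claim recites, is outside the scope of this sketch.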