US 11,693,706 B2
System and method for dynamic scheduling of distributed deep learning training jobs
Timothy Capes, Toronto (CA); Iqbal Mohomed, Toronto (CA); Vishal Raheja, Vancouver (CA); and Mete Kemertas, Toronto (CA)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Gyeonggi-do (KR)
Filed on Nov. 21, 2019, as Appl. No. 16/690,999.
Claims priority of provisional application 62/770,377, filed on Nov. 21, 2018.
Prior Publication US 2020/0159589 A1, May 21, 2020
Int. Cl. G06F 9/50 (2006.01); G06V 10/82 (2022.01); G06N 3/08 (2023.01); G06N 7/08 (2006.01); G06F 18/214 (2023.01); G06N 5/01 (2023.01); G06V 10/764 (2022.01); G06V 10/94 (2022.01); G06V 10/96 (2022.01); G06N 3/084 (2023.01)

CPC G06F 9/5038 (2013.01) [G06F 9/5061 (2013.01); G06F 18/214 (2023.01); G06N 3/08 (2013.01); G06N 3/084 (2013.01); G06N 5/01 (2023.01); G06N 7/08 (2013.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 10/955 (2022.01); G06V 10/96 (2022.01)]

17 Claims

1. A method of scheduling a plurality of jobs to a plurality of processing units (PUs), wherein the plurality of jobs comprises J jobs, the J being two or more, and the plurality of PUs comprises C PUs, the C being two or more, the method comprising:

initializing by provisionally assigning one PU to each job of the plurality of jobs; and

iteratively identifying and provisionally assigning to determine an allocation from the plurality of PUs to the plurality of jobs, wherein the iteratively identifying and the provisionally assigning, within an iteration, comprises:

at a first step within the iteration, identifying a greatest Time Improvement for running of job per PU based on a doubling heuristic for providing with a doubling of a number of PUs assigned, wherein the iteratively identifying identifies, at the first step within the iteration, that the greatest Time Improvement is associated with a particular job of the iteration,

at a second step within the iteration, provisionally assigning to the particular job of the iteration, an additional first number of PUs, wherein the additional first number of PUs doubles a number of PUs provisionally assigned to the particular job of the iteration, and

stopping the iteratively identifying and the provisionally assigning when a stopping condition is reached, wherein the stopping condition comprises a condition that no more PUs can be further provisionally assigned to any job of the plurality of jobs,

wherein the Time Improvement is determined based on the number of PUs assigned to a j-th job, training speed function of the j-th job and estimated number of epochs needed to complete the j-th job, wherein an epoch corresponds to one set of training data, and wherein j is evaluated from 1, . . . , J within any iteration.
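
The allocation loop recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the function name allocate_pus, the representation of each training speed function as a callable returning epochs per unit time for a given PU count, the run-time estimate (estimated epochs divided by speed), and the definition of Time Improvement per PU as the run-time saved by doubling divided by the number of PUs added. The sketch also assumes C is at least J so that every job can receive its initial PU.

```python
from typing import Callable, List, Sequence


def allocate_pus(
    num_pus: int,
    speed_fns: Sequence[Callable[[int], float]],
    epochs_needed: Sequence[float],
) -> List[int]:
    """Greedy doubling allocation of PUs to jobs (illustrative sketch).

    speed_fns[j](n) is assumed to return the training speed of job j
    (epochs per unit time) when n PUs are assigned to it.
    epochs_needed[j] is the estimated number of epochs to finish job j.
    """
    num_jobs = len(speed_fns)
    if num_pus < num_jobs:
        raise ValueError("assumption: at least one PU is available per job")

    # Initialization: provisionally assign one PU to each job.
    alloc = [1] * num_jobs
    free = num_pus - num_jobs

    def run_time(j: int, n: int) -> float:
        # Assumed estimate: remaining epochs divided by speed on n PUs.
        return epochs_needed[j] / speed_fns[j](n)

    while True:
        best_job = None
        best_gain = 0.0
        for j in range(num_jobs):
            extra = alloc[j]  # doubling requires as many PUs as already held
            if extra > free:
                continue  # this job cannot be doubled with the PUs left
            # Assumed Time Improvement per added PU under the doubling heuristic.
            gain = (run_time(j, alloc[j]) - run_time(j, 2 * alloc[j])) / extra
            if best_job is None or gain > best_gain:
                best_job, best_gain = j, gain

        # Stopping condition: no job can be provisionally assigned more PUs.
        if best_job is None:
            break

        # Second step: double the PUs of the job with the greatest improvement.
        free -= alloc[best_job]
        alloc[best_job] *= 2

    return alloc


if __name__ == "__main__":
    # Hypothetical jobs: one scales linearly, one with the square root of PUs.
    print(allocate_pus(8, [lambda n: n, lambda n: n ** 0.5], [100, 100]))
```

With these hypothetical speed functions and 8 PUs, the linearly scaling job is doubled first, after which the two jobs alternate until no further doubling fits. Because each step doubles a job's allocation, every job ends with a power-of-two number of PUs, and some PUs may remain unassigned when no remaining doubling fits.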