CPC G06F 9/4856 (2013.01) [G06F 9/461 (2013.01); G06F 9/54 (2013.01); G06N 3/08 (2013.01); G06T 1/20 (2013.01); G06T 1/60 (2013.01); H04L 67/568 (2022.05)]
19 Claims
1. A method for providing checkpointing of a deep learning training (DLT) job at one node in a cloud computing environment and resuming the DLT job from a checkpointed state on a different node, the method comprising:
capturing a graphics processing unit (GPU) state of a GPU executing the DLT job, wherein the GPU state includes GPU data comprising model parameters and an optimizer state located in the GPU at a time of checkpointing;
capturing a central processing unit (CPU) state of a CPU executing the DLT job;
migrating the DLT job to the different node at the checkpointed state using the GPU state and the CPU state;
initiating resumption of processing of the DLT job from the checkpointed state on the different node;
isolating a GPU-related activity into a separate proxy process that has an address space different from that of the GPU; and
computing the DLT job in a main process associated with the CPU,
wherein the proxy process is stateless across checkpoints, thereby isolating a temporary GPU-related mapping to the address space of the proxy process.
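The following is an illustrative, non-claim sketch of how the captured GPU state (model parameters and optimizer state) and CPU-side training progress recited in claim 1 might be saved and later restored on a different node, assuming a PyTorch-style training loop; the helper names and checkpoint path are illustrative assumptions, not limitations of the claims.

```python
# Illustrative only: capture the GPU state (model parameters and optimizer
# state) and the CPU-side training progress of a DLT job, then resume from
# the checkpointed state on a different node. The helper names and the
# checkpoint path are assumptions, not claim language.
import torch


def save_checkpoint(model, optimizer, epoch, step, path="/shared/dlt_job.ckpt"):
    """Capture GPU-resident state plus host-side progress counters."""
    torch.save(
        {
            "model_state": model.state_dict(),          # model parameters (GPU state)
            "optimizer_state": optimizer.state_dict(),  # optimizer state (GPU state)
            "epoch": epoch,                             # CPU state: training progress
            "step": step,
        },
        path,
    )


def load_checkpoint(model, optimizer, path="/shared/dlt_job.ckpt", device="cuda"):
    """Resume the DLT job from the checkpointed state, e.g. on a different node."""
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"], ckpt["step"]
```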
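The proxy-process isolation recited in claim 1 may be pictured with the minimal sketch below: GPU-related activity runs in a separate, stateless worker process whose address space holds all device mappings, while the main process computes the DLT job and can be checkpointed without GPU state of its own. The tagged-tuple message protocol over a pipe is an illustrative assumption.

```python
# Illustrative only: GPU-related activity isolated in a stateless proxy process
# with its own address space; the main process holds the DLT job's logical state.
import multiprocessing as mp


def gpu_proxy(conn):
    """Stateless proxy: all CUDA mappings live only in this process's address space."""
    import torch  # CUDA is initialized here, never in the main process
    while True:
        cmd, payload = conn.recv()
        if cmd == "matmul":
            a, b = (torch.tensor(x, device="cuda") for x in payload)
            conn.send((a @ b).cpu().tolist())  # only plain data crosses the boundary
        elif cmd == "shutdown":
            conn.close()
            break


if __name__ == "__main__":
    parent, child = mp.Pipe()
    proxy = mp.Process(target=gpu_proxy, args=(child,))
    proxy.start()

    # Main process computes the DLT job; GPU work is delegated to the proxy.
    parent.send(("matmul", ([[1.0, 2.0]], [[3.0], [4.0]])))
    print(parent.recv())  # [[11.0]]

    # At checkpoint time the proxy can simply be terminated and later recreated,
    # since it is stateless across checkpoints.
    parent.send(("shutdown", None))
    proxy.join()
```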