US 11,722,573 B2
Artificial intelligence workload migration for planet-scale artificial intelligence infrastructure service
Dharma Kiritkumar Shukla, Bellevue, WA (US); Muthian Sivathanu, Chennai (IN); Lu Xun, Redmond, WA (US); and Rimma Vladimirovna Nehme, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Jun. 25, 2021, as Appl. No. 17/359,471.
Claims priority of application No. 202141013580 (IN), filed on Mar. 26, 2021.
Prior Publication US 2022/0311832 A1, Sep. 29, 2022
Int. Cl. H04L 67/148 (2022.01); G06N 3/08 (2023.01); G06F 9/48 (2006.01); G06T 1/20 (2006.01); H04L 67/10 (2022.01)
CPC H04L 67/148 (2013.01) [G06F 9/4856 (2013.01); G06N 3/08 (2013.01); G06T 1/20 (2013.01); H04L 67/10 (2013.01)]
20 Claims
 
1. A computer-implemented method comprising:
capturing a graphics processing unit (GPU) state of a GPU executing a deep learning training (DLT) job at a source node of a cloud computing environment, wherein the GPU state includes GPU data comprising model parameters located in the GPU at a time of checkpointing;
capturing a central processing unit (CPU) state of a CPU executing the DLT job;
generating a distributed snapshot of all workers associated with the DLT job;
migrating the DLT job to a destination node at a checkpointed state using the GPU state and the CPU state according to the distributed snapshot, wherein the destination node is different than the source node; and
initiating resumption of processing of the DLT job from the checkpointed state on the destination node.
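
The steps recited in claim 1 can be illustrated with a small, purely in-memory sketch. It is not the patented implementation: the names `Worker`, `WorkerCheckpoint`, `capture_gpu_state`, `capture_cpu_state`, `distributed_snapshot`, `migrate`, and `run_demo` are hypothetical, plain Python lists stand in for GPU-resident model parameters, a dictionary stands in for CPU state, and the snapshot assumes every worker has already paused at the same step boundary.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Worker:
    """One worker of a DLT job (hypothetical stand-in, no real GPU involved)."""
    rank: int
    node: str
    gpu_params: List[float]   # stands in for model parameters held on the GPU
    cpu_state: Dict[str, int]  # stands in for CPU-side state such as the step counter

    def train_step(self) -> None:
        # Toy update so that resumed and uninterrupted runs can be compared.
        self.gpu_params = [p - 0.1 for p in self.gpu_params]
        self.cpu_state["step"] += 1


@dataclass(frozen=True)
class WorkerCheckpoint:
    rank: int
    gpu_state: Tuple[float, ...]
    cpu_state: Dict[str, int]


def capture_gpu_state(worker: Worker) -> Tuple[float, ...]:
    # Copy the parameters as they exist at the time of checkpointing.
    return tuple(worker.gpu_params)


def capture_cpu_state(worker: Worker) -> Dict[str, int]:
    return copy.deepcopy(worker.cpu_state)


def distributed_snapshot(workers: List[Worker]) -> Dict[int, WorkerCheckpoint]:
    # Assumption: consistency means all workers sit at the same training step.
    steps = {w.cpu_state["step"] for w in workers}
    if len(steps) != 1:
        raise RuntimeError("workers are not at a consistent step")
    return {
        w.rank: WorkerCheckpoint(w.rank, capture_gpu_state(w), capture_cpu_state(w))
        for w in workers
    }


def migrate(snapshot: Dict[int, WorkerCheckpoint],
            source_node: str, destination_node: str) -> List[Worker]:
    # The claim requires the destination node to differ from the source node.
    if destination_node == source_node:
        raise ValueError("destination node must differ from source node")
    return [
        Worker(c.rank, destination_node, list(c.gpu_state), copy.deepcopy(c.cpu_state))
        for c in snapshot.values()
    ]


def run_demo() -> Tuple[List[Worker], List[Worker]]:
    source = [Worker(r, "node-a", [1.0, 2.0], {"step": 0}) for r in range(2)]
    for _ in range(3):
        for w in source:
            w.train_step()
    snapshot = distributed_snapshot(source)
    resumed = migrate(snapshot, "node-a", "node-b")
    # Resume processing from the checkpointed state on the destination node.
    for w in resumed:
        w.train_step()
    return source, resumed
```

In this sketch, `run_demo` trains two workers for three steps on `node-a`, snapshots them, rebuilds them on `node-b`, and runs one further step there. A real system would have to capture device memory and process state rather than Python objects, which this example does not attempt.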