CPC H04L 67/148 (2013.01) [G06F 9/4856 (2013.01); G06N 3/08 (2013.01); G06T 1/20 (2013.01); H04L 67/10 (2013.01)] | 20 Claims |
1. A computer-implemented method comprising:
capturing a graphics processing unit (GPU) state of a GPU executing a deep learning training (DLT) job at a source node of a cloud computing environment, wherein the GPU state includes GPU data comprising model parameters located in the GPU at a time of checkpointing;
capturing a central processing unit (CPU) state of a CPU executing the DLT job;
generating a distributed snapshot of all workers associated with the DLT job;
migrating the DLT job to a destination node at a checkpointed state using the GPU state and the CPU state according to the distributed snapshot, wherein the destination node is different than the source node; and
initiating resumption of processing of the DLT job from the checkpointed state on the destination node.
|
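The steps recited in claim 1 can be illustrated with a minimal sketch. It is a toy simulation, not the claimed implementation: `Worker`, `capture_gpu_state`, `capture_cpu_state`, `distributed_snapshot`, and `migrate` are hypothetical names, plain Python dictionaries stand in for GPU-resident model parameters and host-side process state, and `pickle` stands in for whatever transport moves the checkpoint between nodes. The sketch also assumes all workers are paused at a common training step before the snapshot is taken.

```python
import copy
import pickle
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for one worker of a DLT job."""
    rank: int
    node: str
    # Stands in for model parameters resident in GPU memory.
    gpu_params: dict = field(default_factory=dict)
    # Stands in for host-side (CPU) state such as the step counter.
    cpu_state: dict = field(default_factory=lambda: {"step": 0})

    def train_step(self):
        # Toy update: nudge every parameter and advance the step counter.
        for name in self.gpu_params:
            self.gpu_params[name] += 0.5
        self.cpu_state["step"] += 1


def capture_gpu_state(worker):
    # Copy the "device" data as it exists at checkpoint time.
    return copy.deepcopy(worker.gpu_params)


def capture_cpu_state(worker):
    return copy.deepcopy(worker.cpu_state)


def distributed_snapshot(workers):
    # Assumption: a consistent snapshot requires every worker to be
    # quiesced at the same step (for example, at a barrier).
    steps = {w.cpu_state["step"] for w in workers}
    if len(steps) != 1:
        raise RuntimeError("workers are not at a common step")
    snapshot = {
        w.rank: {"gpu": capture_gpu_state(w), "cpu": capture_cpu_state(w)}
        for w in workers
    }
    return pickle.dumps(snapshot)


def migrate(blob, destination_node):
    # Rebuild every worker on the destination node from the snapshot.
    snapshot = pickle.loads(blob)
    return [
        Worker(rank=rank, node=destination_node,
               gpu_params=state["gpu"], cpu_state=state["cpu"])
        for rank, state in sorted(snapshot.items())
    ]


def run_demo():
    source = [Worker(rank=r, node="node-a", gpu_params={"w": 1.0})
              for r in range(2)]
    for w in source:
        w.train_step()                     # progress made on the source node
    blob = distributed_snapshot(source)    # checkpoint all workers
    moved = migrate(blob, "node-b")        # restore on a different node
    for w in moved:
        w.train_step()                     # resume from the checkpointed state
    return source, moved
```

Calling `run_demo()` returns the source workers, still at the checkpointed step, and the migrated workers, which continue from that step on the destination node. A real system would have to capture actual device memory and process state rather than Python objects.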