CPC G06F 9/4856 (2013.01) [G06F 9/461 (2013.01); G06F 9/54 (2013.01); G06N 3/08 (2013.01); G06T 1/20 (2013.01); G06T 1/60 (2013.01); H04L 67/568 (2022.05)]
19 Claims
1. A method for providing checkpointing of a deep learning training (DLT) job at one node in a cloud computing environment and resuming the DLT job from a checkpointed state on a different node, the method comprising:
capturing a graphics processing unit (GPU) state of a GPU executing the DLT job, wherein the GPU state includes GPU data comprising model parameters and an optimizer state located in the GPU at a time of checkpointing;
capturing a central processing unit (CPU) state of a CPU executing the DLT job;
migrating the DLT job to the different node at the checkpointed state using the GPU state and the CPU state;
initiating resumption of processing of the DLT job from the checkpointed state on the different node;
isolating a GPU-related activity into a separate proxy process that has an address space different from that of the GPU; and
computing the DLT job in a main process associated with the CPU,
wherein the proxy process is stateless across checkpoints, thereby isolating a temporary GPU-related mapping to the address space of the proxy process.
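The following is an illustrative, non-claim sketch of how the captured GPU state (model parameters and optimizer state) and CPU-side training progress recited in claim 1 might be saved and later restored on a different node, assuming a PyTorch-style training loop; the helper names and checkpoint path are illustrative assumptions, not limitations of the claims.

```python
# Illustrative only: capture the GPU state (model parameters and optimizer
# state) and the CPU-side training progress of a DLT job, then resume from
# the checkpointed state on a different node. The helper names and the
# checkpoint path are assumptions, not claim language.
import torch


def save_checkpoint(model, optimizer, epoch, step, path="/shared/dlt_job.ckpt"):
    """Capture GPU-resident state plus host-side progress counters."""
    torch.save(
        {
            "model_state": model.state_dict(),          # model parameters (GPU state)
            "optimizer_state": optimizer.state_dict(),  # optimizer state (GPU state)
            "epoch": epoch,                             # CPU state: training progress
            "step": step,
        },
        path,
    )


def load_checkpoint(model, optimizer, path="/shared/dlt_job.ckpt", device="cuda"):
    """Resume the DLT job from the checkpointed state, e.g. on a different node."""
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"], ckpt["step"]
```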
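The proxy-process isolation recited in claim 1 may be pictured with the minimal sketch below: GPU-related activity runs in a separate, stateless worker process whose address space holds all device mappings, while the main process computes the DLT job and can be checkpointed without GPU state of its own. The tagged-tuple message protocol over a pipe is an illustrative assumption.

```python
# Illustrative only: GPU-related activity isolated in a stateless proxy process
# with its own address space; the main process holds the DLT job's logical state.
import multiprocessing as mp


def gpu_proxy(conn):
    """Stateless proxy: all CUDA mappings live only in this process's address space."""
    import torch  # CUDA is initialized here, never in the main process
    while True:
        cmd, payload = conn.recv()
        if cmd == "matmul":
            a, b = (torch.tensor(x, device="cuda") for x in payload)
            conn.send((a @ b).cpu().tolist())  # only plain data crosses the boundary
        elif cmd == "shutdown":
            conn.close()
            break


if __name__ == "__main__":
    parent, child = mp.Pipe()
    proxy = mp.Process(target=gpu_proxy, args=(child,))
    proxy.start()

    # Main process computes the DLT job; GPU work is delegated to the proxy.
    parent.send(("matmul", ([[1.0, 2.0]], [[3.0], [4.0]])))
    print(parent.recv())  # [[11.0]]

    # At checkpoint time the proxy can simply be terminated and later recreated,
    # since it is stateless across checkpoints.
    parent.send(("shutdown", None))
    proxy.join()
```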