US 11,874,742 B2
Techniques for recovering from errors when executing software applications on parallel processors
Saurabh Hukerikar, Santa Clara, CA (US); and Nirmal Raj Saxena, Los Altos Hills, CA (US)
Assigned to NVIDIA CORPORATION, Santa Clara, CA (US)
Filed by NVIDIA CORPORATION, Santa Clara, CA (US)
Filed on Apr. 22, 2021, as Appl. No. 17/237,376.
Prior Publication US 2022/0342761 A1, Oct. 27, 2022
Int. Cl. G06F 11/14 (2006.01); G06F 9/38 (2018.01); G06F 9/30 (2018.01); G06F 11/07 (2006.01)
CPC G06F 11/1407 (2013.01) [G06F 9/30101 (2013.01); G06F 9/3861 (2013.01); G06F 11/0772 (2013.01); G06F 11/1438 (2013.01)]
20 Claims
1. A computer-implemented method for checkpointing a context associated with an execution of a software application on a parallel processor, the method comprising:
determining that a kernel executing on a plurality of parallel processing elements included in the parallel processor is tagged to indicate that the kernel is enabled for intra-kernel checkpointing and restart;
causing the plurality of parallel processing elements to stop executing a first plurality of instructions included in the kernel in accordance with the context before executing a next instruction included in the first plurality of instructions;
causing the parallel processor to collect first state data associated with the context;
generating a checkpoint based on the first state data, wherein the checkpoint is stored in a memory associated with the parallel processor; and
causing the plurality of parallel processing elements to resume executing the first plurality of instructions included in the kernel at the next instruction in accordance with the context.
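For illustration only, the steps recited in claim 1 can be sketched as a small host-side simulation. This is not the patented implementation and does not use any real GPU API: the names `ToyProcessor`, `Element`, `Kernel`, `checkpoint_enabled`, and `restore` are hypothetical, the "instructions" are simple integer additions, and the restart path is an assumption added to show why a checkpoint is useful.

```python
import copy
from dataclasses import dataclass


@dataclass
class Element:
    """One simulated processing element: a program counter and one register."""
    pc: int = 0
    acc: int = 0


@dataclass
class Kernel:
    """A toy kernel: a list of integer operands, plus a hypothetical tag."""
    instructions: list
    checkpoint_enabled: bool = False  # assumed stand-in for the claim's "tagged" kernel


class ToyProcessor:
    """Hypothetical stand-in for a parallel processor running in lockstep."""

    def __init__(self, num_elements):
        self.elements = [Element() for _ in range(num_elements)]
        self.memory = {}  # memory associated with the processor; holds the checkpoint

    def done(self, kernel):
        return all(e.pc >= len(kernel.instructions) for e in self.elements)

    def step(self, kernel):
        # Every element executes its next instruction, then advances its pc.
        for lane, e in enumerate(self.elements):
            if e.pc < len(kernel.instructions):
                e.acc += kernel.instructions[e.pc] * (lane + 1)
                e.pc += 1

    def checkpoint(self, kernel):
        # Step 1: only kernels tagged for intra-kernel checkpointing qualify.
        if not kernel.checkpoint_enabled:
            return False
        # Steps 2-3: the elements are stopped between steps, that is, before
        # the next instruction executes, so their state can be collected.
        state = copy.deepcopy(self.elements)
        # Step 4: generate the checkpoint and store it in processor memory.
        self.memory["checkpoint"] = state
        return True

    def restore(self):
        # Assumed restart path: reload the saved context after an error.
        self.elements = copy.deepcopy(self.memory["checkpoint"])

    def run(self, kernel, checkpoint_at=None):
        while not self.done(kernel):
            if checkpoint_at is not None and self.elements[0].pc == checkpoint_at:
                self.checkpoint(kernel)
            # Step 5: execution resumes at the next instruction.
            self.step(kernel)
        return [e.acc for e in self.elements]


kernel = Kernel(instructions=[1, 2, 3, 4], checkpoint_enabled=True)
proc = ToyProcessor(num_elements=2)
first_result = proc.run(kernel, checkpoint_at=2)

# Simulate an error, then restart from the checkpoint instead of from the beginning.
proc.elements[0].acc = -999
proc.restore()
restored_pc = proc.elements[0].pc
second_result = proc.run(kernel)
```

In this sketch the checkpoint is taken at an instruction boundary, so restoring it and resuming yields the same result as the uninterrupted run. An untagged kernel is never checkpointed, mirroring the determination step of the claim.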