US 11,720,440 B2
Error containment for enabling local checkpoint and recovery
Naveen Cherukuri, San Jose, CA (US); Saurabh Hukerikar, Santa Clara, CA (US); Paul Racunas, Landaff, NH (US); Nirmal Raj Saxena, Los Altos Hills, CA (US); David Charles Patrick, Madison, AL (US); Yiyang Feng, San Jose, CA (US); Abhijeet Ghadge, San Jose, CA (US); Steven James Heinrich, Madison, AL (US); Adam Hendrickson, San Jose, CA (US); Gentaro Hirota, Sunnyvale, CA (US); Praveen Joginipally, San Jose, CA (US); Vaishali Kulkarni, Sunnyvale, CA (US); Peter C. Mills, San Jose, CA (US); Sandeep Navada, San Jose, CA (US); Manan Patel, San Jose, CA (US); and Liang Yin, San Jose, CA (US)
Assigned to NVIDIA CORPORATION, Santa Clara, CA (US)
Filed by NVIDIA CORPORATION, Santa Clara, CA (US)
Filed on Jul. 12, 2021, as Appl. No. 17/373,678.
Prior Publication US 2023/0011863 A1, Jan. 12, 2023
Int. Cl. G06F 11/07 (2006.01); G06F 11/10 (2006.01); G06F 12/1018 (2016.01); G06F 11/14 (2006.01); G06F 12/1027 (2016.01)
CPC G06F 11/1016 (2013.01) [G06F 11/0772 (2013.01); G06F 11/0793 (2013.01); G06F 11/1407 (2013.01); G06F 12/1018 (2013.01); G06F 12/1027 (2013.01)]
19 Claims
 
1. A computer-implemented method for processing a memory error, the method comprising:
causing a first instruction that includes a memory load operation to be executed by a first memory client included in a plurality of memory clients;
receiving an indication that data associated with the memory load operation is corrupt; and
in response to receiving the indication:
disabling the first memory client from performing memory operations directed towards a shared resource, and
initiating one or more stall operations for the first memory client,
wherein a second memory client included in the plurality of memory clients continues to execute instructions that perform memory operations directed towards the shared resource while the first memory client is disabled.
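
The containment behavior recited in claim 1 can be illustrated with a minimal software sketch. This is not the patented hardware implementation; it is a toy model under stated assumptions. The names `SharedResource`, `MemoryClient`, `PoisonedData`, and `run_demo`, along with the per-address poison flag that stands in for the "indication that data ... is corrupt", are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


class PoisonedData(Exception):
    """Hypothetical signal that a load returned data flagged as corrupt."""


@dataclass
class SharedResource:
    # address -> (value, poisoned). The poison flag is an assumption of this
    # sketch; it models the corruption indication named in the claim.
    cells: dict = field(default_factory=dict)

    def load(self, addr):
        value, poisoned = self.cells[addr]
        if poisoned:
            raise PoisonedData(addr)
        return value

    def store(self, addr, value):
        self.cells[addr] = (value, False)


@dataclass
class MemoryClient:
    name: str
    enabled: bool = True    # may issue memory operations to the shared resource
    stalled: bool = False   # set once stall operations have been initiated
    log: list = field(default_factory=list)

    def execute_load(self, resource, addr):
        # A disabled client is kept away from the shared resource entirely.
        if not self.enabled:
            self.log.append(("blocked", addr))
            return None
        try:
            value = resource.load(addr)
        except PoisonedData:
            # Containment: disable this client only, then stall it.
            self.enabled = False
            self.stalled = True
            self.log.append(("contained", addr))
            return None
        self.log.append(("loaded", addr, value))
        return value


def run_demo():
    shared = SharedResource()
    shared.store(0x10, 7)
    shared.cells[0x20] = (0, True)  # inject a corrupt location

    first = MemoryClient("client0")
    second = MemoryClient("client1")
    results = [
        first.execute_load(shared, 0x20),   # corrupt load: first is contained
        second.execute_load(shared, 0x10),  # second is unaffected
        first.execute_load(shared, 0x10),   # first stays blocked
        second.execute_load(shared, 0x10),  # second keeps running
    ]
    return first, second, results
```

In this sketch the error state is tracked per client rather than globally, which is what allows the second client to keep using the shared resource while the first is disabled and stalled.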