US 12,287,702 B2
Fault management in a reconfigurable dataflow architecture
Raghunath Shenbagam, San Jose, CA (US); Ranen Chatterjee, Palo Alto, CA (US); Anand Misra, Palo Alto, CA (US); Jim Lewis, Palo Alto, CA (US); Benjamin Glick, Palo Alto, CA (US); Pushkar Nandkar, Palo Alto, CA (US); and Sruthi Veeragandham, Palo Alto, CA (US)
Assigned to SambaNova Systems, Inc., Palo Alto, CA (US)
Filed by SambaNova Systems, Inc., Palo Alto, CA (US)
Filed on Feb. 3, 2023, as Appl. No. 18/105,777.
Prior Publication US 2024/0264896 A1, Aug. 8, 2024
Int. Cl. G06F 11/07 (2006.01)
CPC G06F 11/0793 (2013.01) [G06F 11/0721 (2013.01); G06F 11/0769 (2013.01)]
20 Claims
 
1. A method comprising:
receiving, by one or more coarse grained reconfigurable processors, one or more fault events associated with a reconfigurable data flow unit (RDU) component in a system;
determining, by the one or more coarse grained reconfigurable processors and based on an inventory database, a component included in the RDU component that is associated with the one or more fault events;
creating, by the one or more coarse grained reconfigurable processors and based at least in part on the one or more fault events, an error report, the error report comprising:
an error type identifying a type of error associated with the one or more fault events;
a timestamp indicating when the error report was created; and
a universally unique identifier (UUID) to uniquely identify the error report;
determining, by the one or more coarse grained reconfigurable processors and based at least in part on the error report, a policy associated with the one or more fault events;
classifying, by the one or more coarse grained reconfigurable processors and based at least in part on the policy, the one or more fault events as either a threshold event or a discrete event;
performing, by the one or more coarse grained reconfigurable processors, one or more actions to address the one or more fault events; and
notifying an application operated by the RDU of the occurrence of the one or more fault events, the classification of the one or more fault events, and the actions performed to address the one or more fault events.
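
The steps of claim 1 can be sketched in Python as a single fault-handling pass. This is a minimal illustration under stated assumptions, not the patented implementation: the names FaultEvent, ErrorReport, Policy, INVENTORY, POLICIES, and handle_faults are all hypothetical, the inventory database and policy table are stand-in dictionaries, and the rule that a threshold event escalates only once the event count reaches a limit is an assumed reading that the claim itself does not spell out. Actions are represented by name only; nothing is executed against hardware.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventClass(Enum):
    THRESHOLD = "threshold"
    DISCRETE = "discrete"


@dataclass(frozen=True)
class FaultEvent:
    rdu_id: str      # hypothetical identifier of the RDU component
    address: int     # hypothetical location key for the inventory lookup
    error_type: str


@dataclass(frozen=True)
class ErrorReport:
    error_type: str
    component: str
    # Timestamp of report creation and a UUID, as listed in the claim.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    report_id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass(frozen=True)
class Policy:
    event_class: EventClass
    actions: tuple
    threshold: int = 1  # assumed: event count at which a threshold event escalates


# Hypothetical stand-ins for the inventory database and the policy table.
INVENTORY = {("rdu0", 0x10): "pcu_3", ("rdu0", 0x20): "pmu_7"}
POLICIES = {
    "correctable_ecc": Policy(EventClass.THRESHOLD, ("log", "retire_page"), threshold=3),
    "link_down": Policy(EventClass.DISCRETE, ("reset_link",)),
}


def handle_faults(events, notify):
    """Run one pass over a non-empty batch of events sharing a type and location."""
    first = events[0]
    # Determine the affected component from the inventory database.
    component = INVENTORY[(first.rdu_id, first.address)]
    # Create the error report (error type, timestamp, UUID).
    report = ErrorReport(error_type=first.error_type, component=component)
    # Determine the policy from the report, then classify the events.
    policy = POLICIES[report.error_type]
    event_class = policy.event_class
    # Select actions; below the assumed threshold, only log.
    if event_class is EventClass.THRESHOLD and len(events) < policy.threshold:
        actions = ("log",)
    else:
        actions = policy.actions
    # Notify the application of the occurrence, classification, and actions.
    notify(report, event_class, actions)
    return report, event_class, actions
```

In this sketch the caller supplies notify as a callback, which keeps the notification channel to the application unspecified, as it is in the claim.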