US 11,893,424 B2
Training a neural network using a non-homogenous set of reconfigurable processors
Martin Russell Raumann, San Leandro, CA (US); Qi Zheng, Fremont, CA (US); Bandish B. Shah, San Francisco, CA (US); Ravinder Kumar, Fremont, CA (US); Kin Hing Leung, Cupertino, CA (US); Sumti Jairath, Santa Clara, CA (US); and Gregory Frederick Grohoski, Bee Cave, TX (US)
Assigned to SambaNova Systems, Inc., Palo Alto, CA (US)
Filed by SambaNova Systems, Inc., Palo Alto, CA (US)
Filed on Jan. 24, 2022, as Appl. No. 17/582,925.
Application 17/582,925 is a continuation of application No. 17/379,924, filed on Jul. 19, 2021, granted, now Pat. No. 11,237,880.
Application 17/379,924 is a continuation-in-part of application No. 17/127,929, filed on Dec. 18, 2020, granted, now Pat. No. 11,182,221, issued on Nov. 23, 2021.
Prior Publication US 2022/0197714 A1, Jun. 23, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/50 (2006.01); G06F 9/455 (2018.01); G06N 3/02 (2006.01)
CPC G06F 9/5077 (2013.01) [G06F 9/45558 (2013.01); G06F 9/5027 (2013.01); G06N 3/02 (2013.01); G06F 2009/4557 (2013.01)] 11 Claims
OG exemplary drawing
 
1. A system for training parameters of a neural network using training data samples that include a plurality of input values and a target output value, comprising:
a plurality of processing nodes including a processing node and a second processing node, wherein each processing node of the plurality of processing nodes includes:
a respective plurality of processors reconfigurable at a first level of configuration granularity;
a respective smart network interface controller comprising a respective controller configurable at a second level of configuration granularity; and
a respective interconnect fabric communicatively coupling processors of the respective plurality of processors to the respective smart network interface controller;
the processing node comprising a plurality of processors reconfigurable at the first level of configuration granularity, including a processor, and a smart network interface controller comprising a controller reconfigurable at the second level of configuration granularity finer than the first level of configuration granularity, the controller communicatively coupled to the processor through an intra-node communication link, and to a network interface controller on the second processing node through a network;
the processor configured by a host system to execute a first dataflow segment of the neural network having one or more dataflow pipelines to generate a predicted output value using a first subset of the plurality of input values and a set of neural network parameters, calculate a first intermediate result for a parameter of the set of neural network parameters based on the predicted output value and the target output value, and provide the first intermediate result to the controller;
wherein the processor comprises a Coarse-Grained Reconfigurable Architecture (CGRA); and
the controller configured by the host system to receive a second intermediate result from the network interface controller through the network, and execute a second dataflow segment, dependent upon the first intermediate result and the second intermediate result, to generate a third intermediate result indicative of an update of the parameter of the set of neural network parameters;
wherein the controller is configured to produce the third intermediate result using a collective methodology;
wherein the respective smart network interface controllers of the plurality of processing nodes are connected in a ring configuration and the collective methodology comprises a uni-directional ring all-reduce; and
wherein none of the first intermediate result, the second intermediate result, or the third intermediate result passes through the host system.
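
Illustrative sketch (not part of the patent text). Claim 1 splits one training step between the node's reconfigurable processor and the SmartNIC-side controller. The following Python fragment is a minimal sketch of that division of labor, assuming a one-parameter linear model with squared-error loss in place of the claimed neural network; the names forward, local_gradient, and controller_combine are hypothetical and come from neither the patent nor any SambaNova API.

def forward(w, x):
    # "First dataflow segment", part 1: the predicted output value,
    # computed from a subset of the input values and the parameter w.
    return w * x

def local_gradient(w, x, y_target):
    # "First intermediate result": gradient of the squared-error loss
    # 0.5 * (forward(w, x) - y_target)**2 with respect to w, produced
    # on the node's reconfigurable (CGRA) processor.
    return (forward(w, x) - y_target) * x

def controller_combine(grad_local, grad_remote, w, lr=0.1):
    # "Second dataflow segment", run on the SmartNIC-side controller:
    # reduce the local gradient with the one received over the network
    # (the "second intermediate result") and emit the parameter update
    # (the "third intermediate result").
    grad_avg = (grad_local + grad_remote) / 2.0
    return w - lr * grad_avg

# Two processing nodes train the shared parameter on different shards.
w = 0.5
g0 = local_gradient(w, x=2.0, y_target=3.0)   # this processing node
g1 = local_gradient(w, x=4.0, y_target=5.0)   # second processing node
print(controller_combine(g0, g1, w))          # updated parameter: 1.3

Because the combine step runs on the controller, the gradients move processor-to-NIC and NIC-to-NIC, which is how the claim keeps all three intermediate results out of the host system.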
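
The claim also names the collective: a uni-directional ring all-reduce over the SmartNICs. The sketch below simulates that collective in Python/NumPy under stated assumptions: ring_all_reduce is a hypothetical name, the per-node gradients sit in an ordinary Python list rather than in NIC memory, and each assignment stands in for a network transfer from node i to node (i + 1) % n, always in one direction around the ring.

import numpy as np

def ring_all_reduce(grads):
    # Sum one gradient vector per node in two phases. Reduce-scatter:
    # after n-1 steps, node j holds the fully summed chunk (j+1) % n.
    # All-gather: circulate each completed chunk the rest of the way
    # around the ring. Every transfer goes node i -> node (i+1) % n.
    n = len(grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n))
              for g in grads]
    for step in range(n - 1):                      # reduce-scatter
        for i in range(n):
            c = (i - step) % n                     # chunk node i forwards
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
    for step in range(n - 1):                      # all-gather
        for i in range(n):
            c = (i + 1 - step) % n                 # completed chunk to pass on
            chunks[(i + 1) % n][c] = chunks[i][c]
    return [np.concatenate(node) for node in chunks]

# Three "SmartNICs", each holding a local gradient for six parameters.
local = [np.arange(6.0) * (i + 1) for i in range(3)]
reduced = ring_all_reduce(local)
assert all(np.array_equal(r, sum(local)) for r in reduced)

The ring schedule is the usual reason to choose this collective: each node sends and receives roughly 2(n-1)/n times the gradient size regardless of node count, so the exchange is bounded by per-link bandwidth rather than by the number of processing nodes, which suits gradient traffic handled entirely by the NICs.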