US 12,487,965 B2
All reduce across multiple reconfigurable dataflow processors
Mingran Wang, San Jose, CA (US)
Assigned to SambaNova Systems, Inc., Palo Alto, CA (US)
Filed by SambaNova Systems, Inc., Palo Alto, CA (US)
Filed on Jun. 9, 2023, as Appl. No. 18/208,048.
Claims priority of provisional application 63/350,862, filed on Jun. 9, 2022.
Prior Publication US 2023/0409520 A1, Dec. 21, 2023
Int. Cl. G06F 9/44 (2018.01); G06F 8/41 (2018.01); G06F 15/173 (2006.01); G06F 15/82 (2006.01); G06F 17/16 (2006.01)
CPC G06F 15/825 (2013.01) [G06F 8/433 (2013.01); G06F 8/4441 (2013.01); G06F 15/17375 (2013.01); G06F 17/16 (2013.01)] 19 Claims
OG exemplary drawing
 
1. A computing system, the system comprising:
a host computer comprising a graph optimization module configured to conduct a method comprising:
receiving a compute graph for execution on multiple reconfigurable dataflow processors (RDPs), the multiple RDPs being interconnected with a ring network, the ring network having R interconnected RDPs, including a first RDP and a second RDP adjacent to the first RDP in the ring network, wherein R is an integer value;
detecting a node of the compute graph that specifies a reduction operation for a first tensor and a second tensor;
partitioning the node of the compute graph into a compute subgraph corresponding to the first RDP;
inserting a first inserted node into the compute subgraph that specifies a partial reduction operation for producing a partial reduction result corresponding to a shard of the first tensor and a shard of the second tensor;
inserting a second inserted node into the compute subgraph for communicating the partial reduction result to the second RDP;
inserting a third inserted node into the compute subgraph that specifies a reduction operation for producing a total reduction result for the first tensor and the second tensor; and
inserting a fourth inserted node into the compute subgraph for communicating the total reduction result to the first RDP.
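The four inserted nodes in the claim correspond to the phases of a conventional ring all-reduce: each RDP produces a partial reduction over its shard, forwards partials around the ring until one RDP holds a total per shard (reduce-scatter), then circulates the totals back to every RDP (all-gather). The sketch below is an illustrative simulation of that communication pattern, not code from the patent; the function name, the use of scalar shards, and the in-memory "send" lists are all assumptions made for clarity.

```python
def ring_all_reduce(shards):
    """Simulate ring all-reduce over R devices.

    shards[i][c] is chunk c held by device i (scalars here for
    simplicity; real tensors would be sharded the same way).
    Returns the per-device buffers, each equal to the elementwise sum.
    """
    R = len(shards)
    buf = [list(dev) for dev in shards]  # working copy per device

    # Reduce-scatter: R-1 steps; device i sends a partial reduction of
    # one chunk to its ring neighbor (i+1) % R, which accumulates it.
    for step in range(R - 1):
        sends = [((i + 1) % R, (i - step) % R, buf[i][(i - step) % R])
                 for i in range(R)]        # (dest, chunk, partial value)
        for dst, c, val in sends:
            buf[dst][c] += val             # partial reduction on arrival
    # Now device i holds the total reduction for chunk (i+1) % R.

    # All-gather: R-1 steps; each device forwards a fully reduced chunk
    # so every device ends up with the complete total.
    for step in range(R - 1):
        sends = [((i + 1) % R, (i + 1 - step) % R, buf[i][(i + 1 - step) % R])
                 for i in range(R)]
        for dst, c, val in sends:
            buf[dst][c] = val              # overwrite with the total
    return buf
```

With three devices holding `[1, 2, 3]`, `[4, 5, 6]`, and `[7, 8, 9]`, every device finishes with `[12, 15, 18]`. Each device sends only one chunk per step, which is why the claim inserts separate nodes for the partial reduction, the neighbor transfer, the total reduction, and the return communication.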