US 12,468,924 B2
Parallel computing scheme generation for neural networks
Chong Li, Boulogne Billancourt (FR); Thibaut Tachon, Boulogne Billancourt (FR); Hongxing Wang, Shenzhen (CN); Kelun Chai, Boulogne Billancourt (FR); and Chang Liu, Shenzhen (CN)
Assigned to Huawei Technologies Co., Ltd., Shenzhen (CN)
Filed by HUAWEI TECHNOLOGIES CO., LTD., Shenzhen (CN)
Filed on Sep. 27, 2022, as Appl. No. 17/953,991.
Application 17/953,991 is a continuation of application No. PCT/EP2020/058707, filed on Mar. 27, 2020.
Prior Publication US 2023/0024350 A1, Jan. 26, 2023
Int. Cl. G06N 3/06 (2006.01)
CPC G06N 3/06 (2013.01) 15 Claims
OG exemplary drawing
 
1. A device for determining a parallel computation scheme for a neural network, the device comprising at least one processor configured to:
receive a computation graph for the neural network;
transform the computation graph into a recursive dataflow graph comprising a plurality of recursive subgraphs, wherein each of the recursive subgraphs is respectively a tuple of another of the recursive subgraphs and an operator node;
determine a number of partitioning recursions based on a number of parallel computing devices;
for each of the partitioning recursions:
determine a plurality of costs corresponding to a plurality of operator nodes associated with the recursive dataflow graph,
determine a processing order of the plurality of recursive subgraphs based on a descending order of the plurality of costs,
process the plurality of recursive subgraphs in the determined processing order, wherein processing a recursive subgraph, of the plurality of recursive subgraphs, comprises selecting a partitioning axis for tensors associated with an operator node of the recursive subgraph;
output a partitioning scheme comprising a partitioning axis for each of the tensors associated with the plurality of operator nodes; and
wherein the partitioning axis for the tensors associated with the operator node is selected based on an inter-operator communication cost, the inter-operator communication cost comprising an amount of data to be communicated between the parallel computing devices for executing a neighboring operator node based on a tensor shared between the operator node and the neighboring operator node, or for executing the operator node based on an output of the neighboring operator node.
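The claimed procedure can be sketched in code. The following is a minimal illustration only, not the patented implementation: the `OpNode` class, the `comm_cost` function, and the assumption that each partitioning recursion corresponds to one binary split per factor of 2 in the device count are all hypothetical choices made for this sketch.

```python
import math

class OpNode:
    """Hypothetical operator node (illustrative, not from the patent)."""
    def __init__(self, name, cost, axes):
        self.name = name    # operator identifier
        self.cost = cost    # estimated cost used to order processing
        self.axes = axes    # candidate partitioning axes for its tensors

def comm_cost(op, axis, neighbors, chosen):
    """Toy stand-in for the inter-operator communication cost: count
    neighboring operators whose already-chosen axis differs from `axis`,
    since mismatched axes imply data exchanged between devices."""
    cost = 0
    for nb in neighbors.get(op.name, []):
        if nb in chosen and chosen[nb] != axis:
            cost += 1
    return cost

def plan_partitioning(ops, neighbors, num_devices):
    # Number of partitioning recursions derived from the number of
    # parallel computing devices (assumed here: one binary split per
    # power of 2 in the device count).
    recursions = int(math.log2(num_devices))
    chosen = {}
    for _ in range(recursions):
        # Process operator nodes in descending order of cost.
        for op in sorted(ops, key=lambda o: o.cost, reverse=True):
            # Select the partitioning axis minimizing the
            # inter-operator communication cost.
            chosen[op.name] = min(
                op.axes, key=lambda ax: comm_cost(op, ax, neighbors, chosen)
            )
    # Output: a partitioning scheme mapping each operator to an axis.
    return chosen
```

For example, with two connected operators and four devices, the costlier operator is assigned an axis first, and its neighbor then adopts the matching axis to avoid re-sharding traffic:

```python
ops = [OpNode("matmul", cost=10, axes=[0, 1]),
       OpNode("relu", cost=2, axes=[0, 1])]
neighbors = {"matmul": ["relu"], "relu": ["matmul"]}
scheme = plan_partitioning(ops, neighbors, num_devices=4)
```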