CPC G06N 3/063 (2013.01); G06F 8/41 (2013.01). 18 Claims.
1. A system, comprising:
an inference accelerator comprising a plurality of tensor processing units with respective on-board memories to implement respective state buffers for respective systolic arrays;
at least one processor;
a memory storing program instructions that, when executed by the at least one processor, cause the at least one processor to implement a neural network compiler, wherein the neural network compiler is configured to:
receive a neural network comprising a plurality of layers that comprise respective operations for execution across the plurality of tensor processing units of the inference accelerator;
access a configuration of the plurality of tensor processing units with the respective on-board memories;
based on the configuration, determine respective capacities of respectively dedicated caches for individual ones of the plurality of tensor processing units implemented in the respective on-board memories; and
compile the neural network by:
dividing the respective operations of the plurality of layers into different subgraphs according to a partitioning scheme, wherein the dividing comprises:
determining a number of subgraphs greater than a number of the plurality of tensor processing units according to a non-contiguous partitioning scheme for dividing the respective operations of the plurality of layers into the different subgraphs;
evaluating different possible groupings of the different subgraphs, wherein a first one of the possible groupings includes two or more of the different subgraphs that are non-contiguous, wherein a second one of the possible groupings includes at least one of the two or more different subgraphs with another one of the different subgraphs that are contiguous; and
selecting a grouping of the different possible groupings of the different subgraphs according to a balance, determined based on the evaluating, of features of the selected grouping of the different subgraphs among the plurality of tensor processing units; and
assigning the different subgraphs to different ones of the plurality of tensor processing units according to the selected grouping of the different subgraphs; and
including, in the instructions, static allocations of portions within the determined respective capacities of the respectively dedicated caches, wherein the static allocations of portions:
instruct weight values for the respective operations of the subgraphs assigned to the plurality of tensor processing units to be loaded from a memory to the allocated portions of the respectively dedicated caches as part of executing the neural network; and
identify the static allocations of portions as read-only to prevent the weight values from being removed from or overwritten in the static allocations of portions, wherein the static allocations of portions save storage space in the respectively dedicated caches for loading other information used to execute the subgraphs.
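To make the partitioning and grouping limitations concrete, the following is a minimal sketch of how a compiler might divide layer operations into more subgraphs than there are tensor processing units, enumerate both contiguous and non-contiguous groupings, and select the grouping whose per-unit loads are most balanced. All names (Subgraph, balance_score, select_grouping) and the cost model are hypothetical assumptions for illustration; the claim does not prescribe a concrete scoring function.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Subgraph:
    index: int    # position in layer order, used to test contiguity
    cost: float   # hypothetical per-subgraph execution cost

def is_contiguous(group: Tuple[Subgraph, ...]) -> bool:
    # A group is contiguous if its subgraphs are adjacent in layer order.
    idxs = sorted(s.index for s in group)
    return all(b - a == 1 for a, b in zip(idxs, idxs[1:]))

def partitions_into_k_groups(items: List[Subgraph], k: int):
    # Enumerate every way to split `items` into k non-empty groups;
    # groups may be non-contiguous in layer order.
    if k == 1:
        yield [tuple(items)]
        return
    if len(items) < k:
        return
    first, rest = items[0], items[1:]
    # `first` starts a new group on its own...
    for smaller in partitions_into_k_groups(rest, k - 1):
        yield [(first,)] + smaller
    # ...or joins one of the k groups of a partition of the rest.
    for smaller in partitions_into_k_groups(rest, k):
        for i in range(len(smaller)):
            yield smaller[:i] + [(first,) + smaller[i]] + smaller[i + 1:]

def balance_score(grouping) -> float:
    # Lower is better: spread between the most and least loaded unit.
    loads = [sum(s.cost for s in group) for group in grouping]
    return max(loads) - min(loads)

def select_grouping(subgraphs: List[Subgraph], num_units: int):
    # Per the claim, the number of subgraphs exceeds the number of units.
    assert len(subgraphs) > num_units
    return min(partitions_into_k_groups(subgraphs, num_units), key=balance_score)

# Example: six subgraphs mapped onto three tensor processing units.
sgs = [Subgraph(i, c) for i, c in enumerate([4.0, 1.0, 3.0, 2.0, 2.0, 4.0])]
for unit, group in enumerate(select_grouping(sgs, num_units=3)):
    kind = "contiguous" if is_contiguous(group) else "non-contiguous"
    print(f"unit {unit}: subgraphs {sorted(s.index for s in group)} ({kind})")

Exhaustive enumeration is used here only because it is the simplest way to exhibit groupings of both kinds; a practical compiler would likely prune or use a heuristic search, which the claim leaves open.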
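Similarly, the static, read-only weight allocations can be sketched as a compile-time planning step: per-unit cache capacity is taken from the accelerator configuration, a region sized for the assigned subgraph's weight values is reserved and marked read-only so the weights are never evicted or overwritten, and the remainder of the dedicated cache stays free for other information used to execute the subgraphs. The capacity dictionaries, names, and offset-0 placement below are assumptions, not details from the claim.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class StaticAllocation:
    unit_id: int
    offset: int        # byte offset within the unit's dedicated cache
    size: int          # bytes reserved for the subgraph's weight values
    read_only: bool    # marks the region non-evictable and non-overwritable

def plan_static_allocations(cache_capacity: Dict[int, int],
                            weight_bytes: Dict[int, int]) -> List[StaticAllocation]:
    # One allocation per unit, placed at offset 0 of its dedicated cache.
    # Everything above `size` stays free for activations and other runtime data.
    allocations = []
    for unit_id, capacity in cache_capacity.items():
        needed = weight_bytes[unit_id]
        if needed > capacity:
            raise ValueError(
                f"unit {unit_id}: weights ({needed} B) exceed cache ({capacity} B)")
        allocations.append(
            StaticAllocation(unit_id, offset=0, size=needed, read_only=True))
    return allocations

# Example: two units, each with a 2 MiB dedicated cache.
caps = {0: 2 << 20, 1: 2 << 20}
weights = {0: 1_200_000, 1: 900_000}
for alloc in plan_static_allocations(caps, weights):
    print(alloc, "free bytes:", caps[alloc.unit_id] - alloc.size)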