CPC G06F 9/5027 (2013.01) [G06F 15/80 (2013.01)] — 20 Claims

1. A method for implementing a machine learning network (MLN) on a machine learning accelerator (MLA), the MLA implemented on a semiconductor die and comprising a plurality of hardware processing elements (PEs) connected by data transfer paths, the method comprising:
receiving a description of the MLN, the description including calculation of an output tensor as a weighted sum of an input tensor;
partitioning the input tensor into input slices and allocating each input slice to one of the PEs to form logical rows of PEs that correspond to either rows or columns of the input slices in the input tensor, wherein the PEs in each logical row are connected by data transfer paths and the logical rows are also connected by data transfer paths;
partitioning the output tensor into output slices and allocating each output slice to one of the PEs to form logical rows of PEs that correspond to either rows or columns of the output slices in the output tensor, wherein the PEs in each logical row are connected by data transfer paths and the logical rows are also connected by data transfer paths, and each output slice is calculated as a weighted sum of a support of input slices;
executing a set of instructions that implement calculation of the output tensor on the MLA, comprising:
for output slices that have supporting input slices in a same logical row:
executing concurrent instructions for intra-row shifts to transfer data from the supporting input slices to the output slices; and
for output slices that have supporting input slices in a different logical row:
executing concurrent instructions for inter-row transfers to transfer data from the supporting input slices to the output slices.
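The partitioning and data-movement scheme recited above can be sketched in plain Python. This is an illustrative model only: the 2×2 logical grid, the "self plus 4-neighbors" support, the uniform weight, and all function names are assumptions for clarity, not details taken from the claim.

```python
import numpy as np

ROWS, COLS = 2, 2  # assumed logical grid of processing elements (PEs)

def partition(tensor, rows, cols):
    """Partition a 2-D tensor into rows x cols slices, one per PE."""
    h, w = tensor.shape[0] // rows, tensor.shape[1] // cols
    return {(r, c): tensor[r*h:(r+1)*h, c*w:(c+1)*w]
            for r in range(rows) for c in range(cols)}

def support(r, c):
    """Input slices supporting output slice (r, c).
    Assumed here to be the slice itself plus its 4-neighbors."""
    cand = [(r, c), (r, c-1), (r, c+1), (r-1, c), (r+1, c)]
    return [(i, j) for i, j in cand if 0 <= i < ROWS and 0 <= j < COLS]

def execute(input_slices, weight=0.5):
    """Compute each output slice as a weighted sum of its support,
    counting each data movement as an intra-row shift (supporting
    slice in the same logical row) or an inter-row transfer."""
    outputs, intra, inter = {}, 0, 0
    for (r, c) in input_slices:
        acc = np.zeros_like(input_slices[(r, c)])
        for (i, j) in support(r, c):
            if i == r and j != c:
                intra += 1   # intra-row shift along the row's data path
            elif i != r:
                inter += 1   # inter-row transfer between logical rows
            acc += weight * input_slices[(i, j)]
        outputs[(r, c)] = acc
    return outputs, intra, inter

x = np.arange(16, dtype=float).reshape(4, 4)
slices = partition(x, ROWS, COLS)
outputs, intra_shifts, inter_transfers = execute(slices)
# In a 2x2 grid each PE has one same-row and one other-row neighbor,
# so 4 intra-row shifts and 4 inter-row transfers occur in total.
print(intra_shifts, inter_transfers)
```

In an actual MLA the shifts and transfers would be concurrent instructions moving slice data between PEs over the die's data transfer paths; the counters here merely classify which movements would use which kind of path.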