US 12,260,253 B2
Layout-based data transfer between synchronized, interconnected processing elements for implementing machine learning networks
Gwangho Kim, San Jose, CA (US)
Assigned to SiMa Technologies, Inc., San Jose, CA (US)
Filed by SiMa Technologies, Inc., San Jose, CA (US)
Filed on Jan. 23, 2023, as Appl. No. 18/158,447.
Prior Publication US 2024/0248760 A1, Jul. 25, 2024
Int. Cl. G06F 9/44 (2018.01); G06F 9/50 (2006.01); G06F 15/80 (2006.01)
CPC G06F 9/5027 (2013.01) [G06F 15/80 (2013.01)] 20 Claims
OG exemplary drawing (figure not reproduced)
1. A method for implementing a machine learning network (MLN) on a machine learning accelerator (MLA), the MLA implemented on a semiconductor die and comprising a plurality of hardware processing elements (PEs) connected by data transfer paths, the method comprising:
  receiving a description of the MLN, the description including calculation of an output tensor as a weighted sum of an input tensor;
  partitioning the input tensor into input slices and allocating each input slice to one of the PEs to form logical rows of PEs that correspond to either rows or columns of the input slices in the input tensor, wherein the PEs in each logical row are connected by data transfer paths and the logical rows are also connected by data transfer paths;
  partitioning the output tensor into output slices and allocating each output slice to one of the PEs to form logical rows of PEs that correspond to either rows or columns of the output slices in the output tensor, wherein the PEs in each logical row are connected by data transfer paths and the logical rows are also connected by data transfer paths, and each output slice is calculated as a weighted sum of a support of input slices;
  executing a set of instructions that implement calculation of the output tensor on the MLA, comprising:
    for output slices that have supporting input slices in a same logical row:
      executing concurrent instructions for intra-row shifts to transfer data from the supporting input slices to the output slices; and
    for output slices that have supporting input slices in a different logical row:
      executing concurrent instructions for inter-row transfers to transfer data from the supporting input slices to the output slices.
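The data-movement scheme of claim 1 can be illustrated with a minimal sketch. The sketch below is not from the patent; all names (`pe_of`, `run`, the grid dimensions, and the weight vector) are hypothetical, and it models the layout only in software: an input tensor is partitioned into slices allocated across a logical grid of PEs, each output slice is computed as a weighted sum of a support of input slices, and each data transfer is classified as an intra-row shift (source and destination PEs in the same logical row) or an inter-row transfer (different logical rows).

```python
# Hedged, illustrative sketch of layout-based data transfer between PEs.
# Not the patented implementation; PE grid shape and mapping are assumptions.
import numpy as np

ROWS, COLS = 2, 4      # logical PE grid: 2 logical rows x 4 PEs per row
SLICE = 3              # elements per input/output slice

def pe_of(slice_idx):
    """Map a slice index to the (logical row, column) of its owning PE."""
    return divmod(slice_idx, COLS)

def run(x, w):
    """Compute output slice o[i] = sum_k w[k] * x_slice[i+k], counting
    intra-row shifts vs. inter-row transfers for the supporting slices."""
    n_slices = ROWS * COLS
    slices = x.reshape(n_slices, SLICE)      # partition the input tensor
    out = np.zeros((n_slices, SLICE))
    intra = inter = 0
    for i in range(n_slices):                # output slice i lives on pe_of(i)
        dst = pe_of(i)
        for k, wk in enumerate(w):
            j = i + k                        # index of a supporting input slice
            if j >= n_slices:
                continue                     # no padding past the boundary
            src = pe_of(j)
            if src == dst:
                pass                         # same PE: local, no transfer
            elif src[0] == dst[0]:
                intra += 1                   # intra-row shift along the row
            else:
                inter += 1                   # inter-row transfer across rows
            out[i] += wk * slices[j]
    return out, intra, inter
```

With this assumed 2x4 grid and a two-tap weight vector, most supports stay within a logical row (handled by concurrent intra-row shifts), and only the output slice at a row boundary needs an inter-row transfer; a compiler laying out slices this way can therefore issue the bulk of the transfers as concurrent intra-row instructions.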