US 11,907,827 B2
Schedule-aware tensor distribution module
Gautham Chinya, Sunnyvale, CA (US); Huichu Liu, Santa Clara, CA (US); Arnab Raha, Santa Clara, CA (US); Debabrata Mohapatra, Santa Clara, CA (US); Cormac Brick, San Francisco, CA (US); and Lance Hacking, Spanish Fork, UT (US)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Jun. 28, 2019, as Appl. No. 16/456,707.
Prior Publication US 2020/0410327 A1, Dec. 31, 2020
Int. Cl. G06N 3/063 (2023.01); G06N 5/04 (2023.01); G06F 9/448 (2018.01); G06F 9/38 (2018.01); G06F 9/50 (2006.01)
CPC G06N 3/063 (2013.01) [G06F 9/3814 (2013.01); G06F 9/3877 (2013.01); G06F 9/4498 (2018.02); G06F 9/5027 (2013.01); G06N 5/04 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A neural network accelerator, comprising:
a memory;
a plurality of processing engines coupled together and configured to perform arithmetic operations in support of an inference performed using the neural network accelerator, wherein the plurality of processing engines are implemented using a plurality of processing elements; and
a schedule-aware tensor data distribution circuitry configured to:
load tensor data into the plurality of processing engines in a load phase;
extract output data from the plurality of processing engines in an extraction phase, wherein extracting the output data from the plurality of processing engines may be performed in a row-wise or column-wise organization;
reorganize the extracted output data based at least in part on a schedule for a next layer after a current layer to output the extracted output data by reshaping the extracted output data for storage in the memory to reduce a number of accesses to the memory for the next layer by changing a shape of how the extracted output data is stored in the memory based at least in part on a type of the next layer,
wherein changing the shape comprises reorganizing the extracted output data to a column-wise organization from the row-wise organization or to the row-wise organization from the column-wise organization based on a specification of the next layer; and
store the reorganized extracted output data to the memory.
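The reorganization the claim recites (converting extracted output between a row-wise and a column-wise organization according to the next layer's schedule, before writing it back to memory) can be illustrated with a small sketch. This is not the patented circuitry or Intel's implementation; it is a hypothetical software model in which the function names, the flat-buffer layout, and the "row"/"col" schedule labels are all assumptions introduced for illustration.

```python
# Hypothetical model of the claimed reorganization step: output data
# extracted from a rows x cols grid of processing elements is stored as a
# flat buffer, and is transposed between row-wise and column-wise layout
# when the next layer's schedule specifies the other organization.

def extract_row_wise(pe_grid):
    """Model row-wise extraction: flatten PE outputs row by row."""
    return [v for row in pe_grid for v in row]

def reorganize(flat, rows, cols, current_org, next_layer_org):
    """Reshape the extracted data into the organization the next layer's
    schedule expects, so the next layer can read it with fewer, more
    contiguous memory accesses."""
    if current_org == next_layer_org:
        return list(flat)  # already in the desired shape
    out = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            if current_org == "row":       # row-wise -> column-wise
                out[c * rows + r] = flat[r * cols + c]
            else:                          # column-wise -> row-wise
                out[r * cols + c] = flat[c * rows + r]
    return out

# Example: a 2x3 grid of PE outputs, extracted row-wise, then reorganized
# because the (assumed) schedule for the next layer specifies column-wise.
pe_outputs = [[1, 2, 3],
              [4, 5, 6]]
flat = extract_row_wise(pe_outputs)           # [1, 2, 3, 4, 5, 6]
stored = reorganize(flat, 2, 3, "row", "col")  # [1, 4, 2, 5, 3, 6]
```

In this sketch the transpose is the "change of shape": the same values are written to memory in the order the next layer will consume them, which is the access-reduction rationale the claim states.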