US 12,443,825 B2
Neural processing unit and method of operation thereof
Jung Boo Park, Seoul (KR)
Assigned to DEEPX CO., LTD., Seongnam-si (KR)
Filed by DEEPX CO., LTD., Seongnam-si (KR)
Filed on Jan. 18, 2024, as Appl. No. 18/415,684.
Application 18/415,684 is a continuation of application No. PCT/KR2023/009803, filed on Jul. 10, 2023.
Claims priority of application No. 10-2022-0084611 (KR), filed on Jul. 8, 2022.
Prior Publication US 2024/0152738 A1, May 9, 2024
Int. Cl. G06N 3/063 (2023.01); G06N 3/045 (2023.01); G06N 3/0464 (2023.01); G06N 3/06 (2006.01); G06T 1/20 (2006.01)
CPC G06N 3/0464 (2023.01) [G06T 1/20 (2013.01)]
20 Claims
 
1. A method of operating a neural processing unit having a systolic array structure, the method comprising:
determining, by a controller, that an operation performed in a first convolution layer is a transpose convolution operation;
dividing, by the controller, a kernel used for the transpose convolution operation into a plurality of sub-kernels; and
performing a convolution operation between an input feature map and each of the plurality of sub-kernels in the first convolution layer, the convolution operation performed by each of a plurality of processing elements,
wherein each of the plurality of processing elements is configured to perform a process of reusing at least one of an output feature map, each of the plurality of sub-kernels, and the input feature map, which are values stored in a local memory of each of the plurality of processing elements, and
wherein the reusing process is performed by a first processing element of the plurality of processing elements transferring the stored values of the at least one of the output feature map, each of the plurality of sub-kernels, and the input feature map to a second processing element of the plurality of processing elements,
wherein the systolic array structure includes a plurality of structures arranged, in parallel, in correspondence to the values stored in the local memory, the stored values being used in successive convolution operations,
wherein a multiply-and-accumulate (MAC) operation mode of the plurality of processing elements corresponds to one of the plurality of structures and is switched based on a calculated MAC operation time, and
wherein the MAC operation mode includes an output stationary mode where the output feature map is reused, a weight stationary mode where each of the plurality of sub-kernels is reused, and an input stationary mode where the input feature map is reused.
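
The dividing and performing steps of claim 1 can be illustrated with a minimal software sketch. The claim does not state how the kernel is divided, so the sketch assumes a stride-based split: one sub-kernel per output phase (r, c), each convolved with the unmodified input feature map, with the results interleaved into the output. All function names are hypothetical, and the sketch does not model the systolic array, the local memories, or the switching between MAC operation modes.

```python
def transpose_conv_direct(x, k, stride):
    """Reference transposed convolution: scatter each input value, scaled by the kernel."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = [[0] * ((w - 1) * stride + kw) for _ in range((h - 1) * stride + kh)]
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    out[i * stride + a][j * stride + b] += x[i][j] * k[a][b]
    return out


def split_kernel(k, stride):
    """Assumed division rule: sub-kernel (r, c) keeps every stride-th tap starting at (r, c)."""
    subs = {}
    for r in range(stride):
        for c in range(stride):
            sub = [row[c::stride] for row in k[r::stride]]
            if sub and sub[0]:  # skip empty phases when the kernel is smaller than the stride
                subs[(r, c)] = sub
    return subs


def full_conv(x, sub):
    """Dense stride-1 convolution of x with one sub-kernel (implicit zero padding)."""
    h, w = len(x), len(x[0])
    sh, sw = len(sub), len(sub[0])
    out = [[0] * (w + sw - 1) for _ in range(h + sh - 1)]
    for i in range(h):
        for j in range(w):
            for a in range(sh):
                for b in range(sw):
                    out[i + a][j + b] += x[i][j] * sub[a][b]
    return out


def transpose_conv_via_subkernels(x, k, stride):
    """Convolve the input with each sub-kernel, then interleave the per-phase results."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = [[0] * ((w - 1) * stride + kw) for _ in range((h - 1) * stride + kh)]
    for (r, c), sub in split_kernel(k, stride).items():
        phase = full_conv(x, sub)
        for p, row in enumerate(phase):
            for q, value in enumerate(row):
                out[p * stride + r][q * stride + c] = value
    return out


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    k = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Both routes should produce the same output feature map.
    print(transpose_conv_via_subkernels(x, k, 2) == transpose_conv_direct(x, k, 2))
```

Under this assumed split, no zeros are inserted into the input feature map, so every multiplication in the sub-kernel convolutions involves a real input value and a real kernel tap.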