US 12,437,199 B2
Activation compression method for deep learning acceleration
Zhi-Gang Liu, Westford, MA (US); and Matthew Mattina, Boylston, MA (US)
Assigned to Arm Limited, Cambridge (GB)
Filed by Arm Limited, Cambridge (GB)
Filed on Jan. 25, 2021, as Appl. No. 17/157,319.
Claims priority of provisional application 63/117,728, filed on Nov. 24, 2020.
Prior Publication US 2022/0164663 A1, May 26, 2022
Int. Cl. G06N 3/082 (2023.01); G06F 7/523 (2006.01); G06F 9/50 (2006.01); G06F 17/16 (2006.01); G06N 3/084 (2023.01)
CPC G06N 3/082 (2013.01) [G06F 7/523 (2013.01); G06F 9/5027 (2013.01); G06F 17/16 (2013.01); G06N 3/084 (2013.01)] 17 Claims
OG exemplary drawing
 
1. A system, comprising:
a processor, coupled to a memory, configured to:
generate, based on an input tensor, a number of basic block matrices, each basic block matrix including a number of elements, the input tensor being an input feature map to a convolutional layer of a plurality of convolutional layers of a neural network and the elements of the basic block matrix being activation values,
generate a sequence of compressed basic block matrices including, for each basic block matrix in a forward path of the convolutional layer:
dynamically prune the elements of the basic block matrix, the dynamic pruning of the elements of the basic block matrix including selecting a number k of the largest activation values of the basic block matrix based on the magnitude of the activation values and a sparsity value,
generate a mask for the basic block matrix, each mask including a number of bits, each bit in each mask having a first value when a corresponding activation value of the basic block matrix is one of the k largest activation values of the basic block matrix, and having a second value when a corresponding activation value of the basic block matrix is not one of the k largest activation values of the basic block matrix, and
compress the basic block matrix to generate a compressed basic block matrix containing the k largest activation values of the basic block matrix; and
re-sequence each row of a weight matrix into a sequence of weight groups based on the sequence of compressed basic block matrices;
a matrix multiply accelerator (MMA), coupled to the processor and the memory, configured to:
multiply, based on the masks, the compressed basic block matrices and the weight matrix to generate an output matrix as an output feature map of the convolutional layer,
where the MMA includes:
a first register configured to store the masks and elements of a compressed basic block matrix;
a second register configured to store elements of the weight matrix;
a third register configured to store the output matrix;
an array of processing elements (PEs), coupled to the first, second and third registers, each PE including:
a first multiplexer configured to receive a weight group within the sequence of weight groups, and selectively output a first weight based on a first data selection signal;
a second multiplexer configured to receive the weight group within the sequence of weight groups, and selectively output a second weight based on a second data selection signal;
a data selection circuit, coupled to the first and second multiplexers, configured to receive a mask corresponding to a compressed basic block matrix within the sequence of compressed basic block matrices and generate the first and second data selection signals based on the mask;
a first multiplier circuit, coupled to the first multiplexer, configured to receive a first element from the compressed basic block matrix and the first weight selectively output by the first multiplexer, multiply the first element and the first weight to generate a first intermediate product, and output the first intermediate product;
a second multiplier circuit, coupled to the second multiplexer, configured to receive a second element from the compressed basic block matrix and the second weight selectively output by the second multiplexer, multiply the second element and the second weight to generate a second intermediate product, and output the second intermediate product; and
an accumulator circuit, coupled to the first and second multiplier circuits, configured to receive the first and second intermediate products and accumulate the first and second intermediate products into a value for one element of the output matrix.
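The forward-path compression recited in the claim (dynamic top-k pruning, bitmask generation, and packing of the surviving activations) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; the function and variable names are the author's own, and the block is treated as a flat vector of activation values for simplicity.

```python
import numpy as np

def prune_and_compress(block, k):
    """Dynamically prune a basic block matrix of activations:
    keep the k largest values by magnitude, emit a one-bit-per-element
    mask (first value = kept, second value = pruned), and pack the
    k surviving activations into a compressed block."""
    flat = block.ravel()
    # Indices of the k largest-magnitude activations (the sparsity
    # value would determine k in the claimed system).
    keep = np.argsort(-np.abs(flat))[:k]
    mask = np.zeros(flat.size, dtype=np.uint8)
    mask[keep] = 1
    # Surviving activations, kept in their original order.
    compressed = flat[mask.astype(bool)]
    return mask, compressed
```

For example, pruning the block `[0.1, -3.0, 2.0, 0.5]` with `k = 2` keeps `-3.0` and `2.0`, yielding the mask `[0, 1, 1, 0]` and the compressed block `[-3.0, 2.0]`.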
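The role of the mask inside each MMA processing element can likewise be sketched in software. In the claimed PE, the data selection circuit decodes the mask into selection signals so the multiplexers feed each multiplier the weight that sits at a surviving activation's original position, and the accumulator sums the products; pruned positions are simply skipped. The loop below is a hedged functional model of that behavior for one output element (the claim's hardware processes two activation/weight pairs per PE per step), with illustrative names.

```python
def masked_dot(mask, compressed, weight_row):
    """Functional model of one PE's work for one output element:
    the mask steers weight selection (the multiplexer role), each
    surviving activation is multiplied by the weight at its original
    position, and the products are accumulated."""
    acc = 0.0
    j = 0                    # read position in the compressed block
    for i, bit in enumerate(mask):
        if bit:              # selection signal: take weight_row[i]
            acc += compressed[j] * weight_row[i]
            j += 1
    return acc
```

Continuing the earlier numbers, `masked_dot([0, 1, 1, 0], [-3.0, 2.0], [10.0, 1.0, 2.0, 3.0])` computes `(-3.0)(1.0) + (2.0)(2.0) = 1.0`; the weights at the pruned positions never enter the accumulation, which is what makes re-sequencing the weight rows into groups matched to the compressed sequence worthwhile.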