| CPC G06N 3/082 (2013.01) [G06F 7/523 (2013.01); G06F 9/5027 (2013.01); G06F 17/16 (2013.01); G06N 3/084 (2013.01)] | 17 Claims |

|
1. A system, comprising:
a processor, coupled to a memory, configured to:
generate, based on an input tensor, a number of basic block matrices, each basic block matrix including a number of elements, the input tensor being an input feature map to a convolutional layer of a plurality of convolutional layers of a neural network and the elements of the basic block matrix being activation values,
generate a sequence of compressed basic block matrices including, for each basic block matrix in a forward path of the convolutional layer:
dynamically prune the elements of the basic block matrix, the dynamic pruning of the elements of the basic block matrix including selecting a number k of the largest activation values of the basic block matrix based on the magnitude of the activation values and a sparsity value,
generate a mask for the basic block matrix, each mask including a number of bits, each bit in each mask having a first value when a corresponding activation value of the basic block matrix is one of the k largest activation values of the basic block matrix, and having a second value when a corresponding activation value of the basic block matrix is not one of the k largest activation values of the basic block matrix, and
compress the basic block matrix to generate a compressed basic block matrix containing the k largest activation values of the basic block matrix; and
re-sequence each row of a weight matrix into a sequence of weight groups based on the sequences of compressed basic block matrices;
a matrix multiply accelerator (MMA), coupled to the processor and the memory, configured to:
multiply, based on the masks, the compressed basic block matrices and a weight matrix to generate an output matrix as an output feature map of the convolutional layer,
where the MMA includes:
a first register configured to store the masks and elements of a compressed basic block matrix;
a second register configured to store elements of the weight matrix;
a third register configured to store the output matrix;
an array of processing elements (PEs), coupled to the first, second and third registers, each PE including:
a first multiplexer configured to receive a weight group within the sequence of weight groups, and selectively output a first weight based on a first data selection signal;
a second multiplexer configured to receive the weight group within the sequence of weight groups, and selectively output a second weight based on a second data selection signal;
a data selection circuit, coupled to the first and second multiplexers, configured to receive a mask corresponding to a compressed basic block matrix within the sequence of compressed basic block matrices and generate the first and second data selection signals based on the mask;
a first multiplier circuit, coupled to the first multiplexer, configured to receive a first element from the compressed basic block matrix and the first weight selectively output by the first multiplexer, multiply the first element and the first weight to generate a first intermediate product, and output the first intermediate product;
a second multiplier circuit, coupled to the second multiplexer, configured to receive a second element from the compressed basic block matrix and the second weight selectively output by the second multiplexer, multiply the second element and the second weight to generate a second intermediate product, and output the second intermediate product; and
an accumulator circuit, coupled to the first and second multiplier circuits, configured to receive the first and second intermediate products and accumulate the first and second intermediate products into a value for one element of the output matrix.
|
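The data flow recited in claim 1 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed hardware: the function names (`k_from_sparsity`, `prune_and_compress`, `pe_dot`) are hypothetical, the rule k = block size × (1 − sparsity) is an assumed reading of "based on ... a sparsity value", ties are broken toward the lower index, and a weight group is assumed to be the slice of a weight row that lines up with one basic block. The loop in `pe_dot` stands in for the two multiplexer/multiplier lanes and the accumulator of a single PE when k = 2.

```python
from typing import List, Tuple


def k_from_sparsity(block_size: int, sparsity: float) -> int:
    """Assumed rule: keep the fraction (1 - sparsity) of each block."""
    return round(block_size * (1.0 - sparsity))


def prune_and_compress(block: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Top-k pruning of one flattened basic block.

    Returns (mask, compressed): mask[i] is 1 when block[i] is among the k
    largest-magnitude activations, else 0; compressed holds the kept
    activations in their original order.
    """
    # Rank positions by descending magnitude; ties go to the lower index
    # (an assumption, the claim does not specify tie-breaking).
    order = sorted(range(len(block)), key=lambda i: (-abs(block[i]), i))
    keep = set(order[:k])
    mask = [1 if i in keep else 0 for i in range(len(block))]
    compressed = [block[i] for i in range(len(block)) if mask[i]]
    return mask, compressed


def pe_dot(mask: List[int], compressed: List[float], weight_group: List[float]) -> float:
    """Software model of one processing element.

    The positions of the set mask bits play the role of the data selection
    signals: each one picks, from the weight group, the weight that matches
    a kept activation. Products are summed into one output element.
    """
    selects = [i for i, bit in enumerate(mask) if bit]  # data selection circuit
    acc = 0.0
    for element, sel in zip(compressed, selects):
        weight = weight_group[sel]  # multiplexer output
        acc += element * weight     # multiplier feeding the accumulator
    return acc


# Example: a 4-element block at 50% sparsity keeps k = 2 activations.
block = [0.1, -3.0, 2.0, 0.5]
weight_group = [1.0, 2.0, 4.0, 8.0]
k = k_from_sparsity(len(block), 0.5)
mask, compressed = prune_and_compress(block, k)
result = pe_dot(mask, compressed, weight_group)
# mask == [0, 1, 1, 0], compressed == [-3.0, 2.0], result == 2.0
```

In this model the result equals the dense dot product of the pruned block `[0, -3.0, 2.0, 0]` with the weight group, which is why only the k kept activations and their mask need to be stored and multiplied.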