US 11,928,176 B2
Time domain unrolling sparse matrix multiplication system and method
Zhi-Gang Liu, Westford, MA (US); Paul Nicholas Whatmough, Cambridge, MA (US); and Matthew Mattina, Boylston, MA (US)
Assigned to Arm Limited, Cambridge (GB)
Filed by Arm Limited, Cambridge (GB)
Filed on Nov. 24, 2020, as Appl. No. 17/103,676.
Claims priority of provisional application 63/058,850, filed on Jul. 30, 2020.
Prior Publication US 2022/0035890 A1, Feb. 3, 2022
Int. Cl. G06F 17/16 (2006.01); G06F 7/544 (2006.01); G06F 9/38 (2018.01); G06F 15/80 (2006.01)
CPC G06F 17/16 (2013.01) [G06F 7/5443 (2013.01); G06F 15/80 (2013.01); G06F 9/3893 (2013.01)]
14 Claims
 
1. A system, comprising:
a processor coupled to a memory; and
a matrix multiply accelerator (MMA) having an array of processing engines (PEs) coupled to the processor, configured to:
multiply, based on a bitmap or index, a compressed first matrix and selected elements of a second matrix to generate an output matrix over a number of calculation cycles, the bitmap or index relating elements of the compressed first matrix to corresponding elements of a first matrix from which the compressed first matrix is derived, said multiply including:
for each element i,j of the output matrix, calculate a dot product of an ith row of the compressed first matrix and selected elements of a jth column of the second matrix based on the bitmap or index; or
multiply, based on the bitmap or index, selected elements of the second matrix and the compressed first matrix to generate the output matrix over a number of calculation cycles, said multiply including:
for each element i,j of the output matrix, calculate a dot product of selected elements of an ith row of the second matrix and a jth column of the compressed first matrix based on the bitmap or index,
where a PE of the array of PEs is coupled to a first vector register storing a row or column of the compressed first matrix as a first compressed vector and a second vector register storing a column or row of the second matrix as a second vector, the PE including a single multiplier circuit, a multiplexer circuit and an accumulator circuit and configured to:
for each calculation cycle:
select, by the multiplexer circuit, an element of the second vector from the second vector register based on the bitmap or index;
multiply, by the multiplier circuit, an element of the first compressed vector and the selected element of the second vector to produce an intermediate product; and
accumulate, by the accumulator circuit, the intermediate product to update a dot product of the first compressed vector and selected elements of the second vector; and
store the dot product in an output register as element i,j of the output matrix,
where each calculation cycle consumes one element of the first compressed vector.
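
The per-cycle behavior recited in claim 1 can be illustrated with a small software model. This is a minimal sketch, not the patented hardware: it assumes the compressed first matrix is formed by dropping zero elements from each row and that the bitmap marks the positions of the retained elements, neither of which the claim requires. The names compress_row, pe_dot, and sparse_matmul are hypothetical and chosen only for this illustration.

```python
def compress_row(row):
    """Compress one row of the first matrix.

    Assumption: compression drops zeros. Returns the retained values and a
    bitmap with one bit per original position (1 = retained, 0 = dropped).
    """
    bitmap = [1 if x != 0 else 0 for x in row]
    values = [x for x in row if x != 0]
    return values, bitmap


def pe_dot(values, bitmap, second_vec):
    """Model a single PE with one multiplier, one multiplexer, one accumulator.

    Each loop iteration stands for one calculation cycle and consumes exactly
    one element of the compressed vector. Returns (dot_product, cycles).
    """
    # Original positions of the retained elements, recovered from the bitmap.
    positions = [k for k, bit in enumerate(bitmap) if bit]
    acc = 0
    cycles = 0
    for value, k in zip(values, positions):
        selected = second_vec[k]    # multiplexer: select by bitmap position
        product = value * selected  # multiplier: intermediate product
        acc += product              # accumulator: update running dot product
        cycles += 1
    return acc, cycles


def sparse_matmul(first, second):
    """Compute first x second, compressing each row of `first` beforehand.

    Output element (i, j) is the dot product of compressed row i of the first
    matrix with the selected elements of column j of the second matrix.
    """
    n_cols = len(second[0])
    out = []
    for row in first:
        values, bitmap = compress_row(row)
        out_row = []
        for j in range(n_cols):
            column = [second_row[j] for second_row in second]
            dot, _cycles = pe_dot(values, bitmap, column)
            out_row.append(dot)
        out.append(out_row)
    return out
```

For example, with first = [[0, 2, 0, 3], [1, 0, 0, 0]] and second = [[1, 2], [3, 4], [5, 6], [7, 8]], sparse_matmul returns [[27, 32], [1, 2]]. In this model each output element of the first row takes two cycles and each output element of the second row takes one, matching the number of retained elements in the corresponding compressed row.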