| CPC G06F 9/30145 (2013.01) [G06F 9/30032 (2013.01); G06F 9/30036 (2013.01); G06F 9/30038 (2023.08); G06F 9/30109 (2013.01)] | 21 Claims |

|
1. A processor comprising:
a plurality of vector registers, each vector register of the plurality of vector registers to store a plurality of matrix data elements;
execution circuitry to execute an instruction to load a first plurality of source data elements from a first column and a second column of a first source tile to a first vector register of the plurality of vector registers, wherein a first subset of the first plurality of source data elements from the first column are to be interleaved with a second subset of the first plurality of source data elements from the second column within the first vector register, the first source tile comprising a group of rows and columns of a first source matrix stored in a memory;
the execution circuitry further comprising:
a set of multipliers to perform a parallel multiplication of each data element of the first plurality of source data elements stored in the first vector register with a corresponding source data element of a second plurality of source data elements stored in a second vector register of the plurality of vector registers to generate a corresponding plurality of products, the second plurality of source data elements from a second source tile of a second source matrix to be multiplied with the first source matrix; and
accumulator circuitry to add groups of the corresponding plurality of products to corresponding accumulated data elements of an accumulation matrix to generate corresponding result data elements of a result matrix.
|
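
The data flow recited in claim 1 can be illustrated with a short software model. This is a sketch for illustration only, not the claimed circuitry: plain Python lists stand in for the vector registers, and the function names `load_interleaved` and `multiply_accumulate` are hypothetical. The claim does not fix the interleave granularity or the size of the product groups, so the element-by-element interleave and the group size of two are assumptions of this example.

```python
def load_interleaved(tile, col_a, col_b):
    """Model the load: interleave two columns of a tile into one register.

    Assumption: interleaving is element by element (a0, b0, a1, b1, ...).
    """
    register = []
    for row in tile:
        register.append(row[col_a])
        register.append(row[col_b])
    return register


def multiply_accumulate(reg_a, reg_b, acc, group_size=2):
    """Model the multipliers and accumulator.

    Multiplies corresponding elements of the two registers, then adds each
    group of `group_size` adjacent products to one accumulator element.
    Assumption: groups are adjacent products and group_size defaults to 2.
    """
    if len(reg_a) != len(reg_b):
        raise ValueError("registers must hold the same number of elements")
    if len(reg_a) != group_size * len(acc):
        raise ValueError("register length must equal group_size * len(acc)")

    # Element-wise products (performed in parallel in hardware).
    products = [a * b for a, b in zip(reg_a, reg_b)]

    # Add each group of products to its accumulated element.
    return [
        acc[i] + sum(products[i * group_size:(i + 1) * group_size])
        for i in range(len(acc))
    ]


# Hypothetical 2x3 source tile; columns 0 and 2 are loaded and interleaved.
first_tile = [[1, 2, 3],
              [4, 5, 6]]
first_reg = load_interleaved(first_tile, 0, 2)    # [1, 3, 4, 6]

# Second register, standing in for elements of the second source tile.
second_reg = [1, 1, 2, 2]

# Accumulated elements before the operation.
accumulated = [10, 100]
result = multiply_accumulate(first_reg, second_reg, accumulated)  # [14, 120]
```

In this model the products are 1, 3, 8 and 12; the first pair is added to 10 and the second pair to 100, giving the result elements 14 and 120.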