| CPC G06F 9/30036 (2013.01) [G06F 9/3001 (2013.01); G06F 9/30038 (2023.08); G06F 9/3016 (2013.01); G06F 9/3802 (2013.01)] | 20 Claims |

|
1. A processor comprising:
an instruction cache to store a matrix multiplication instruction;
a plurality of vector registers to store a plurality of single-precision floating-point source data elements of a first matrix comprising M rows by N columns, a first plurality of bfloat16 floating-point source data elements of a second matrix comprising M rows and K columns, and a second plurality of bfloat16 floating-point source data elements of a third matrix comprising K rows by N columns;
decode circuitry to decode the matrix multiplication instruction, the matrix multiplication instruction including an opcode to indicate a matrix multiplication operation, a first field to indicate a first storage location associated with the plurality of single-precision floating-point source data elements, a second field to indicate a second storage location associated with the first plurality of bfloat16 floating-point source data elements, and a third field to indicate a third storage location associated with the second plurality of bfloat16 floating-point source data elements, wherein the first, second, and third storage locations are locations in the plurality of vector registers; and
execution circuitry coupled with the decode circuitry, the execution circuitry to perform operations in accordance with the matrix multiplication instruction, the execution circuitry to, for each row m of the M rows of the second matrix and each column n of the N columns of the third matrix:
generate a dot product from K bfloat16 floating-point source data elements corresponding to the row m of the second matrix and K bfloat16 floating-point source data elements corresponding to the column n of the third matrix, and
accumulate the dot product with a single-precision floating-point source data element corresponding to a row m of the M rows and a column n of the N columns of the first matrix to generate a single-precision floating-point result data element to be stored in a position of the plurality of vector registers corresponding to the row m and the column n of the first matrix.
|
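The arithmetic recited in claim 1 can be illustrated with a small software model. The sketch below is not the claimed circuitry and makes several assumptions the claim does not state: the helper names (`to_bf16`, `to_f32`, `matmul_bf16_accumulate`) are hypothetical, matrices are plain nested lists rather than vector registers, bfloat16 conversion uses round-to-nearest-even, NaN and infinity handling is ignored, and the dot product is summed in higher precision before a single rounding to single precision.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value, returned as a Python float.

    Assumption: round-to-nearest-even on the low 16 bits of the
    single-precision encoding; NaN/infinity are not treated specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even bfloat16 value
    bits = (bits + bias) & 0xFFFF0000   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def matmul_bf16_accumulate(c, a, b):
    """Model of the claimed operation: result[m][n] = c[m][n] + dot(a[m,:], b[:,n]).

    c: M x N single-precision accumulator (the "first matrix")
    a: M x K bfloat16 source (the "second matrix")
    b: K x N bfloat16 source (the "third matrix")
    """
    M, K, N = len(a), len(a[0]), len(b[0])
    result = [[0.0] * N for _ in range(M)]
    for m in range(M):          # each row m of the second matrix
        for n in range(N):      # each column n of the third matrix
            dot = 0.0
            for k in range(K):  # K bfloat16 element pairs per dot product
                dot += to_bf16(a[m][k]) * to_bf16(b[k][n])
            # Accumulate with the single-precision element at (m, n).
            # Assumption: one rounding to single precision at the end.
            result[m][n] = to_f32(to_f32(c[m][n]) + dot)
    return result


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]   # M=2, K=2
    b = [[5.0, 6.0], [7.0, 8.0]]   # K=2, N=2
    c = [[0.5, 0.5], [0.5, 0.5]]   # M=2, N=2
    print(matmul_bf16_accumulate(c, a, b))
```

The three nested loops mirror the claim language directly: the outer two iterate over "each row m" and "each column n", and the inner loop forms the K-element dot product that is then accumulated into the corresponding single-precision element.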