CPC G06F 9/30036 (2013.01) [G06F 9/3001 (2013.01); G06F 9/3016 (2013.01); G06F 9/3802 (2013.01)] | 24 Claims |
1. A processor comprising:
a general-purpose central processing unit (CPU) core, comprising:
a control register to specify a round mode;
fetch circuitry to fetch an instruction;
decode circuitry to decode the instruction, the instruction having a first field to specify a first storage location of a plurality of data elements corresponding to a first matrix having M rows by N columns of 32-bit single precision floating-point data elements, a second field to specify a second storage location of a plurality of data elements corresponding to a second matrix having M rows by K columns of 16-bit floating-point data elements having a bfloat16 format, and a third field to specify a third storage location of a plurality of data elements corresponding to a third matrix having K rows by N columns of 16-bit floating-point data elements having the bfloat16 format; and
execution circuitry coupled with the decode circuitry, the execution circuitry to perform operations corresponding to the instruction to, for each row m of the M rows of the second matrix, and for each column n of the N columns of the third matrix:
convert K 16-bit floating-point data elements corresponding to the row m of the second matrix to K 32-bit floating-point data elements, and convert K 16-bit floating-point data elements corresponding to the column n of the third matrix to K 32-bit floating-point data elements;
generate two dot products from the K 32-bit floating-point data elements corresponding to the row m of the second matrix and the K 32-bit floating-point data elements corresponding to the column n of the third matrix, including performing floating point rounding;
accumulate the two dot products with a 32-bit floating-point data element corresponding to a row m of the M rows, and corresponding to a column n of the N columns, of the first matrix, including performing floating point rounding, to generate a result 32-bit single precision floating-point data element; and
store the result 32-bit floating-point data element in a position of the first storage location corresponding to the row m and the column n of the first matrix.
|
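For illustration only, and not as part of the claim, the per-element operations recited in claim 1 can be sketched in Python. The sketch rests on several assumptions that the claim text does not state: the function names are hypothetical; the "two dot products" are taken to be the sums over even-indexed and odd-indexed k positions; rounding is applied after every multiply and add; and only round-to-nearest is modeled, whereas the claim recites a control register that specifies the round mode. Matrices are nested lists, and bfloat16 elements are given as 16-bit integer bit patterns.

```python
import struct


def bf16_to_f32(bits: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def f32_round(x: float) -> float:
    """Round a Python float (binary64) to binary32 (round-to-nearest only).

    struct.pack raises OverflowError for finite values outside float32 range;
    this sketch does not handle that case.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16_matmul_accumulate(c, a_bits, b_bits):
    """Update c (M x N, float32 values) in place with a (M x K) times b (K x N).

    a_bits and b_bits hold bfloat16 bit patterns. Hypothetical helper that
    follows the convert / dot-product / accumulate / store order of the claim.
    """
    M, K, N = len(a_bits), len(a_bits[0]), len(b_bits[0])
    for m in range(M):
        for n in range(N):
            # Convert the K elements of row m and of column n to 32-bit values.
            row = [bf16_to_f32(a_bits[m][k]) for k in range(K)]
            col = [bf16_to_f32(b_bits[k][n]) for k in range(K)]
            # Assumed split: dots[0] sums even k, dots[1] sums odd k.
            dots = [0.0, 0.0]
            for k in range(K):
                prod = f32_round(row[k] * col[k])
                dots[k % 2] = f32_round(dots[k % 2] + prod)
            # Accumulate both dot products with the existing element of c.
            acc = f32_round(c[m][n] + dots[0])
            acc = f32_round(acc + dots[1])
            # Store the result at position (m, n) of the first matrix.
            c[m][n] = acc
    return c
```

As a usage example, with a 1 x 2 second matrix holding 1.0 and 2.0 (bit patterns 0x3F80, 0x4000), a 2 x 1 third matrix holding 3.0 and 4.0 (0x4040, 0x4080), and a first matrix holding 0.5, the call `bf16_matmul_accumulate([[0.5]], [[0x3F80, 0x4000]], [[0x4040], [0x4080]])` returns `[[11.5]]`, that is 0.5 + 1.0 * 3.0 + 2.0 * 4.0.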