US 11,893,389 B2
Systems and methods for performing 16-bit floating-point matrix dot product instructions
Alexander F. Heinecke, San Jose, CA (US); Robert Valentine, Kiryat Tivon (IL); Mark J. Charney, Lexington, MA (US); Raanan Sade, Kibutz Sarid (IL); Menachem Adelman, Modi'in (IL); Zeev Sperber, Zichron Yackov (IL); Amit Gradstein, Biyamina (IL); and Simon Rubanovich, Haifa (IL)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Mar. 27, 2023, as Appl. No. 18/190,761.
Application 18/190,761 is a continuation of application No. 17/216,566, filed on Mar. 29, 2021, granted, now Pat. No. 11,614,936.
Application 17/216,566 is a continuation of application No. 16/186,387, filed on Nov. 9, 2018, granted, now Pat. No. 10,963,246, issued on Mar. 30, 2021.
Prior Publication US 2023/0236834 A1, Jul. 27, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/30 (2018.01); G06F 9/38 (2018.01)
CPC G06F 9/30036 (2013.01) [G06F 9/3001 (2013.01); G06F 9/3016 (2013.01); G06F 9/3802 (2013.01)] 24 Claims
OG exemplary drawing
 
1. A processor comprising:
a general-purpose central processing unit (CPU) core, comprising:
a control register to specify a round mode;
fetch circuitry to fetch an instruction;
decode circuitry to decode the instruction, the instruction having a first field to specify a first storage location of a plurality of data elements corresponding to a first matrix having M rows by N columns of 32-bit single precision floating-point data elements, a second field to specify a second storage location of a plurality of data elements corresponding to a second matrix having M rows by K columns of 16-bit floating-point data elements having a bfloat16 format, and a third field to specify a third storage location of a plurality of data elements corresponding to a third matrix having K rows by N columns of 16-bit floating-point data elements having the bfloat16 format; and
execution circuitry coupled with the decode circuitry, the execution circuitry to perform operations corresponding to the instruction to, for each row m of the M rows of the second matrix, and for each column n of the N columns of the third matrix:
convert K 16-bit floating-point data elements corresponding to the row m of the second matrix to K 32-bit floating-point data elements, and convert K 16-bit floating-point data elements corresponding to the column n of the third matrix to K 32-bit floating-point data elements;
generate two dot products from the K 32-bit floating-point data elements corresponding to the row m of the second matrix and the K 32-bit floating-point data elements corresponding to the column n of the third matrix, including performing floating-point rounding;
accumulate the two dot products with a 32-bit floating-point data element corresponding to a row m of the M rows, and corresponding to a column n of the N columns, of the first matrix, including performing floating-point rounding, to generate a result 32-bit single precision floating-point data element; and
store the result 32-bit floating-point data element in a position of the first storage location corresponding to the row m and the column n of the first matrix.
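The per-element operation recited in claim 1 can be modeled with a short sketch. The Python below is an illustration only, not the patented circuitry: the helper names (`fp32_to_bf16`, `bf16_to_fp32`, `bf16_dot_product_accumulate`) are hypothetical, the grouping of the K products into the two recited dot products is modeled here as an even/odd index split (an assumption; the claim does not fix the grouping), and NumPy's default round-to-nearest-even stands in for the round mode specified by the control register.

```python
import numpy as np

def fp32_to_bf16(x):
    # bfloat16 keeps the sign, 8-bit exponent, and top 7 mantissa bits of
    # float32; here we truncate by taking the high 16 bits of the pattern.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return ((bits >> 16) & 0xFFFF).astype(np.uint16)

def bf16_to_fp32(b):
    # Widen losslessly: place the 16 stored bits into the high half of a
    # float32 bit pattern (the conversion step recited in the claim).
    return (np.asarray(b, dtype=np.uint32) << 16).view(np.float32)

def bf16_dot_product_accumulate(C, A_bf16, B_bf16):
    """Model of the claimed operation: for an M x N fp32 matrix C, an
    M x K bf16 matrix A, and a K x N bf16 matrix B, compute
    C[m, n] += dot(A[m, :], B[:, n]) with all arithmetic in fp32."""
    M, K = A_bf16.shape
    K2, N = B_bf16.shape
    assert K == K2 and C.shape == (M, N)
    A32 = bf16_to_fp32(A_bf16)   # convert row operands to fp32
    B32 = bf16_to_fp32(B_bf16)   # convert column operands to fp32
    out = C.astype(np.float32).copy()
    for m in range(M):
        for n in range(N):
            prods = (A32[m, :] * B32[:, n]).astype(np.float32)
            # "two dot products": modeled as even- and odd-indexed halves
            # of the K products (an assumed grouping), each summed in fp32
            dot_even = prods[0::2].sum(dtype=np.float32)
            dot_odd = prods[1::2].sum(dtype=np.float32)
            # accumulate both with the existing fp32 element of C
            out[m, n] = np.float32(out[m, n] + dot_even + dot_odd)
    return out
```

Truncation is used in `fp32_to_bf16` for brevity; a hardware implementation may instead round to nearest even when narrowing, and each fp32 add and multiply above implicitly rounds per NumPy's IEEE 754 defaults rather than per a programmable round mode.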