US 12,307,250 B2
Systems and methods for performing 16-bit floating-point matrix dot product instructions
Alexander F. Heinecke, San Jose, CA (US); Robert Valentine, Kiryat Tivon (IL); Mark J. Charney, Lexington, MA (US); Raanan Sade, Portland, OR (US); Menachem Adelman, Modi'in (IL); Zeev Sperber, Zichron Yackov (IL); Amit Gradstein, Binyamina (IL); and Simon Rubanovich, Haifa (IL)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Dec. 27, 2023, as Appl. No. 18/397,664.
Application 18/397,664 is a continuation of application No. 18/190,761, filed on Mar. 27, 2023, granted, now 11,893,389.
Application 18/190,761 is a continuation of application No. 17/216,566, filed on Mar. 29, 2021, granted, now 11,614,936, issued on Mar. 28, 2023.
Application 17/216,566 is a continuation of application No. 16/186,387, filed on Nov. 9, 2018, granted, now 10,963,246, issued on Mar. 30, 2021.
Prior Publication US 2024/0126545 A1, Apr. 18, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/30 (2018.01); G06F 9/38 (2018.01)

CPC G06F 9/30036 (2013.01) [G06F 9/3001 (2013.01); G06F 9/30038 (2023.08); G06F 9/3016 (2013.01); G06F 9/3802 (2013.01)]

20 Claims

1. A processor comprising:

an instruction cache to store a matrix multiplication instruction;

a plurality of vector registers to store a plurality of single-precision floating point source data elements of a first matrix comprising M rows by N columns, a first plurality of bfloat16 floating-point source data elements of a second matrix comprising M rows and K columns, and a second plurality of bfloat16 floating point source data elements of a third matrix comprising K rows by N columns;

decode circuitry to decode the matrix multiplication instruction, the matrix multiplication instruction including an opcode to indicate a matrix multiplication operation, a first field to indicate a first storage location associated with the plurality of single-precision floating point source data elements, a second field to indicate a second storage location associated with the first plurality of bfloat16 floating-point source data elements, and a third field to indicate a third storage location associated with the second plurality of bfloat16 floating point source data elements, wherein the first, second, and third storage locations are locations in the plurality of vector registers; and

execution circuitry coupled with the decode circuitry, the execution circuitry to perform operations in accordance with the matrix multiplication instruction, the execution circuitry to, for each row m of the M rows of the second matrix and each column n of the N columns of the third matrix:

generate a dot product from K bfloat16 floating-point source data elements corresponding to the row m of the second matrix and K bfloat16 floating-point source data elements corresponding to the column n of the third matrix, and

accumulate the dot product with a single precision floating-point source data element corresponding to a row m of the M rows and a column n of the N columns of the first matrix to generate a single-precision floating-point result data element to be stored in a position of the plurality of vector registers corresponding to the row m and the column n of the first matrix.
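The arithmetic recited in claim 1 can be illustrated with a short software sketch. It is not the patented circuitry or any particular instruction's defined behavior. It only models the claimed data flow: for each row m and column n, a dot product over K bfloat16 elements is accumulated into a single-precision element of the first matrix. The function names, the list-of-lists layout, the round-to-nearest-even conversion, and the order and precision of the intermediate sums are all assumptions made for illustration. The claim does not specify them.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x.

    Assumption: round-to-nearest-even on the float32 encoding;
    NaN handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_to_float(h: int) -> float:
    """Widen a bfloat16 bit pattern to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def to_f32(x: float) -> float:
    """Round a Python float (double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16_matmul_accumulate(c, a_bits, b_bits):
    """Hypothetical helper: return C + A x B.

    c      : M x N single-precision values (first matrix)
    a_bits : M x K bfloat16 bit patterns (second matrix)
    b_bits : K x N bfloat16 bit patterns (third matrix)
    """
    M, K, N = len(a_bits), len(b_bits), len(b_bits[0])
    out = [row[:] for row in c]
    for m in range(M):
        for n in range(N):
            dot = 0.0
            for k in range(K):
                prod = bf16_to_float(a_bits[m][k]) * bf16_to_float(b_bits[k][n])
                # Assumption: partial sums are kept in single precision.
                dot = to_f32(dot + prod)
            # Accumulate the dot product with the element at (m, n).
            out[m][n] = to_f32(out[m][n] + dot)
    return out


# Usage: M = K = N = 2, with values that bfloat16 represents exactly.
A = [[to_bf16_bits(v) for v in row] for row in [[1.0, 2.0], [3.0, 4.0]]]
B = [[to_bf16_bits(v) for v in row] for row in [[5.0, 6.0], [7.0, 8.0]]]
C = [[0.5, 0.0], [0.0, -1.0]]
result = bf16_matmul_accumulate(C, A, B)
```

In this sketch the result has the same M by N shape as the first matrix, mirroring the claim's statement that each result element is stored at the position corresponding to row m and column n of the first matrix.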