US 11,941,395 B2
	Apparatuses, methods, and systems for instructions for 16-bit floating-point matrix dot product instructions
Alexander F. Heinecke, San Jose, CA (US); Robert Valentine, Kiryat Tivon (IL); Mark J. Charney, Lexington, MA (US); Menachem Adelman, Haifa (IL); Christopher J. Hughes, Santa Clara, CA (US); Evangelos Georganas, San Mateo, CA (US); Zeev Sperber, Zichron Yackov (IL); Amit Gradstein, Binyamina (IL); and Simon Rubanovich, Haifa (IL)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Dec. 24, 2020, as Appl. No. 17/134,008.
Claims priority of provisional application 63/083,908, filed on Sep. 26, 2020.
Prior Publication US 2022/0100502 A1, Mar. 31, 2022
Int. Cl. G06F 9/30 (2018.01); G06F 7/544 (2006.01); G06F 9/38 (2018.01); G06F 17/16 (2006.01); G06N 3/08 (2023.01)

CPC G06F 9/3001 (2013.01) [G06F 7/5443 (2013.01); G06F 9/30145 (2013.01); G06F 9/3802 (2013.01); G06F 17/16 (2013.01); G06N 3/08 (2013.01)]

24 Claims

1. An apparatus comprising:

fetch circuitry to fetch a single instruction having fields to specify an opcode and locations of an M by N destination matrix having single-precision elements, an M by K first source matrix, and a K by N second source matrix, the source matrices having elements that each comprise a pair of half-precision floating-point values, the opcode to indicate execution circuitry is to cause, for each element of the first source matrix and corresponding element of the second source matrix, a conversion of the half-precision floating-point values to single-precision values, a multiplication of converted single-precision values from first values of the pairs together to generate a first result, a multiplication of converted single-precision values from second values of the pairs together to generate a second result, and an accumulation of the first result and the second result with previous contents of a corresponding element of the destination matrix;

decode circuitry to decode the fetched instruction; and

the execution circuitry to respond to the decoded instruction as specified by the opcode.
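The arithmetic recited in claim 1 can be illustrated with a small software model. This is a minimal sketch and not the claimed circuitry. It rests on several assumptions the claim does not state. Each source element is modeled as a 32-bit word whose low 16 bits hold the "first" half-precision value and whose high 16 bits hold the "second". Element (i, p) of the first source corresponds to element (p, j) of the second source, in the usual matrix-multiply pattern. Accumulation adds the first result and then the second result, rounding to single precision after each addition. The names `pack_pair`, `unpack_pair`, `to_f32`, and `fp16_pair_matmul` are hypothetical.

```python
import struct


def pack_pair(lo: float, hi: float) -> int:
    """Pack two half-precision values into one 32-bit source element.

    Assumption: the first value occupies bits 15:0, the second bits 31:16.
    """
    return struct.unpack('<I', struct.pack('<2e', lo, hi))[0]


def unpack_pair(word: int) -> tuple:
    """Split a 32-bit source element into its (first, second) half-precision values."""
    return struct.unpack('<2e', struct.pack('<I', word))


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


def fp16_pair_matmul(c, a, b):
    """In place: C (M x N, single precision) += A (M x K) . B (K x N).

    Every element of A and B is a 32-bit word holding a pair of FP16 values.
    Hypothetical model; accumulation order and rounding are assumptions.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "A must be M x K"
    assert all(len(row) == n for row in b), "B must be K x N"
    assert len(c) == m and all(len(row) == n for row in c), "C must be M x N"
    for i in range(m):
        for j in range(n):
            acc = to_f32(c[i][j])  # previous destination contents
            for p in range(k):
                a0, a1 = unpack_pair(a[i][p])
                b0, b1 = unpack_pair(b[p][j])
                # FP16 -> FP32 conversion is exact (every FP16 value fits in FP32).
                a0, a1, b0, b1 = to_f32(a0), to_f32(a1), to_f32(b0), to_f32(b1)
                first = to_f32(a0 * b0)    # product of the first values
                second = to_f32(a1 * b1)   # product of the second values
                acc = to_f32(acc + first)  # accumulate, rounding to FP32
                acc = to_f32(acc + second)
            c[i][j] = acc
    return c


if __name__ == '__main__':
    a = [[pack_pair(1.0, 2.0), pack_pair(3.0, 4.0)]]     # 1 x 2
    b = [[pack_pair(1.0, 1.0)], [pack_pair(0.5, 0.5)]]   # 2 x 1
    print(fp16_pair_matmul([[0.0]], a, b))  # (1*1 + 2*1) + (3*0.5 + 4*0.5)
```

In this model, rounding each product to single precision never changes its value. A half-precision significand has 11 bits, so a product of two such values needs at most 22 bits, and those fit in the 24-bit single-precision significand. Only the accumulation steps can round. For that reason, the assumed order of the additions is the one modeling choice that can affect results.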