US 12,405,770 B2
Matrix transpose and multiply
Menachem Adelman, Modi'in (IL); Robert Valentine, Kiryat Tivon (IL); Barukh Ziv, Haifa (IL); Amit Gradstein, Binyamina (IL); Simon Rubanovich, Haifa (IL); Zeev Sperber, Zichron Yackov (IL); Mark J. Charney, Lexington, MA (US); Christopher J. Hughes, Santa Clara, CA (US); Alexander F. Heinecke, San Jose, CA (US); Evangelos Georganas, San Jose, CA (US); and Binh Pham, Burlingame, CA (US)
Assigned to Intel Corporation, Santa Clara, CA (US)
Filed by Intel Corporation, Santa Clara, CA (US)
Filed on Mar. 15, 2024, as Appl. No. 18/607,024.
Application 18/607,024 is a continuation of application No. 16/914,318, filed on Jun. 27, 2020, granted, now 11,972,230.
Prior Publication US 2024/0329938 A1, Oct. 3, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 7/78 (2006.01); G06F 9/30 (2018.01); G06F 17/16 (2006.01)

CPC G06F 7/78 (2013.01) [G06F 9/3001 (2013.01); G06F 9/30036 (2013.01); G06F 9/30038 (2023.08); G06F 9/3016 (2013.01); G06F 17/16 (2013.01)]

20 Claims

1. A processor, comprising:

a plurality of registers to store a plurality of packed data elements including a first plurality of packed data elements of a first source matrix tile and a second plurality of packed data elements of a second source matrix tile, the first and second source matrix tiles comprising respective portions of a first source matrix and a second source matrix, and wherein each packed data element of the plurality of packed data elements has an element width;

a decoder to decode one or more instructions, at least one instruction of the one or more instructions including an opcode field configured to specify an opcode, a first source operand configured to indicate the first plurality of packed data elements, a second source operand configured to indicate the second plurality of packed data elements, and a destination operand configured to indicate a result matrix tile;

execution circuitry, in response to the one or more instructions, to transpose the first source matrix tile in accordance with a granularity equal to the element width to generate a first transposed source matrix tile comprising the first plurality of packed data elements and to multiply the first transposed source matrix tile and the second source matrix tile, the execution circuitry comprising:

a plurality of multipliers to multiply data elements of the first transposed source matrix tile and corresponding data elements of the second source matrix tile to produce a corresponding plurality of products; and

one or more accumulators to add groups of the products to generate corresponding result data elements in the result matrix tile.
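
For illustration only, the data flow recited in claim 1 can be modeled in software. The sketch below is not the claimed hardware and is not taken from the patent. It assumes tiles are represented as lists of rows, and the function names `transpose_tile` and `transpose_multiply` are hypothetical. It shows a transpose at single-element granularity, followed by per-element multiplication and accumulation of each group of products into one result element.

```python
def transpose_tile(tile):
    """Transpose at element granularity: element (i, j) moves to (j, i)."""
    rows, cols = len(tile), len(tile[0])
    return [[tile[i][j] for i in range(rows)] for j in range(cols)]


def transpose_multiply(first_tile, second_tile):
    """Return transpose(first_tile) multiplied by second_tile.

    first_tile is K x M and second_tile is K x N, so the result is M x N.
    """
    first_t = transpose_tile(first_tile)  # K x M becomes M x K
    m, k = len(first_t), len(first_t[0])
    if len(second_tile) != k:
        raise ValueError("tile shapes are incompatible for multiplication")
    n = len(second_tile[0])

    result = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # "Multipliers": one product per pair of corresponding elements.
            products = [first_t[i][p] * second_tile[p][j] for p in range(k)]
            # "Accumulator": add the group of products into one result element.
            result[i][j] = sum(products)
    return result


first = [[1, 2, 3],
         [4, 5, 6]]      # 2 x 3 first source tile
second = [[7, 8],
          [9, 10]]       # 2 x 2 second source tile
print(transpose_multiply(first, second))
# prints [[43, 48], [59, 66], [75, 84]]
```

In this model the transposed first tile holds the same elements as the original tile, only rearranged, which mirrors the claim language that the transposed tile comprises the first plurality of packed data elements.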