CPC G06F 7/78 (2013.01) [G06F 9/3001 (2013.01); G06F 9/30036 (2013.01); G06F 9/30038 (2023.08); G06F 9/3016 (2013.01); G06F 17/16 (2013.01)] | 20 Claims |
1. A processor, comprising:
a plurality of registers to store a plurality of packed data elements including a first plurality of packed data elements of a first source matrix tile and a second plurality of packed data elements of a second source matrix tile, the first and second source matrix tiles comprising respective portions of a first source matrix and a second source matrix, and wherein each packed data element of the plurality of packed data elements has an element width;
a decoder to decode one or more instructions, at least one instruction of the one or more instructions including an opcode field configured to specify an opcode, a first source operand configured to indicate the first plurality of packed data elements, a second source operand configured to indicate the second plurality of packed data elements, and a destination operand configured to indicate a result matrix tile;
execution circuitry, in response to the one or more instructions, to transpose the first source matrix tile in accordance with a granularity equal to the element width to generate a first transposed source matrix tile comprising the first plurality of packed data elements and to multiply the first transposed source matrix tile and the second source matrix tile, the execution circuitry comprising:
a plurality of multipliers to multiply data elements of the first transposed source matrix tile and corresponding data elements of the second source matrix tile to produce a corresponding plurality of products; and
one or more accumulators to add groups of the products to generate corresponding result data elements in the result matrix tile.
|
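For illustration only, the arithmetic recited in claim 1 can be modeled in software as a transpose of the first source tile followed by a multiply-accumulate with the second source tile. The sketch below is an assumption-laden reading aid, not the claimed circuitry: the names `transpose_tile` and `transpose_multiply` are hypothetical, tiles are modeled as nested Python lists, and the element-width granularity is modeled by moving whole elements during the transpose.

```python
def transpose_tile(tile):
    """Transpose a tile at element granularity: whole elements swap row/column positions."""
    return [list(column) for column in zip(*tile)]


def transpose_multiply(src1, src2):
    """Hypothetical model: result = transpose(src1) x src2.

    src1 is a K x M tile and src2 is a K x N tile, so the result is M x N.
    """
    src1_t = transpose_tile(src1)  # M x K
    inner = len(src2)
    if any(len(row) != inner for row in src1_t):
        raise ValueError("tile shapes are incompatible for multiplication")
    cols = len(src2[0])
    result = [[0] * cols for _ in range(len(src1_t))]
    for i, row in enumerate(src1_t):
        for j in range(cols):
            # "Multipliers": one product per pair of corresponding elements.
            products = [row[k] * src2[k][j] for k in range(inner)]
            # "Accumulator": add the group of products into one result element.
            result[i][j] = sum(products)
    return result


first_tile = [[1, 2], [3, 4], [5, 6]]   # 3 x 2 source tile
second_tile = [[1, 0], [0, 1], [1, 1]]  # 3 x 2 source tile
result_tile = transpose_multiply(first_tile, second_tile)  # 2 x 2 result tile
```

In this model each result element is the sum of one group of products, mirroring the split between the multipliers and the accumulators in the claim; hardware would typically compute the products in parallel rather than in nested loops.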