CPC G06F 9/3001 (2013.01) [G06F 9/3005 (2013.01); G06F 9/3016 (2013.01); G06F 9/30036 (2013.01); G06F 9/30043 (2013.01); G06F 9/30076 (2013.01); G06F 9/30109 (2013.01); G06F 9/30123 (2013.01); G06F 9/30145 (2013.01); G06F 9/383 (2013.01); G06F 9/3824 (2013.01)] | 21 Claims |
1. A processor comprising:
decode circuitry configured to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices to contain doubleword elements; and
execution circuitry configured to execute the decoded single tile dot product instruction by causing a grid of fused multiply accumulate circuits, comprising a set of multipliers, an accumulator, and a saturation circuit, to perform a flow K times for each element (M,N) of the identified destination matrix, the flow comprising:
generating eight products by multiplying each nibble of a doubleword (32-bit) element (M,K) of the identified first source matrix by a corresponding nibble of a doubleword element (K,N) of the identified second source matrix with a corresponding multiplier of the set of multipliers;
accumulating the eight products with previous contents of the doubleword element (M,N) of the identified destination matrix to generate a sum by the accumulator;
saturating the sum to generate a saturated sum by the saturation circuit when the sum is beyond a range of an integer; and
storing the saturated sum at element (M,N) of the identified destination matrix.
|
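The flow recited in claim 1 can be sketched in software as a reference model. This is only an illustrative sketch, not the claimed circuitry: the function names (`nibbles`, `saturate_int32`, `tile_dot_product`) are hypothetical, and the claim does not state whether nibbles are signed, which nibble order applies, or which integer range the saturation targets. The sketch assumes 4-bit two's-complement nibbles by default (with an unsigned option), pairs nibbles by bit position, and saturates to the signed 32-bit range.

```python
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def nibbles(dword, signed=True):
    """Split a 32-bit value into eight 4-bit fields, least significant first.

    Assumption: signed nibbles are two's complement (range -8..7).
    """
    fields = []
    for i in range(8):
        n = (dword >> (4 * i)) & 0xF
        if signed and n >= 8:
            n -= 16
        fields.append(n)
    return fields


def saturate_int32(value):
    """Clamp a sum to the signed 32-bit range (assumed integer range)."""
    return max(INT32_MIN, min(INT32_MAX, value))


def tile_dot_product(dst, src1, src2, signed=True):
    """Update the M-by-N matrix dst in place from M-by-K src1 and K-by-N src2.

    Source elements are 32-bit patterns; dst elements are signed 32-bit ints.
    """
    M, K, N = len(src1), len(src2), len(src2[0])
    for m in range(M):
        for n in range(N):
            for k in range(K):  # the flow runs K times per element (m, n)
                a = nibbles(src1[m][k], signed)
                b = nibbles(src2[k][n], signed)
                # Eight products, one per pair of corresponding nibbles.
                products = [x * y for x, y in zip(a, b)]
                # Accumulate with the previous contents of dst[m][n].
                total = dst[m][n] + sum(products)
                # Saturate, then store back to the destination element.
                dst[m][n] = saturate_int32(total)
    return dst
```

For example, with a 1-by-1 destination initialized to zero, `tile_dot_product([[0]], [[0x11111111]], [[0x22222222]])` multiplies eight pairs of nibbles (1 times 2) and accumulates them into 16. Saturation is applied on every pass of the flow rather than once at the end, matching the order of steps in the claim.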