US 12,423,058 B2
Systolic array with input reduction to multiple reduced inputs
Paul Gilbert Meyer, Rollingwood, TX (US); Thomas A Volpe, Austin, TX (US); Ron Diamant, Santa Clara, CA (US); Joshua Wayne Bowman, Austin, TX (US); Nishith Desai, Austin, TX (US); and Thomas Elmer, Austin, TX (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Jun. 30, 2021, as Appl. No. 17/363,900.
Prior Publication US 2023/0004523 A1, Jan. 5, 2023
Int. Cl. G06F 7/544 (2006.01); G06F 7/487 (2006.01); G06F 7/499 (2006.01); G06F 15/80 (2006.01)
CPC G06F 7/5443 (2013.01) [G06F 7/4876 (2013.01); G06F 7/49942 (2013.01); G06F 15/8046 (2013.01)]
20 Claims
 
1. A systolic array processor organized in rows and columns, each row comprising:
a bit reducer configured to convert 32-bit input data elements into two 21-bit input data elements, the bit reducer comprising:
first hardware circuitry configured to convert a 32-bit input data element of the 32-bit input data elements into a first 21-bit input data element, the first 21-bit input data element corresponding to a set of most significant bits of a significand portion of the 32-bit input data element, the first hardware circuitry comprising:
second hardware circuitry configured to reduce a quantity of trailing bits representing the significand portion of the 32-bit input data element to produce a first reduced significand portion of the 32-bit input data element, the first reduced significand portion corresponding to the set of most significant bits; and
third hardware circuitry configured to increase a quantity of bits representing an exponent portion of the 32-bit input data element to produce a first increased exponent portion,
wherein the first hardware circuitry produces the first 21-bit input data element based on the first reduced significand portion and the first increased exponent portion; and
fourth hardware circuitry configured to convert the 32-bit input data element into a second 21-bit input data element, the second 21-bit input data element corresponding to a set of least significant bits of the significand portion of the 32-bit input data element, the fourth hardware circuitry comprising:
fifth hardware circuitry configured to reduce a quantity of leading bits representing the significand portion of the 32-bit input data element to produce a second reduced significand portion of the 32-bit input data element, the second reduced significand portion corresponding to the set of least significant bits; and
sixth hardware circuitry configured to increase the quantity of bits representing the exponent portion of the 32-bit input data element to produce a second increased exponent portion,
wherein the fourth hardware circuitry produces the second 21-bit input data element based on the second reduced significand portion and the second increased exponent portion; and
a plurality of processing elements, wherein a first processing element of a first row of the plurality of processing elements and a first column of the plurality of processing elements is configured to:
generate a multiplier product based on the first 21-bit input data element and a weight;
adjust a bit-length of a significand portion of the multiplier product to correspond to a bit-length supported by an adder of the first processing element;
generate an addition result based on addition of the adjusted multiplier product and an input partial sum, wherein a first output partial sum of the first processing element is based on the addition result, and wherein a second output partial sum of the first processing element is based on the second 21-bit input data element;
route the first 21-bit input data element and the second 21-bit input data element to a second processing element of the first row of the plurality of processing elements and a second column of the plurality of processing elements, wherein a 21 bit-length corresponds to a maximum supported bit-length of a multiplier of the first processing element; and
route the first output partial sum and the second output partial sum to a third processing element of a second row of the plurality of processing elements and the first column of the plurality of processing elements, wherein an output of the systolic array processor is based on a plurality of outputs of the first column of the plurality of processing elements generated using reduced inputs.
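The input reduction recited in claim 1 can be illustrated with a minimal software sketch. This is not the claimed hardware circuitry. It assumes a 21-bit layout of 1 sign bit, 9 exponent bits, and 11 stored significand bits, which the claim does not specify, and it models the reduced values as ordinary Python floats rather than packed 21-bit fields. The names `to_f32`, `reduce_fp32`, `fits_reduced_significand`, and `column_dot` are hypothetical. Inputs are assumed finite, weights are used unreduced, and the adjustment of the product significand to the adder width is not modeled.

```python
import math
import struct

# Assumption: 12 trailing bits of the 23-bit FP32 fraction are dropped for the
# "most significant" element, leaving 11 stored significand bits.
TRAILING_BITS = 12


def to_f32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def reduce_fp32(x: float) -> tuple[float, float]:
    """Split an FP32 value into a high part and a low part with hi + lo == x.

    hi keeps the sign, exponent, and leading fraction bits.
    lo carries the trailing fraction bits; once normalized its exponent can
    fall below the FP32 range, which is why a wider exponent field is needed.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    hi_bits = bits & ~((1 << TRAILING_BITS) - 1)  # clear trailing fraction bits
    hi = struct.unpack("<f", struct.pack("<I", hi_bits))[0]
    lo = to_f32(x) - hi  # exact in double precision
    return hi, lo


def fits_reduced_significand(v: float) -> bool:
    """Check that v needs at most 12 significant bits (1 implicit + 11 stored)."""
    mantissa, _ = math.frexp(v)  # mantissa is 0 or in [0.5, 1)
    return (abs(mantissa) * (1 << 12)).is_integer()


def column_dot(inputs: list[float], weights: list[float]) -> float:
    """Accumulate two partial sums down one column, one per reduced input."""
    psum_hi = 0.0  # partial sum driven by the most significant elements
    psum_lo = 0.0  # partial sum driven by the least significant elements
    for x, w in zip(inputs, weights):
        hi, lo = reduce_fp32(x)
        psum_hi += hi * w
        psum_lo += lo * w
    return psum_hi + psum_lo
```

For example, `reduce_fp32(3.14159)` returns two values whose sum equals the FP32 rounding of 3.14159, and each value passes `fits_reduced_significand`. Because the two reduced inputs together carry all 24 significand bits of the original value, a narrower multiplier can process them separately without discarding input precision.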