US 12,260,248 B2
Systems and methods for performing multiplication of one or more matrices using multi-thread systolic arrays
Tal Horowitz, Munich (DE); Uri Weiser, Munich (DE); Zuguang Wu, Hangzhou (CN); Huibin Luo, Hangzhou (CN); and Yoni Choukroun, Munich (DE)
Assigned to Huawei Technologies Co., Ltd., Shenzhen (CN)
Filed by Huawei Technologies Co., Ltd., Shenzhen (CN)
Filed on Feb. 25, 2020, as Appl. No. 16/800,799.
Application 16/800,799 is a continuation of application No. PCT/EP2017/073854, filed on Sep. 21, 2017.
Prior Publication US 2020/0192701 A1, Jun. 18, 2020
Int. Cl. G06F 9/48 (2006.01); G06F 9/38 (2018.01); G06F 15/80 (2006.01)
CPC G06F 9/4843 (2013.01) [G06F 9/38 (2013.01); G06F 9/3851 (2013.01); G06F 9/3888 (2023.08); G06F 15/8046 (2013.01)]
18 Claims
 
1. A multi-thread systolic array comprising:
a plurality of processing elements each including a processor, a first input interface, a second input interface, a computation component, a shifting path component and a stalling component;
a respective suspender queue for each input of the plurality of processing elements,
wherein each of the processing elements is configured to:
receive a plurality of first inputs by the first input interface from a respective first input source;
receive a plurality of second inputs by the second input interface from a respective second input source,
wherein the plurality of first inputs and the plurality of second inputs are arranged as a plurality of pairs corresponding to a plurality of threads, wherein each thread includes one of the first inputs paired with one of the second inputs;
schedule, by the processor, for each operation cycle of the processor, a certain thread of the plurality of threads for execution of a computation operation by each processor of the respective processing element of the multi-thread systolic array, wherein the scheduling is performed according to available impacting values of the certain thread based on analysing the received plurality of the first and second inputs;
execute the computation operation for the certain thread by the computation component; and
parallel execute a bypass operation for a first thread of other threads in the plurality of threads by the shifting path component, when at least one value of the first thread is a non-impacting value, and a stalling operation for a second thread of the other threads by the stalling component, when at least one value of the second thread is an impacting value, wherein the stalling operation comprises locally storing a set of values that are being stalled by the stalling operation associated with the second thread within the respective suspender queue, and
wherein the bypassing is performed without execution of the computation operation.
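The per-cycle behaviour recited in claim 1 can be illustrated with a minimal software sketch. It is not the patented hardware and rests on several assumptions that the claim does not state: zero is treated as the non-impacting value (as it would be for multiply-accumulate), the scheduler picks the lowest-numbered ready thread, and a single queue per thread holds input pairs, whereas the claim recites a suspender queue per input. The names `ProcessingElement`, `is_impacting`, and `cycle` are hypothetical.

```python
from collections import deque


def is_impacting(value):
    # Assumption: for multiply-accumulate, a zero operand cannot change
    # the result, so zero is the only non-impacting value.
    return value != 0


class ProcessingElement:
    """Toy model of one processing element handling several threads."""

    def __init__(self, num_threads):
        # One accumulator and one suspender queue per thread
        # (simplification: the queue stores whole input pairs).
        self.acc = [0] * num_threads
        self.queues = [deque() for _ in range(num_threads)]

    def cycle(self, inputs):
        """Run one operation cycle.

        inputs: one (first_input, second_input) pair or None per thread.
        Returns (actions, forwarded), both keyed by thread index.
        """
        actions = {}
        forwarded = {}
        computed = False  # the computation component serves one thread per cycle
        for t, pair in enumerate(inputs):
            if pair is not None:
                self.queues[t].append(pair)
            if not self.queues[t]:
                actions[t] = "idle"
                continue
            a, b = self.queues[t][0]
            if not (is_impacting(a) and is_impacting(b)):
                # Bypass: shift the pair onward without computing.
                forwarded[t] = self.queues[t].popleft()
                actions[t] = "bypass"
            elif not computed:
                # Scheduled thread: multiply-accumulate, then shift onward.
                forwarded[t] = self.queues[t].popleft()
                self.acc[t] += a * b
                actions[t] = "compute"
                computed = True
            else:
                # Stall: the pair stays in the local suspender queue.
                actions[t] = "stall"
        return actions, forwarded
```

In this sketch, feeding three threads the pairs (2, 3), (0, 5), and (4, 1) in one cycle computes the first, bypasses the second because it contains a zero, and stalls the third in its queue until the computation component is free on a later cycle. A real scheduler would likely use a fairer policy than lowest index first; that choice is made here only to keep the example short.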