US 11,789,895 B2
On-chip heterogeneous AI processor with distributed task queues allowing for parallel task execution
Ping Wang, Shanghai (CN); and Jianwen Li, Shanghai (CN)
Assigned to SHANGHAI DENGLIN TECHNOLOGIES CO., LTD., Shanghai (CN)
Filed by SHANGHAI DENGLIN TECHNOLOGIES CO., LTD., Shanghai (CN)
Filed on Mar. 9, 2020, as Appl. No. 16/812,817.
Claims priority of application No. 201910846915.7 (CN), filed on Sep. 9, 2019.
Prior Publication US 2021/0073169 A1, Mar. 11, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/48 (2006.01); G06F 15/78 (2006.01); G06N 20/00 (2019.01)
CPC G06F 15/781 (2013.01) [G06F 9/4881 (2013.01); G06F 15/7807 (2013.01); G06N 20/00 (2019.01)] 10 Claims
OG exemplary drawing
 
1. An on-chip heterogeneous Artificial Intelligence (AI) processor, comprising:
at least two different architectural types of computation units, each of the computation units being associated with a task queue configured to store computation subtasks to be executed by the computation unit, wherein a first one of the computation units is a customized computation unit for a particular AI algorithm or operation, and a second one of the computation units is a programmable computation unit;
a controller configured to partition a received computation graph associated with a neural network into a plurality of computation subtasks and distribute the plurality of computation subtasks to the respective task queues associated with the computation units, wherein the controller distributes each computation subtask according to a type of the computation subtask to the task queue associated with a computation unit suitable for processing the type of the computation subtask;
a storage unit configured to store data associated with executing the plurality of computation subtasks; and
an access interface configured to access an off-chip memory;
wherein the computation units are configured to support the following three operation modes: an independent parallel mode, a cooperative parallel mode, and an interactive cooperation mode, and wherein:
in the independent parallel mode, at least two of the plurality of computation subtasks are executed independently and in parallel with each other, and data dependence and synchronization associated with different computation subtasks are realized in an off-chip memory shared by the computation units;
in the cooperative parallel mode, at least two of the plurality of computation subtasks are executed cooperatively in a pipelined manner, and data dependence and synchronization associated with different computation subtasks are realized in an on-chip memory shared by the computation units; and
in the interactive cooperation mode, a first one of the computation units, during the execution of a computation subtask distributed to the first one of the computation units, waits for or depends on results generated by a second one of the computation units executing a computation subtask distributed to the second one of the computation units, and data dependence and synchronization associated with different computation subtasks are realized in a cache memory shared by the computation units.
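The claim's core mechanism, a controller that partitions a computation graph and routes each subtask by type to the task queue of a suitable computation unit, can be illustrated with a minimal sketch. The following Python is hypothetical: the class names (`ComputationUnit`, `Controller`), the `unit_type` field on graph nodes, and the example operations (`conv2d`, `custom_postproc`) are all illustrative assumptions, not structures defined by the patent.

```python
from collections import deque
from enum import Enum, auto

class UnitType(Enum):
    """The two architectural types of computation units named in the claim."""
    CUSTOM = auto()        # customized unit for a particular AI algorithm or operation
    PROGRAMMABLE = auto()  # general-purpose programmable unit

class Mode(Enum):
    """The three operation modes, with the shared memory level used for synchronization."""
    INDEPENDENT_PARALLEL = auto()    # independent subtasks; sync via shared off-chip memory
    COOPERATIVE_PARALLEL = auto()    # pipelined subtasks; sync via shared on-chip memory
    INTERACTIVE_COOPERATION = auto() # one unit waits on another's results; sync via shared cache

class ComputationUnit:
    def __init__(self, unit_type: UnitType):
        self.unit_type = unit_type
        self.task_queue = deque()  # distributed task queue associated with this unit

class Controller:
    def __init__(self, units):
        self.units = units

    def partition(self, graph):
        # Hypothetical partitioning: assume each graph node already carries
        # the type of unit suited to executing it.
        return [(node["op"], node["unit_type"]) for node in graph]

    def distribute(self, graph):
        # Route each subtask, by type, to the queue of a suitable unit.
        for op, unit_type in self.partition(graph):
            unit = next(u for u in self.units if u.unit_type == unit_type)
            unit.task_queue.append(op)

# Usage: a customized unit receives the operation it is specialized for,
# while the programmable unit receives the remainder.
units = [ComputationUnit(UnitType.CUSTOM), ComputationUnit(UnitType.PROGRAMMABLE)]
ctrl = Controller(units)
graph = [
    {"op": "conv2d", "unit_type": UnitType.CUSTOM},
    {"op": "custom_postproc", "unit_type": UnitType.PROGRAMMABLE},
]
ctrl.distribute(graph)
```

This sketch models only the dispatch step; the actual modes differ in *where* cross-queue data dependence and synchronization are realized (off-chip memory, on-chip memory, or shared cache), which is a hardware property the software model above does not capture.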