US 12,430,166 B2
Hierarchical task scheduling for accelerators
Narasinga Rao Miniskar, Oak Ridge, TN (US); Frank Y. Liu, Oak Ridge, TN (US); Aaron R. Young, Oak Ridge, TN (US); Jeffrey S. Vetter, Oak Ridge, TN (US); and Dwaipayan Chakraborty, Oak Ridge, TN (US)
Assigned to UT-Battelle, LLC, Oak Ridge, TN (US)
Filed by UT-Battelle, LLC, Oak Ridge, TN (US)
Filed on Dec. 3, 2021, as Appl. No. 17/542,022.
Claims priority of provisional application 63/124,268, filed on Dec. 11, 2020.
Prior Publication US 2022/0188155 A1, Jun. 16, 2022
Int. Cl. G06F 9/48 (2006.01); G06F 9/50 (2006.01)
CPC G06F 9/4881 (2013.01) [G06F 9/5016 (2013.01); G06F 9/5027 (2013.01); G06F 2209/501 (2013.01); G06F 2209/5017 (2013.01)]
28 Claims
1. A system for scheduling tasks among a plurality of accelerator circuits, the system comprising:
a hierarchical task scheduler comprising:
a coarse scheduling circuit module configured to:
receive task-set metadata comprising a task graph having vertices representing respective tasks and directed edges, each from a respective source vertex of the vertices to a respective destination vertex of the vertices, joining respective pairs of the vertices, each of the directed edges having a weight representing a measure of data transfer from an upstream task represented by the respective source vertex to a downstream task represented by the respective destination vertex;
based on the task graph, schedule tasks from the task-set metadata among the plurality of accelerator circuits to minimize a makespan; and
dispatch the scheduled tasks to the plurality of accelerator circuits; and
two or more fine scheduling circuit modules, wherein each fine scheduling circuit module is communicatively coupled with the coarse scheduling circuit module and with a corresponding accelerator circuit from among the plurality of accelerator circuits, the corresponding accelerator circuit having a limited amount of local memory storage for computation and data transfers;
wherein each fine scheduling circuit module comprises:
an interface sub-module configured to receive, from the coarse scheduling circuit module, the tasks scheduled for the corresponding accelerator circuit; and
an accelerator-specific scheduler (AS) sub-module configured to:
partition, for a given task of the received scheduled tasks, the given task into two or more streams of first sub-tasks, including: a first stream comprising computation sub-tasks each requiring one tile of the local memory storage; and a second stream comprising data-transfer sub-tasks; and
schedule the computation sub-tasks to execute synchronously and in parallel with the data-transfer sub-tasks on the corresponding accelerator circuit, each of the computation sub-tasks executing in a respective first time slot and having input or output data transferred by a respective one of the data-transfer sub-tasks in a second time slot adjacent to the first time slot;
wherein the AS sub-module enables the received scheduled tasks to be executed at the corresponding accelerator circuit with the limited amount of the local memory storage being less than or equal to two tiles of the local memory storage.
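
The two scheduling levels recited in claim 1 can be illustrated with a minimal software sketch. It is not the claimed circuit implementation, and every name in it (`TaskGraph`, `cost`, `coarse_schedule`, `fine_schedule`) is hypothetical. The per-task compute cost is an assumption, since the claim specifies only edge weights. The coarse level uses a greedy earliest-finish-time heuristic as a stand-in, which reduces but does not guarantee a minimal makespan. The fine level alternates between two tile buffers, so that each computation sub-task runs in the slot adjacent to the data-transfer sub-task that loaded its input, and no more than two tiles of local memory are in use at once.

```python
from dataclasses import dataclass


@dataclass
class TaskGraph:
    cost: dict   # cost[v]: assumed compute time of task v (not in the claim)
    edges: dict  # edges[(u, v)]: weight = measure of data moved from u to v

    def preds(self, v):
        """Upstream tasks whose output feeds task v."""
        return [u for (u, w) in self.edges if w == v]


def topo_order(g):
    """Order tasks so every upstream task precedes its downstream tasks."""
    indeg = {v: len(g.preds(v)) for v in g.cost}
    ready = sorted(v for v, d in indeg.items() if d == 0)
    order = []
    while ready:
        u = ready.pop(0)
        order.append(u)
        for (a, b) in g.edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return order


def coarse_schedule(g, n_accel):
    """Greedy list scheduling: place each task where it finishes earliest."""
    free_at = [0.0] * n_accel
    place, finish = {}, {}
    for v in topo_order(g):
        best = None
        for a in range(n_accel):
            # Data from a predecessor on another accelerator costs the edge weight.
            ready = max(
                [finish[u] + (0 if place[u] == a else g.edges[(u, v)])
                 for u in g.preds(v)],
                default=0.0,
            )
            end = max(ready, free_at[a]) + g.cost[v]
            if best is None or end < best[0]:
                best = (end, a)
        finish[v], place[v] = best
        free_at[best[1]] = best[0]
    return place, max(finish.values())


def fine_schedule(n_tiles):
    """Two-stream, two-buffer schedule for one task split into n_tiles sub-tasks.

    Slot t holds (compute sub-task for tile t-1, load sub-task for tile t).
    Each entry is (kind, tile_index, buffer_index); buffers alternate 0/1.
    """
    slots = []
    for t in range(n_tiles + 1):
        compute = ("compute", t - 1, (t - 1) % 2) if t >= 1 else None
        transfer = ("load", t, t % 2) if t < n_tiles else None
        slots.append((compute, transfer))
    return slots


if __name__ == "__main__":
    g = TaskGraph(cost={"A": 2, "B": 3, "C": 2},
                  edges={("A", "C"): 4, ("B", "C"): 1})
    print(coarse_schedule(g, 2))
    for slot in fine_schedule(3):
        print(slot)
```

In this example, task C is placed on the same accelerator as task A because the A-to-C edge carries the larger transfer weight, so co-locating them avoids the more expensive transfer. In every slot of the fine schedule, the computation and the transfer touch different buffers, which is what keeps the local-memory footprint at two tiles.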