| CPC G06F 9/5011 (2013.01) [G06N 3/04 (2013.01); G06N 20/00 (2019.01); G06F 2212/2542 (2013.01); G06T 1/20 (2013.01)] | 18 Claims |

1. A method for scheduling tasks and allocating resources to perform a machine-learning (“ML”) workload using hardware accelerators that are each configured to implement a neural network comprising a plurality of neural network layers, the method comprising:
determining, based on a request to perform the ML workload, a resource requirement to perform the ML workload using a plurality of hosts;
generating, by a controller, a protocol bit indicating non-uniform memory access (NUMA) locality required for at least one task of the ML workload;
for each host of the plurality of hosts:
assigning, based on the protocol bit and a NUMA topology that includes NUMA nodes within the host, a task to be executed at the host using a plurality of hardware accelerators of the host; and
performing the ML workload by executing the task assigned to the host,
wherein the NUMA nodes include memory that is local to the host, the memory having a socket interface that couples the memory to each hardware accelerator of the plurality of hardware accelerators of the host and to a resource of the host, and wherein at least one NUMA topology specifies:
i) for a first host of the plurality of hosts, a first NUMA node that includes a first memory in a configuration of resources that is local to the first host, and
ii) a second, different memory in a configuration of resources that is local to a second, different host that is remote to the first NUMA node of the first host.
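The claim's scheduling mechanism can be illustrated with a small sketch. This is not the patented implementation; all names (`Host`, `NumaNode`, `Task`, `assign_task`) and the capacity rules are hypothetical, chosen only to show how a per-task NUMA-locality protocol bit could drive assignment: when the bit is set, the task must land on a single NUMA node that has both sufficient local memory and an attached accelerator; when clear, host-level aggregate capacity suffices and memory may span the host's NUMA nodes.

```python
# Illustrative sketch only -- names and policies are assumptions,
# not the claimed method.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NumaNode:
    node_id: int
    local_memory_gb: int            # memory local to this NUMA node
    accelerators: List[str]         # accelerators coupled via the node's socket interface


@dataclass
class Host:
    name: str
    numa_nodes: List[NumaNode]


@dataclass
class Task:
    name: str
    memory_gb: int
    numa_local: bool                # the "protocol bit": True => NUMA locality required


def assign_task(task: Task, hosts: List[Host]) -> Optional[Tuple[str, Optional[int]]]:
    """Return (host name, NUMA node id or None) for the first placement
    that satisfies the task's NUMA-locality bit, or None if no host fits."""
    for host in hosts:
        if task.numa_local:
            # Bit set: memory and an accelerator must share one NUMA node.
            for node in host.numa_nodes:
                if node.local_memory_gb >= task.memory_gb and node.accelerators:
                    return host.name, node.node_id
        else:
            # Bit clear: aggregate host capacity suffices; memory may
            # be drawn from multiple NUMA nodes of the host.
            total_mem = sum(n.local_memory_gb for n in host.numa_nodes)
            has_accel = any(n.accelerators for n in host.numa_nodes)
            if total_mem >= task.memory_gb and has_accel:
                return host.name, None
    return None


# Usage: host_a has 2 x 32 GB NUMA nodes; host_b has one 128 GB node.
host_a = Host("host-a", [NumaNode(0, 32, ["acc0"]), NumaNode(1, 32, ["acc1"])])
host_b = Host("host-b", [NumaNode(0, 128, ["acc0"])])

# A 64 GB task requiring locality cannot fit any single node of host-a,
# so it is placed on host-b's node 0.
local_placement = assign_task(Task("t1", 64, numa_local=True), [host_a, host_b])

# The same task without the locality bit fits host-a's aggregate 64 GB.
spread_placement = assign_task(Task("t2", 64, numa_local=False), [host_a, host_b])
```

The two placements differ only because of the protocol bit, mirroring the claim: the bit, together with the per-host NUMA topology, determines which host (and which node within it) receives the task.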