US 12,236,245 B2
Group thread dispatch for graph streaming processor
Kota Vamsi Krishna Darsi, Hyderabad (IN); Sarvendra Govindammagari, Hyderabad (IN); Venkata Divyabharathi Palaparthy, Hyderabad (IN); Venkata Ganapathi Puppala, Hyderabad (IN); and Satyaki Koneru, Folsom, CA (US)
Assigned to Blaize Inc., El Dorado Hills, CA (US)
Filed by Blaize Inc., El Dorado Hills, CA (US)
Filed on Jun. 12, 2023, as Appl. No. 18/208,365.
Prior Publication US 2024/0411560 A1, Dec. 12, 2024
Int. Cl. G06F 9/30 (2018.01); G06F 9/38 (2018.01)
CPC G06F 9/3851 (2013.01) [G06F 9/30043 (2013.01); G06F 9/3887 (2013.01); G06F 9/30036 (2013.01); G06F 9/30098 (2013.01); G06F 9/3888 (2023.08)]
20 Claims
 
1. A method of group thread dispatch for a graph streaming processor, comprising:
receiving, by a thread scheduler of the graph streaming processor, a group of threads, wherein the group of threads comprises a plurality of threads which operate on an input tensor, wherein each of the plurality of threads operates on inputs of the input tensor and a subset of a weight tensor to generate a subset of an output tensor;
calculating, by the thread scheduler, a resource requirement for execution of the group of threads;
calculating, by the thread scheduler, resource availability in a plurality of processors of each of a plurality of processor arrays;
dispatching the group of threads to a selected one of the plurality of processors of the plurality of processor arrays that has a resource availability that meets or exceeds the resource requirement for execution of the group of threads; and
scheduling a group load instruction for all threads of the group of threads, comprising:
loading into a group load register a subset of inputs of the input tensor for processing of each thread of the group of threads, wherein the group load register provides the subset of the inputs of the input tensor to the group of threads of the selected one of the plurality of processors;
wherein all threads of the group of threads are synchronized when executing the group load instruction;
wherein all threads of the group of threads are processed independently on the selected one of the plurality of processors when not executing the group load instruction;
wherein the processing of each thread of the group of threads comprises generating a subset of outputs of the output tensor for each thread of the plurality of threads based on the subset of weights of the weight tensor and the inputs of the input tensor.
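
The steps recited in claim 1 can be pictured with a minimal software sketch. It is illustrative only and is not the claimed hardware or the assignee's implementation. All names here (Processor, ThreadGroup, select_processor, run_group, free_units, units_per_thread) are hypothetical. The sketch assumes a single scalar resource count per processor and a first-fit selection policy, neither of which the claim specifies, and it uses a sequential loop to stand in for threads that would run independently after the synchronized group load.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Processor:
    """Hypothetical processor with one scalar resource budget (assumption)."""
    name: str
    free_units: int


@dataclass
class ThreadGroup:
    """Hypothetical group: thread i owns the weight rows in weight_rows[i]."""
    weight_rows: List[List[List[float]]]
    units_per_thread: int

    def resource_requirement(self) -> int:
        # Requirement for the whole group, not for a single thread.
        return len(self.weight_rows) * self.units_per_thread


def select_processor(arrays: Sequence[Sequence[Processor]],
                     group: ThreadGroup) -> Optional[Processor]:
    """Return the first processor whose availability meets or exceeds the
    group's requirement. First-fit is an assumption of this sketch."""
    need = group.resource_requirement()
    for array in arrays:
        for proc in array:
            if proc.free_units >= need:
                return proc
    return None


def run_group(group: ThreadGroup, inputs: Sequence[float]) -> List[List[float]]:
    """Model the group load followed by independent per-thread processing."""
    # Group load: the input subset is loaded once and shared by every thread.
    group_load_register: Tuple[float, ...] = tuple(inputs)
    outputs = []
    # Each iteration models one thread; in hardware these run independently.
    for rows in group.weight_rows:
        outputs.append([
            sum(w * x for w, x in zip(row, group_load_register))
            for row in rows
        ])
    return outputs


def dispatch_and_run(arrays: Sequence[Sequence[Processor]],
                     group: ThreadGroup,
                     inputs: Sequence[float]) -> Tuple[str, List[List[float]]]:
    """Dispatch the whole group to one processor, then execute it there."""
    proc = select_processor(arrays, group)
    if proc is None:
        raise RuntimeError("no processor can hold the whole thread group")
    need = group.resource_requirement()
    proc.free_units -= need          # reserve resources for the group
    try:
        return proc.name, run_group(group, inputs)
    finally:
        proc.free_units += need      # release resources when the group ends


arrays = [[Processor("p0", 2), Processor("p1", 8)], [Processor("p2", 16)]]
group = ThreadGroup(
    weight_rows=[[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]],
    units_per_thread=3,
)
chosen, result = dispatch_and_run(arrays, group, [1.0, 2.0])
```

In this sketch the group needs 6 units, so p0 (2 units free) is skipped and the whole group lands on p1. Both modeled threads read the same group_load_register but apply their own weight rows, so each produces its own subset of the output.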