US 11,836,520 B2
Dynamic batching for inference system for transformer-based generation tasks
Gyeongin Yu, Seoul (KR); Geon-Woo Kim, Seoul (KR); Joo Seong Jeong, Seoul (KR); Soojeong Kim, Seoul (KR); and Byung-Gon Chun, Seoul (KR)
Assigned to FRIENDLIAI INC., Seoul (KR)
Filed by FriendliAI Inc., Seoul (KR)
Filed on Aug. 4, 2022, as Appl. No. 17/881,549.
Application 17/881,549 is a continuation of application No. 17/542,193, filed on Dec. 3, 2021, granted, now Pat. No. 11,442,775.
Prior Publication US 2023/0176903 A1, Jun. 8, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/48 (2006.01); G06N 5/04 (2023.01); G06N 20/00 (2019.01); G06F 9/50 (2006.01); G06N 3/08 (2023.01); G06N 3/045 (2023.01)
CPC G06F 9/4881 (2013.01) [G06F 9/5016 (2013.01); G06N 5/04 (2013.01); G06N 20/00 (2019.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01)] 66 Claims
OG exemplary drawing
 
1. A method of dynamically executing batches of requests on one or more execution engines running a machine-learning transformer model, comprising:
receiving, by a serving system, one or more requests for execution, the serving system including a scheduler and one or more execution engines each coupled to access a machine-learning transformer model including at least a set of decoders;
scheduling, by the scheduler, a batch of requests including the one or more requests for execution on an execution engine;
generating, by the execution engine, a first set of output tokens by applying the transformer model to a first set of inputs for the batch of requests, wherein applying the transformer model comprises applying at least one batch operation to one or more input tensors associated with the batch of requests;
receiving, by the serving system, a new request from a client device;
obtaining a sequence of input tokens for the new request;
scheduling, by the scheduler, a second batch of requests for execution on the execution engine, the second batch of requests including the new request and at least one request in the batch of requests, wherein in a second set of inputs for the second batch of requests, a length of the sequence of input tokens for the new request is different from a length of an input for the at least one request; and
generating, by the execution engine, a second set of output tokens by applying the transformer model to the second set of inputs for the second batch of requests.
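
The claim describes iteration-level (dynamic) batching: the scheduler re-forms the batch between generation steps, so a newly arrived request with an input sequence of a different length can join requests that are already mid-generation. The following Python sketch illustrates only that control flow under stated assumptions; it is not the patented implementation, and all names here (Request, Scheduler, ExecutionEngine) are hypothetical. The "engine" is a toy stand-in that emits dummy token IDs rather than running a real transformer.

```python
import queue
from dataclasses import dataclass, field

@dataclass
class Request:
    req_id: int
    input_tokens: list                     # variable-length sequence of token ids
    output_tokens: list = field(default_factory=list)
    done: bool = False

class ExecutionEngine:
    """Toy stand-in for an engine running a decoder-only transformer."""
    def step(self, batch):
        # One decoding iteration over the whole batch. Requests in the
        # batch may have inputs of different lengths; a real engine would
        # apply batch operations to the batch's input tensors here.
        for req in batch:
            next_token = (sum(req.input_tokens) + len(req.output_tokens)) % 100
            req.output_tokens.append(next_token)
            if len(req.output_tokens) >= 2:   # toy stopping criterion
                req.done = True

class Scheduler:
    """Re-forms the batch at every iteration (dynamic batching)."""
    def __init__(self, engine):
        self.engine = engine
        self.pending = queue.Queue()       # newly arrived requests
        self.running = []                  # requests currently in the batch

    def submit(self, req):
        self.pending.put(req)

    def step(self):
        # Admit new requests into the batch alongside in-flight ones,
        # even though their input-token sequences differ in length.
        while not self.pending.empty():
            self.running.append(self.pending.get())
        finished = []
        if self.running:
            self.engine.step(self.running)
            finished = [r for r in self.running if r.done]
            self.running = [r for r in self.running if not r.done]
        return finished

engine = ExecutionEngine()
sched = Scheduler(engine)

# First batch: two requests with different input lengths.
sched.submit(Request(1, [5, 9, 2]))
sched.submit(Request(2, [7]))
sched.step()                               # first set of output tokens

# A new request arrives mid-generation and joins the second batch
# together with the still-running requests.
sched.submit(Request(3, [1, 2, 3, 4, 5]))
finished = sched.step()                    # second set of output tokens

for r in finished:
    print(r.req_id, r.output_tokens)
```

The key design point the sketch mirrors is that scheduling happens per iteration rather than per request: a conventional request-level batcher would hold request 3 until the first batch drained, whereas here it is merged into the second batch as soon as it arrives, despite its input length differing from the in-flight requests.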