US 12,287,816 B1
Reducing latency by processing parts of a language model query in parallel
Sayan Dev Pathak, Kirkland, WA (US); Osama Abuelsorour, Menlo Park, CA (US); Christopher Hakan Basoglu, Everett, WA (US); Harini Kesavamoorthy, Bellevue, WA (US); Girish Milind Mahajan, Redmond, WA (US); Salman Mohammad Quazi, Mountain View, CA (US); and Valeriy Viktorovich Kirshin, Kirkland, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Oct. 31, 2023, as Appl. No. 18/385,408.
Int. Cl. G06F 16/33 (2019.01); G06F 16/332 (2019.01); G06F 16/3329 (2025.01); G06F 16/334 (2025.01)
CPC G06F 16/3329 (2019.01) [G06F 16/3344 (2019.01)] 20 Claims
OG exemplary drawing
 
1. A method for processing a query using a machine-trained language model, comprising:
receiving an original query;
generating component queries based on the original query, the component queries having a same common part, and the component queries having different respective instance-specific parts;
distributing the component queries to respective processor instances, the processor instances being instances of one or more processors, each processor instance executing an instance of the machine-trained language model,
the processor instances generating respective component-query responses in parallel based on the component queries, and based on intermediate results previously generated by the machine-trained language model and stored in a cache memory, the previously generated intermediate results including key-value information used in performing an attention operation in the machine-trained language model;
receiving the component-query responses;
generating a final response based on the component-query responses; and
generating output information based on the final response.
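The flow recited in claim 1 can be sketched in code. The following is a minimal illustrative mock, not the patented implementation: the model, the cache structure, and all function names (`encode_common_part`, `process_component_query`, `answer`) are hypothetical stand-ins. It shows the shape of the technique — the common part of the query is encoded once, its key-value intermediate results are cached, and parallel processor instances reuse that cache while each handles only its instance-specific part, after which the component-query responses are combined into a final response.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder cache for the "intermediate results previously generated by the
# machine-trained language model" (the attention key-value information).
KV_CACHE: dict[str, tuple] = {}

def encode_common_part(common: str) -> tuple:
    # Stand-in for running the shared common part through the model once and
    # storing the key-value tensors produced by its attention layers.
    if common not in KV_CACHE:
        KV_CACHE[common] = ("kv-placeholder", common)
    return KV_CACHE[common]

def process_component_query(common: str, specific: str) -> str:
    # One processor instance: reuses the cached intermediate results for the
    # common part and computes only over its instance-specific part.
    kv = encode_common_part(common)
    assert kv is KV_CACHE[common]  # cache hit; the common part is not recomputed
    return f"response({specific})"

def answer(original_query: str, specifics: list[str]) -> str:
    # Generate component queries (common part + instance-specific parts),
    # distribute them to parallel instances, and combine the responses.
    encode_common_part(original_query)          # warm the cache once
    with ThreadPoolExecutor() as pool:          # parallel processor instances
        responses = list(
            pool.map(lambda s: process_component_query(original_query, s),
                     specifics)
        )
    return " | ".join(responses)                # final response from components

print(answer("Summarize the report", ["section 1", "section 2", "section 3"]))
```

In a real deployment the cached key-value tensors would let each instance skip the prefill computation for the shared prefix, which is where the claimed latency reduction comes from; the string placeholders above only mark where that reuse occurs.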