| CPC G06F 16/3329 (2019.01) [G06F 16/3344 (2019.01)] | 20 Claims |

|
1. A method for processing a query using a machine-trained language model, comprising:
receiving an original query;
generating component queries based on the original query, the component queries having a same common part, and the component queries having different respective instance-specific parts;
distributing the component queries to respective processor instances, the processor instances being instances of one or more processors, each processor instance executing an instance of the machine-trained language model,
the processor instances generating respective component-query responses in parallel based on the plural component queries, and based on intermediate results previously generated by the machine-trained language model and stored in a cache memory, the previously generated intermediate results including key-value information used in performing an attention operation in the machine-trained language model;
receiving the component-query responses;
generating a final response based on the component-query responses; and
generating output information based on the final response.
|
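The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `ComponentQuery`, `PrefixCache`, `run_model_instance`, and `process_query` are hypothetical, threads stand in for the processor instances, and a tuple of whitespace-split tokens stands in for the key-value information that a real attention operation would cache. The sketch assumes the common part is the original query itself and that the instance-specific parts are supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import threading


@dataclass(frozen=True)
class ComponentQuery:
    common_part: str      # shared by every component query
    instance_part: str    # differs per component query


class PrefixCache:
    """Toy stand-in for a cache of intermediate (key-value) results."""

    def __init__(self):
        self._store = {}
        self._lock = threading.Lock()
        self.computations = 0  # counts how often the common part was processed

    def get_or_compute(self, prefix):
        with self._lock:
            if prefix not in self._store:
                self.computations += 1
                # Assumption: a real system would store attention keys/values here.
                self._store[prefix] = tuple(prefix.split())
            return self._store[prefix]


def generate_component_queries(original_query, instance_parts):
    # Every component query reuses the same common part.
    return [ComponentQuery(original_query, part) for part in instance_parts]


def run_model_instance(query, cache):
    # Reuse the cached intermediate result for the common part, then handle
    # only the instance-specific part (placeholder for a language model call).
    prefix_state = cache.get_or_compute(query.common_part)
    return f"{query.instance_part}: used {len(prefix_state)} cached prefix tokens"


def process_query(original_query, instance_parts, cache):
    component_queries = generate_component_queries(original_query, instance_parts)
    # Populate the cache once so every instance finds a previously generated result.
    cache.get_or_compute(original_query)
    workers = max(1, len(component_queries))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the component-query responses in submission order.
        responses = list(
            pool.map(lambda q: run_model_instance(q, cache), component_queries)
        )
    # Final response: here simply the joined component-query responses.
    return "\n".join(responses)
```

Calling `process_query("Summarize the review below.", ["review A", "review B"], PrefixCache())` returns one line per component query. The common part is processed once, however many component queries are distributed, which is the saving the cached intermediate results are meant to provide.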