US 12,242,948 B2
Systems and methods for routing within multitask mixture-of-experts models
Yanping Huang, Mountain View, CA (US); Dmitry Lepikhin, Sunnyvale, CA (US); Maxim Krikun, Castro Valley, CA (US); Orhan Firat, Mountain View, CA (US); Ankur Bapna, Sunnyvale, CA (US); Thang Luong, Santa Clara, CA (US); and Sneha Kudugunta, Sunnyvale, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 27, 2021, as Appl. No. 17/159,437.
Prior Publication US 2022/0237435 A1, Jul. 28, 2022
Int. Cl. G06N 3/045 (2023.01); G06N 3/08 (2023.01)

CPC G06N 3/045 (2023.01) [G06N 3/08 (2013.01)]

18 Claims

1. A computer-implemented method of processing an input sequence in a transformer having an encoder and a decoder, the method comprising:

generating, by one or more processors of a processing system, a first tokenized input sequence based on the input sequence, the first tokenized input sequence comprising a plurality of tokens corresponding to a task;

for each given token of the plurality of tokens, by the one or more processors:

generating a first vector representing the given token;

at a first layer of the encoder:

routing, based on a learned gating function of a mixture-of-experts (MoE) sublayer of the first layer, the first vector to two or more expert feed-forward networks of a set of expert feed-forward networks of the MoE sublayer of the first layer; and

generating a second vector based on processing the first vector in the two or more expert feed-forward networks of the set of expert feed-forward networks of the MoE sublayer of the first layer; and

at a second layer of the encoder including a single feed-forward network sublayer, generating a third vector based on processing the second vector in the single feed-forward network sublayer;

generating, by the one or more processors, a combined encoder output vector corresponding to the task and based on each third vector for each given token of the plurality of tokens; and

for each given element of a plurality of elements in a target sequence vector, by the one or more processors:

generating a fourth vector based on the combined encoder output vector and a target sequence vector;

routing, based on a learned gating function of the decoder, the fourth vector to two or more expert feed-forward networks of a set of expert feed-forward networks of the decoder;

generating a fifth vector based on processing the fourth vector in the two or more expert feed-forward networks of the set of expert feed-forward networks of the decoder; and

modifying the given element of the target sequence vector based on the fifth vector.
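
Illustrative note (not part of the claim): the encoder steps recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The claim requires only a learned gating function that routes each token vector to two or more expert feed-forward networks, followed by a layer with a single feed-forward network. The softmax gate, the top-2 selection, the renormalized weighted sum of expert outputs, the ReLU activation, the random weights, and all class and variable names below are hypothetical choices made for illustration. Attention sublayers, residual connections, normalization, and the decoder are omitted.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_matrix(rows, cols, rng):
    # Random weights stand in for learned parameters (assumption).
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class FeedForward:
    """A two-layer feed-forward network with ReLU (activation is an assumption)."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = make_matrix(d_hidden, d_model, rng)
        self.w_out = make_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


class MoESublayer:
    """Hypothetical MoE sublayer: a gate picks top_k experts per token vector."""

    def __init__(self, d_model, d_hidden, num_experts, rng, top_k=2):
        self.gate = make_matrix(num_experts, d_model, rng)
        self.experts = [FeedForward(d_model, d_hidden, rng) for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, x):
        # Score every expert, keep the top_k, and renormalize their weights.
        probs = softmax(matvec(self.gate, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def __call__(self, x):
        # Combine the chosen experts' outputs as a gate-weighted sum (assumption).
        out = [0.0] * len(x)
        for idx, weight in self.route(x):
            expert_out = self.experts[idx](x)
            out = [o + weight * v for o, v in zip(out, expert_out)]
        return out


rng = random.Random(0)
moe_layer = MoESublayer(d_model=4, d_hidden=8, num_experts=4, rng=rng)
dense_layer = FeedForward(d_model=4, d_hidden=8, rng=rng)

first_vector = [0.1, -0.2, 0.3, 0.4]        # vector representing one token
routes = moe_layer.route(first_vector)       # (expert index, weight) pairs
second_vector = moe_layer(first_vector)      # output of the MoE sublayer
third_vector = dense_layer(second_vector)    # output of the single-FFN layer
```

In this sketch, `routes` holds the two experts selected for the token, `second_vector` corresponds to the output of the MoE sublayer of the first layer, and `third_vector` to the output of the second layer's single feed-forward sublayer. The decoder steps of the claim would apply the same routing pattern to the fourth vector.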