US 11,790,214 B2
Mixture of experts neural networks
Noam M. Shazeer, Palo Alto, CA (US); Azalia Mirhoseini, San Jose, CA (US); and Krzysztof Stanislaw Maziarz, Jaslo (PL)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on May 20, 2020, as Appl. No. 16/879,187.
Application 16/879,187 is a continuation of application No. 16/393,063, filed on Apr. 24, 2019, granted, now Pat. No. 10,719,761.
Application 16/393,063 is a continuation of application No. PCT/US2017/059909, filed on Nov. 3, 2017.
Claims priority of provisional application 62/432,497, filed on Dec. 9, 2016.
Claims priority of provisional application 62/418,135, filed on Nov. 4, 2016.
Prior Publication US 2020/0279150 A1, Sep. 3, 2020
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/045 (2023.01); G06N 3/08 (2023.01)
CPC G06N 3/045 (2023.01) [G06N 3/08 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A system comprising:
a main neural network implemented by one or more computers, the main neural network comprising a Mixture of Experts (MoE) subnetwork between a first neural network layer and a second neural network layer in the main neural network, wherein the MoE subnetwork comprises:
a plurality of expert neural networks, wherein each expert neural network is configured to process a first layer output generated by the first neural network layer in accordance with a respective set of expert parameters of the expert neural network to generate a respective expert output, and
a gating subsystem configured to:
generate a modified first layer output by applying a set of gating parameters to the first layer output,
add a final noise output to the modified first layer output to generate an initial gating output, wherein the final noise output is a vector having a plurality of elements, wherein each of the plurality of elements corresponds to a respective expert neural network of the plurality of expert neural networks, and wherein the number of elements in the vector is the same as the number of expert neural networks in the plurality of expert neural networks, and
select, based on the initial gating output generated by adding the final noise output to the modified first layer output, one or more of the expert neural networks and determine a respective weight for each selected expert neural network,
provide the first layer output as input to each of the selected expert neural networks,
combine the expert outputs generated by the selected expert neural networks in accordance with the weights for the selected expert neural networks to generate an MoE output, and
provide the MoE output as input to the second neural network layer.
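For illustration only, the following is a minimal NumPy sketch of the gating mechanism recited in claim 1: gating parameters applied to the first layer output, a per-expert noise vector added to form the initial gating output, top-k expert selection with softmax-derived weights, and a weighted combination of expert outputs. All identifiers (W_gate, W_noise, top_k, the linear experts) are illustrative assumptions, not language from the patent, which permits arbitrary expert neural networks and noise processes.

```python
# Illustrative sketch of noisy top-k MoE gating; not the patented implementation.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Assumed learnable parameters: gating matrix and noise-scaling matrix.
W_gate = rng.normal(size=(d_model, n_experts))
W_noise = rng.normal(size=(d_model, n_experts))

# Each "expert" is a single linear map here for brevity; the claim allows
# expert neural networks with their own respective parameter sets.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def softplus(z):
    return np.log1p(np.exp(z))

def moe_forward(x):
    # Modified first layer output: apply the gating parameters.
    modified = x @ W_gate                          # shape: (n_experts,)
    # Final noise output: a vector with one element per expert neural
    # network, added to produce the initial gating output.
    noise = rng.standard_normal(n_experts) * softplus(x @ W_noise)
    initial_gating = modified + noise
    # Select one or more experts (top-k) and derive a weight for each
    # by normalizing over the selected entries only.
    selected = np.argsort(initial_gating)[-top_k:]
    logits = initial_gating[selected]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Provide the (unmodified) first layer output to each selected expert
    # and combine the expert outputs according to their weights.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, selected))

x = rng.normal(size=d_model)   # first layer output
moe_out = moe_forward(x)       # MoE output, fed to the second layer
print(moe_out.shape)           # (8,)
```

Because only the top-k experts receive the input, the per-example computation stays roughly constant as the number of experts grows; the noise term encourages exploration so that expert selection does not collapse onto a few experts during training.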