US 12,380,332 B2
Forcing weights of transformer model layers
Andy Wagner, Cupertino, CA (US); Tiyasa Mitra, San Jose, CA (US); and Marc Tremblay, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Sep. 9, 2020, as Appl. No. 17/016,184.
Prior Publication US 2022/0076127 A1, Mar. 10, 2022
Int. Cl. G06N 3/084 (2023.01); G06N 3/063 (2023.01)
CPC G06N 3/084 (2013.01) [G06N 3/063 (2013.01)] 17 Claims
OG exemplary drawing
 
1. A system comprising:
a set of processing units; and
a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit to:
train a transformer model over repeated iterations of training operations, the training operations facilitating a gradual convergence of weights in adjacent layers of the transformer model such that the weights in a pair of the adjacent layers approach identical values without transmitting weights between the adjacent layers, the training operations including:
receive, at a first layer included in the transformer model, input data;
process the input data through the first layer of the transformer model so that the first layer of the transformer model outputs a first output data;
process the first output data through the first layer of the transformer model so that the first layer of the transformer model outputs a second output data;
process the first output data through a second layer included in the transformer model so that the second layer of the transformer model outputs a third output data;
calculate a difference between the second output data of the first layer and the third output data of the second layer; and
adjust first weights included in the first layer of the transformer model based on a summation between gradient data and a difference term, the gradient data being received at the first layer during a backpropagation training step and the difference term quantifying a difference between the second output data generated by the first layer and the third output data generated by the second layer.
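The following is a minimal sketch, in PyTorch, of one way the claimed training step could be realized; it is an illustrative interpretation, not the patented implementation. The names (layer1, layer2, alpha, task_loss, lr) are hypothetical, and the "summation between gradient data and a difference term" is approximated here by adding a difference penalty to the loss so that each first-layer weight update is the sum of the task gradient and the gradient of the difference term; the claim does not prescribe this exact formulation.

```python
# Sketch only: two adjacent transformer layers trained so their outputs converge,
# per the claim's steps; all hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, nhead, seq, batch = 64, 4, 10, 2
layer1 = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, batch_first=True)
layer2 = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, batch_first=True)

x = torch.randn(batch, seq, d_model)   # input data received at the first layer
out1 = layer1(x)                        # first output data (first layer on input)
out2 = layer1(out1)                     # second output data (first layer on its own output)
out3 = layer2(out1)                     # third output data (second layer on the first output)

# Difference term quantifying the gap between the two layers' outputs.
diff = F.mse_loss(out2, out3)

# Stand-in task loss; in practice this would be the model's real training loss,
# whose backpropagated gradient is the "gradient data" received at the first layer.
task_loss = out3.pow(2).mean()
alpha = 0.1                             # illustrative weighting of the difference term
(task_loss + alpha * diff).backward()

# Adjust the first layer's weights using the summed gradient (task + difference term).
with torch.no_grad():
    lr = 1e-3
    for p in layer1.parameters():
        if p.grad is not None:
            p -= lr * p.grad
```

Repeating this step over many iterations drives the outputs (and hence, indirectly, the weights) of the adjacent layers toward one another without ever copying weights from one layer to the other, which is the convergence behavior the claim describes.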