| CPC G06N 3/084 (2013.01) [G06N 3/063 (2013.01)] | 17 Claims |

|
1. A system comprising:
a set of processing units; and
a non-transitory machine-readable medium storing instructions that when executed by at least one processing unit in the set of processing units cause the at least one processing unit to:
train a transformer model over repeated iterations of training operations, the training operations facilitating a gradual convergence of weights in adjacent layers of the transformer model such that the weights in a pair of the adjacent layers approach identical values without transmitting weights between the adjacent layers, the training operations including:
receive, at a first layer included in the transformer model, input data;
process the input data through the first layer of the transformer model so that the first layer of the transformer model outputs a first output data;
process the first output data through the first layer of the transformer model so that the first layer of the transformer model outputs a second output data;
process the first output data through a second layer included in the transformer model so that the second layer of the transformer model outputs a third output data;
calculate a difference between the second output data of the first layer and the third output data of the second layer; and
adjust first weights included in the first layer of the transformer model based on a summation between gradient data and a difference term, the gradient data being received at the first layer during a backpropagation training step and the difference term quantifying a difference between the second output data generated by the first layer and the third output data generated by the second layer.
|
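The training operations recited in claim 1 can be illustrated with a minimal NumPy sketch. It is one possible reading under stated assumptions, not the claimed implementation: each transformer layer is replaced by a single linear map (the hypothetical `layer` function), the difference term is assumed to be the gradient of a squared-difference penalty with respect to the first weights, the backpropagated gradient data is a zero placeholder, and the names `training_step`, `lr`, and `coeff` are illustrative only.

```python
import numpy as np


def layer(weights, x):
    """Hypothetical stand-in for a transformer layer: one linear map."""
    return x @ weights


def training_step(w1, w2, x, grad_w1, lr=0.05, coeff=1.0):
    """One iteration of the recited operations for the first layer's weights."""
    out1 = layer(w1, x)      # first output data: first layer applied to the input
    out2 = layer(w1, out1)   # second output data: first layer applied to out1
    out3 = layer(w2, out1)   # third output data: second layer applied to out1
    diff = out2 - out3       # difference between second and third output data
    # Assumed form of the difference term: the gradient of
    # (0.5 / batch) * ||out2 - out3||^2 with respect to w1,
    # treating out1 and out3 as constants.
    diff_term = out1.T @ diff / x.shape[0]
    # Adjust the first weights using the sum of gradient data and difference term.
    return w1 - lr * (grad_w1 + coeff * diff_term)


rng = np.random.default_rng(0)
w1 = 0.5 * rng.standard_normal((4, 4))
w2 = 0.5 * rng.standard_normal((4, 4))
x = rng.standard_normal((8, 4))
grad_w1 = np.zeros_like(w1)  # placeholder for backpropagated gradient data

gap_before = np.linalg.norm(w1 - w2)
for _ in range(200):
    w1 = training_step(w1, w2, x, grad_w1)
gap_after = np.linalg.norm(w1 - w2)
print(gap_before, gap_after)
```

In this sketch the update to `w1` is computed only from the three outputs, so the first layer never reads the second layer's weights directly. With the linear stand-in, the difference equals `out1 @ (w1 - w2)`, which is why repeated steps draw the two weight matrices together.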