US 12,260,340 B2
Extreme language model compression with optimal sub-words and shared projections
Yang Song, Bellevue, WA (US); Raghav Gupta, Mountain View, CA (US); Dengyong Zhou, Redmond, WA (US); and Sanqiang Zhao, Pittsburgh, PA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Sep. 21, 2023, as Appl. No. 18/471,866.
Application 18/471,866 is a continuation of application No. 16/749,570, filed on Jan. 22, 2020, granted, now Pat. No. 11,797,862.
Prior Publication US 2024/0013059 A1, Jan. 11, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/088 (2023.01); G06F 40/284 (2020.01); G06N 3/045 (2023.01)
CPC G06N 3/088 (2013.01) [G06F 40/284 (2020.01); G06N 3/045 (2023.01)] 31 Claims
OG exemplary drawing
 
1. A computing system for training a machine-learned model, the computing system comprising:
one or more processors; and
one or more non-transitory computer-readable media that collectively store:
a first language model comprising one or more transformer layers, wherein the first language model includes a plurality of first language model parameters, wherein each first language model parameter of the plurality of first language model parameters is associated with at least one transformer layer of the one or more transformer layers of the first language model;
a second language model comprising one or more transformer layers, wherein the second language model includes a plurality of second language model parameters, wherein each second language model parameter of the plurality of second language model parameters is associated with at least one transformer layer of the one or more transformer layers of the second language model, wherein the one or more transformer layers of the second language model are of a different dimension than the one or more transformer layers of the first language model; and
instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:
projecting the first language model parameters into a shared space with the second language model parameters; and
training the second language model using a loss function based on a comparison of the projected first language model parameters and the second language model parameters.
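 
[Illustrative note, not part of the patent text.] The claim describes projecting the parameters of a first (teacher) language model into a shared space with the parameters of a dimensionally smaller second (student) model, then training the student with a loss that compares the projected teacher parameters against the student parameters. The following is a minimal sketch of one plausible reading of that training step, assuming PyTorch; the layer dimensions, the single weight matrix per model, the trainable projection matrix U, and the mean-squared-error loss are all hypothetical choices for illustration, not the claimed implementation.

    # Hypothetical sketch of parameter-projection distillation (not the patented method).
    import torch

    d_teacher, d_student = 768, 192  # hypothetical transformer layer dimensions

    # Trainable projection mapping teacher-dimension parameters into the
    # student's (shared) dimension.
    U = torch.nn.Parameter(torch.randn(d_teacher, d_student) * 0.02)

    teacher_weight = torch.randn(d_teacher, d_teacher)  # frozen first-model parameter
    student_weight = torch.nn.Parameter(
        torch.randn(d_student, d_student) * 0.02)       # second-model parameter

    optimizer = torch.optim.Adam([U, student_weight], lr=1e-4)

    for step in range(100):
        optimizer.zero_grad()
        # Project the teacher parameter matrix into the student's space:
        # (d_t x d_t) -> (d_s x d_s) via U^T W U.
        projected = U.t() @ teacher_weight @ U
        # Loss based on a comparison of the projected first-model parameters
        # and the second-model parameters, per the final claimed operation.
        loss = torch.nn.functional.mse_loss(student_weight, projected)
        loss.backward()
        optimizer.step()

In practice a parameter-comparison loss of this kind would typically be combined with a conventional distillation loss over model outputs; the projection is what makes teacher and student parameters of different dimensions directly comparable.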