US 12,443,839 B2
Hyperparameter transfer via the theory of infinite-width neural networks
Jingfeng Hu, Redmond, WA (US); Ge Yang, Redmond, WA (US); Xiaodong Liu, Redmond, WA (US); and Jianfeng Gao, Woodinville, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Aug. 21, 2020, as Appl. No. 17/000,065.
Prior Publication US 2022/0058477 A1, Feb. 24, 2022
Int. Cl. G06N 3/08 (2023.01); G06N 3/045 (2023.01)

CPC G06N 3/08 (2013.01) [G06N 3/045 (2023.01)]

20 Claims

1. A method for tuning one or more hyperparameters of a large neural network model, wherein the large neural network model comprises an infinitely-wide neural network model, the method comprising:

receiving the large neural network model;

parameterizing the large neural network model according to a parameterization scheme, wherein when a transformer is used in the large neural network model, the parameterization scheme comprises a dot-product attention logit scaler hyperparameter;

reducing a width of at least one layer of the large neural network model resulting in a smaller neural network model, the smaller neural network model comprising at least a reduced width of one or more layers of the infinitely-wide neural network model;

performing a hyperparameter tuning process using the smaller neural network model to identify a tuned hyperparameter, wherein a model scaling process based on an estimated amount of computational resources, energy, and/or time is used to tune the hyperparameter, and wherein the hyperparameter is tuned using a logit scaling parameter;

identifying an optimized tuple of hyperparameters based on the tuning process;

adjusting the tuning process based on the optimized tuple of hyperparameters, wherein the optimized tuple minimizes a predetermined loss function;

returning the smaller neural network based on using the adjusted tuning process; and

transferring the tuned hyperparameter to the large neural network model using an identified scaling factor.
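
For illustration only, the steps recited in claim 1 might be sketched as follows. This is a minimal sketch under stated assumptions and is not the claimed implementation. All function names are hypothetical, `toy_loss` is a stand-in for actually training the smaller model, the division of attention logits by the head dimension is an assumed form of the logit scaler, and the width ratio used as the scaling factor for the learning rate is an assumed transfer rule that the claim does not specify.

```python
import math


def shrink_widths(widths, factor):
    """Reduce each layer width by `factor` to obtain the smaller model."""
    return [max(1, w // factor) for w in widths]


def attention_logit(q, k, alpha=1.0):
    """Dot-product attention logit with a tunable scaler `alpha`.

    Assumption: the logit is divided by the head dimension d. The claim
    only states that a logit scaler hyperparameter exists.
    """
    d = len(q)
    return alpha * sum(qi * ki for qi, ki in zip(q, k)) / d


def toy_loss(widths, lr):
    """Hypothetical stand-in for training the small model and measuring loss."""
    return (math.log10(lr) + 2.0) ** 2


def tune(widths, candidates, loss_fn=toy_loss):
    """Pick the candidate learning rate with the lowest loss on the small model."""
    return min(candidates, key=lambda lr: loss_fn(widths, lr))


def transfer_lr(lr, small_width, large_width):
    """Rescale the tuned learning rate by the width ratio (assumed rule)."""
    return lr * small_width / large_width


large = [4096, 4096]                      # widths of the large model
small = shrink_widths(large, 16)          # widths of the smaller proxy model
best_lr = tune(small, [1e-4, 1e-3, 1e-2, 1e-1])
large_lr = transfer_lr(best_lr, small[0], large[0])
print(small, best_lr, large_lr)
```

The point of the sketch is that the expensive search runs only on the narrow model, and a single scaling factor derived from the two widths carries the result to the wide model.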