US 12,461,993 B2
Parameter utilization for language pre-training
Chen Xing, Palo Alto, CA (US); Wenhao Liu, Redwood City, CA (US); Chu Hong Hoi, Singapore (SG); Nitish Shirish Keskar, San Francisco, CA (US); and Caiming Xiong, Menlo Park, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on Jun. 10, 2024, as Appl. No. 18/738,628.
Application 18/738,628 is a continuation of application No. 17/532,851, filed on Nov. 22, 2021, granted, now 12,072,955.
Claims priority of provisional application 63/194,141, filed on May 27, 2021.
Prior Publication US 2024/0330409 A1, Oct. 3, 2024
Int. Cl. G06F 18/21 (2023.01); G06F 18/214 (2023.01); G06F 40/00 (2020.01)
CPC G06F 18/2148 (2023.01) [G06F 18/2163 (2023.01); G06F 40/00 (2020.01)] 20 Claims
OG exemplary drawing
 
1. A method for training a neural network model, the method comprising:
dividing the neural network model stored in a memory into a held-out model and a main model;
during a first forward pass on the held-out model, determining, using a training dataset comprising words in a natural language, held-out model hidden states from attention heads of the held-out model;
determining a first loss based on the first forward pass and a first backward pass on the held-out model;
during a second forward pass on the main model:
determining, using the training dataset, main model hidden states from attention heads of the main model;
concatenating the held-out model hidden states and the main model hidden states into concatenated hidden states; and
propagating the concatenated hidden states through a subset of layers of the main model, wherein the concatenated hidden states cause the main model to recognize language patterns different from the held-out model;
determining a second loss based on the second forward pass and a second backward pass on the main model;
updating parameters of the held-out model based on the first loss; and
updating parameters of the main model based on the second loss.
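The claimed training loop can be sketched in plain NumPy. Everything concrete below is an illustrative assumption rather than the patent's implementation: the sub-models are single linear layers standing in for attention heads, the loss is MSE on a toy regression target, and the "backward passes" are analytic gradients with SGD updates. The sketch does preserve the claim's structure: a first forward/backward pass on the held-out model, a second forward pass on the main model in which the held-out hidden states are concatenated with the main-model hidden states and propagated through the main model's upper layer, and separate parameter updates driven by each model's own loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the patent's two sub-models (all shapes, the linear
# layers, and the MSE loss are illustrative assumptions, not from the claim).
D_IN, D_HID, N = 8, 4, 16
Wh = rng.normal(size=(D_IN, D_HID)) * 0.1    # held-out model hidden layer
vh = rng.normal(size=(D_HID, 1)) * 0.1       # held-out model output head
Wm1 = rng.normal(size=(D_IN, D_HID)) * 0.1   # main model lower layer
Wm2 = rng.normal(size=(2 * D_HID, 1)) * 0.1  # main model upper layer; consumes
                                             # the concatenated hidden states
x = rng.normal(size=(N, D_IN))               # toy batch (stands in for text)
y = rng.normal(size=(N, 1))                  # toy target

def train_step(Wh, vh, Wm1, Wm2, lr=0.05):
    # --- first forward pass: held-out model hidden states and first loss ---
    h_held = x @ Wh
    e1 = h_held @ vh - y
    loss1 = float(np.mean(e1 ** 2))
    # first backward pass (analytic gradients for the linear layers)
    g1 = 2.0 * e1 / N
    d_vh = h_held.T @ g1
    d_Wh = x.T @ (g1 @ vh.T)
    # --- second forward pass: main model hidden states, then concatenation
    # of held-out and main hidden states, propagated through the upper layer
    h_main = x @ Wm1
    h_cat = np.concatenate([h_held, h_main], axis=1)
    e2 = h_cat @ Wm2 - y
    loss2 = float(np.mean(e2 ** 2))
    # second backward pass; the held-out states are treated as constants, so
    # the second loss updates only main-model parameters, as in the claim
    g2 = 2.0 * e2 / N
    d_Wm2 = h_cat.T @ g2
    d_Wm1 = x.T @ (g2 @ Wm2[D_HID:].T)  # rows D_HID: feed the main states
    # update each sub-model's parameters from its own loss
    return (Wh - lr * d_Wh, vh - lr * d_vh,
            Wm1 - lr * d_Wm1, Wm2 - lr * d_Wm2, loss1, loss2)

losses = []
for _ in range(50):
    Wh, vh, Wm1, Wm2, l1, l2 = train_step(Wh, vh, Wm1, Wm2)
    losses.append((l1, l2))
print(f"held-out loss: {losses[0][0]:.3f} -> {losses[-1][0]:.3f}")
print(f"main loss:     {losses[0][1]:.3f} -> {losses[-1][1]:.3f}")
```

Keeping the two losses and the two update steps separate is what lets the held-out model and the main model learn different patterns from the same training data; the concatenation is the only point where information flows (forward-only) from the held-out model into the main model.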