US 12,072,955 B2
Parameter utilization for language pre-training
Chen Xing, Singapore (SG); Wenhao Liu, Redwood City, CA (US); Chu Hong Hoi, Singapore (SG); Nitish Shirish Keskar, San Francisco, CA (US); and Caiming Xiong, Menlo Park, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by salesforce.com, inc., San Francisco, CA (US)
Filed on Nov. 22, 2021, as Appl. No. 17/532,851.
Claims priority of provisional application 63/194,141, filed on May 27, 2021.
Prior Publication US 2022/0391640 A1, Dec. 8, 2022
Int. Cl. G06F 18/214 (2023.01); G06F 18/21 (2023.01); G06F 40/00 (2020.01)
CPC G06F 18/2148 (2023.01) [G06F 18/2163 (2023.01); G06F 40/00 (2020.01)] 20 Claims
OG exemplary drawing
 
1. A method for pre-training a transformer model, the method comprising:
dividing the transformer model stored in memory into a held-out model and a main model, wherein the held-out model comprises attention heads of the transformer model from a portion of a predefined number of lower layers of the transformer model;
performing, using a training dataset, a forward pass on the held-out model, wherein the forward pass determines self-attention hidden states of the held-out model at corresponding layers in the predefined number of lower layers;
performing, using the training dataset, a forward pass on the main model, wherein the forward pass comprises:
determining self-attention hidden states of the main model at a corresponding layer;
concatenating the self-attention hidden states of the main model at the corresponding layer with the self-attention hidden states of the held-out model at the corresponding layer, wherein the concatenated self-attention hidden states are inputs to a layer subsequent to the corresponding layer of the main model;
performing a backward pass on the held-out model, wherein the backward pass determines a loss of the held-out model;
performing a backward pass on the main model, wherein the backward pass determines a loss of the main model; and
updating parameters of the held-out model based on the loss of the held-out model and parameters of the main model based on the loss of the main model.
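Illustrative sketch (not part of the issued claims and not the patented implementation): the snippet below shows one way the flow of claim 1 could be wired up in PyTorch. The module names, layer sizes, head split, toy loss, separate optimizers, and the detach() call that keeps each sub-model tied to its own loss are assumptions made for illustration only; what follows the claim is the overall flow: dividing the transformer into a held-out model and a main model, a forward pass on each, concatenation of lower-layer self-attention hidden states, and separate backward passes and parameter updates.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): 64-dim model, lower-layer heads split into
# 6 "main" heads (48 dims) and 2 "held-out" heads (16 dims), 2 lower layers,
# 2 upper layers belonging to the main model only.
VOCAB, D_MODEL, D_MAIN, D_HELD, N_LOWER, N_UPPER = 1000, 64, 48, 16, 2, 2

class AttnBlock(nn.Module):
    """Projects the input and runs multi-head self-attention, returning the
    self-attention hidden states of this layer."""
    def __init__(self, d_in, d_out, n_heads):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)
        self.attn = nn.MultiheadAttention(d_out, n_heads, batch_first=True)

    def forward(self, x):
        h = self.proj(x)
        out, _ = self.attn(h, h, h)
        return out

class HeldOutModel(nn.Module):
    """Held-out model: the held-out attention heads of the lower layers."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.layers = nn.ModuleList(
            [AttnBlock(D_MODEL if i == 0 else D_HELD, D_HELD, 2)
             for i in range(N_LOWER)])
        self.head = nn.Linear(D_HELD, VOCAB)

    def forward(self, tokens):
        x = self.embed(tokens)
        hiddens = []                       # held-out hidden states per lower layer
        for layer in self.layers:
            x = layer(x)
            hiddens.append(x)
        return hiddens, self.head(x)       # states reused by the main model + logits

class MainModel(nn.Module):
    """Main model: the remaining lower-layer heads plus the upper layers."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.lower = nn.ModuleList(
            [AttnBlock(D_MODEL, D_MAIN, 6) for _ in range(N_LOWER)])
        self.upper = nn.ModuleList(
            [AttnBlock(D_MODEL, D_MODEL, 8) for _ in range(N_UPPER)])
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, held_hiddens):
        x = self.embed(tokens)
        for layer, held in zip(self.lower, held_hiddens):
            h_main = layer(x)
            # Concatenate the main and held-out self-attention hidden states of
            # the corresponding layer; the result feeds the subsequent layer.
            # detach() (an assumption) keeps the held-out parameters tied to
            # their own loss only, matching the separate updates in the claim.
            x = torch.cat([h_main, held.detach()], dim=-1)
        for layer in self.upper:
            x = layer(x)
        return self.head(x)

held_out, main = HeldOutModel(), MainModel()
opt_h = torch.optim.Adam(held_out.parameters(), lr=1e-4)
opt_m = torch.optim.Adam(main.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, VOCAB, (4, 16))    # stand-in training batch
targets = torch.randint(0, VOCAB, (4, 16))   # stand-in pre-training targets

held_hiddens, held_logits = held_out(tokens)     # forward pass, held-out model
main_logits = main(tokens, held_hiddens)         # forward pass, main model

loss_h = loss_fn(held_logits.view(-1, VOCAB), targets.view(-1))
loss_m = loss_fn(main_logits.view(-1, VOCAB), targets.view(-1))
loss_h.backward()                                # backward pass, held-out model
loss_m.backward()                                # backward pass, main model
opt_h.step(); opt_m.step()                       # separate parameter updates
```

In this sketch the two sub-models are updated by independent optimizers from their own losses, which mirrors the final step of the claim; how the actual patent shares embeddings, masks tokens, or schedules the two updates is not specified here and is left as stated in the specification.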