US 12,481,841 B2
Pre-training a unified natural language model with corrupted span and replaced token detection
Pengcheng He, Sammamish, WA (US); Jianfeng Gao, Woodinville, WA (US); Nanshan Zeng, Bellevue, WA (US); Xuedong Huang, Yarrow Point, WA (US); Wei Xiong, Bellevue, WA (US); and Baolin Peng, Issaquah, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Oct. 20, 2022, as Appl. No. 17/970,174.
Claims priority of provisional application 63/398,443, filed on Aug. 16, 2022.
Prior Publication US 2024/0062018 A1, Feb. 22, 2024
Int. Cl. G06F 40/56 (2020.01); G06F 40/149 (2020.01); G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06F 40/51 (2020.01)
CPC G06F 40/56 (2020.01) [G06F 40/149 (2020.01); G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06F 40/51 (2020.01)] 18 Claims
OG exemplary drawing
 
1. A computer-implemented method for performing a two-step pre-training process for a natural language model, the method comprising:
accessing the natural language model;
obtaining a set of training data;
creating a set of tokens for the training data;
as a part of a first step of the two-step pre-training process, generating corrupted span data from the set of tokens by masking a subset of the set of tokens;
as a part of a second step of the two-step pre-training process, generating replacement token detection span data by replacing the subset of masked tokens in the corrupted span data with a set of ambiguous tokens;
incrementally modifying the corrupted span data by replacing, in an incremental manner, other masked tokens in the corrupted span data with replacement tokens, wherein said incremental modification of the corrupted span data is performed until a total number of the replacement tokens that are included in the corrupted span data reaches at least a specified percentage relative to a total number of tokens included in the corrupted span data, and wherein the specified percentage for the total number of replacement tokens relative to the total number of tokens is at least 10%; and
training an encoder of the natural language model with the replacement token detection span data and the incrementally modified corrupted span data.
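
The claim recites the pipeline end to end; the sketches below illustrate the recited steps in turn. First, the two pre-training steps: masking a subset of tokens to form the corrupted span data, then swapping the masked subset for "ambiguous" tokens to form the replacement token detection span data. The Python below is a minimal, hypothetical sketch, not the patented implementation: the span-masking ratio, the [MASK] sentinel, and the source of the ambiguous candidates (in practice these would plausibly be sampled from a generator model, in the style of ELECTRA-like replaced token detection; here the caller supplies a candidate list) are all assumptions.

    import random

    MASK = "[MASK]"  # hypothetical mask sentinel; the claim does not name one

    def corrupt_span(tokens, mask_ratio=0.15, rng=None):
        # First step: mask a contiguous subset of the tokens, producing
        # the corrupted span data. The 15% ratio is an assumption.
        rng = rng or random.Random(0)
        corrupted = list(tokens)
        n_mask = max(1, int(len(tokens) * mask_ratio))
        start = rng.randrange(len(tokens) - n_mask + 1)
        masked_positions = list(range(start, start + n_mask))
        for i in masked_positions:
            corrupted[i] = MASK
        return corrupted, masked_positions

    def replace_with_ambiguous(corrupted, masked_positions, candidates, rng=None):
        # Second step: replace the masked subset with ambiguous tokens,
        # yielding the replacement token detection span data. The
        # candidate list stands in for generator-sampled tokens.
        rng = rng or random.Random(1)
        rtd = list(corrupted)
        for i in masked_positions:
            rtd[i] = rng.choice(candidates)
        return rtd

For example, corrupt_span("the cat sat on the mat".split(), mask_ratio=0.4) masks a two-token span, say "cat sat", and replace_with_ambiguous would then fill those slots with plausible distractors such as "dog" and "stood".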
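
The incremental-modification clause admits a similarly short sketch (continuing the fragment above, so random and MASK are reused). Assumptions: replacements are drawn from a caller-supplied pool, one token is swapped per increment, and the percentage is measured against all tokens in the corrupted span data; note that for the loop to reach the claimed 10% floor, at least that fraction of tokens must have been masked in the first place.

    def incrementally_replace(corrupted, masked_positions, pool,
                              min_pct=0.10, rng=None):
        # Replace remaining masked tokens one at a time until replacement
        # tokens make up at least min_pct (10% per the claim) of all tokens.
        rng = rng or random.Random(2)
        data = list(corrupted)
        total = len(data)
        pending = [i for i in masked_positions if data[i] == MASK]
        replaced = 0
        while pending and replaced / total < min_pct:
            data[pending.pop(0)] = rng.choice(pool)
            replaced += 1
        return data, replaced

Stopping as soon as the floor is met keeps the number of replacement tokens close to the specified percentage; a real implementation might instead schedule the increments across training batches.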
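
Finally, training the encoder on the two derived datasets amounts, in spirit, to a token-level replaced-vs.-original discrimination task. The PyTorch sketch below assumes an ELECTRA-style binary replaced token detection loss over a toy transformer encoder; the model size, the loss, and the label construction are illustrative choices, not details taken from the patent.

    import torch
    import torch.nn as nn

    class ToyEncoder(nn.Module):
        # Tiny transformer encoder with a per-token "was this token
        # replaced?" head (ELECTRA-style discriminator; illustrative only).
        def __init__(self, vocab_size, d_model=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(d_model, 1)

        def forward(self, input_ids):
            hidden = self.encoder(self.embed(input_ids))
            return self.head(hidden).squeeze(-1)  # one logit per token

    def rtd_loss(model, input_ids, replaced_mask):
        # Binary cross-entropy: predict, per token, whether it was replaced.
        logits = model(input_ids)
        return nn.functional.binary_cross_entropy_with_logits(
            logits, replaced_mask.float())

    # Toy usage: ids for the modified sequence, plus 0/1 labels marking
    # the positions that hold replacement tokens.
    model = ToyEncoder(vocab_size=1000)
    ids = torch.randint(0, 1000, (2, 16))  # batch of token ids
    labels = torch.zeros(2, 16)
    labels[:, 3:6] = 1  # pretend positions 3..5 were replaced
    loss = rtd_loss(model, ids, labels)
    loss.backward()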