US 12,223,269 B2
Language-model pretraining with gradient-disentangled embedding sharing
Pengcheng He, Sammamish, WA (US); Jianfeng Gao, Woodinville, WA (US); and Weizhu Chen, Kirkland, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on May 18, 2022, as Appl. No. 17/664,031.
Claims priority of provisional application 63/264,163, filed on Nov. 16, 2021.
Prior Publication US 2023/0153532 A1, May 18, 2023
Int. Cl. G06F 40/284 (2020.01); G06F 40/295 (2020.01); G06N 3/08 (2023.01); G06N 5/04 (2023.01)
CPC G06F 40/284 (2020.01) [G06F 40/295 (2020.01); G06N 3/08 (2013.01); G06N 5/04 (2013.01)] 20 Claims
OG exemplary drawing
 
9. A language-processing service configured for natural language understanding (NLU), the language-processing service comprising:
a language model including:
an upstream sequence of transformer blocks configured to receive vectorized training data and emit modified vectorized training data during pretraining, the upstream sequence of transformer blocks including an upstream data embedding,
a downstream sequence of transformer blocks configured to receive the modified vectorized training data and emit pretraining output during the pretraining, the downstream sequence of transformer blocks including a downstream data embedding equivalent to the upstream data embedding,
wherein pretraining logic operative during the pretraining is configured to adjust the upstream data embedding and the downstream data embedding by computing a gradient of the upstream data embedding disentangled from a gradient of the downstream data embedding,
wherein the gradient of the upstream data embedding is computed based on a loss function of the upstream sequence of transformer blocks and not on a loss function of the downstream sequence of transformer blocks, and
wherein the gradient of the downstream data embedding is computed based on the loss function of the upstream sequence of transformer blocks and the loss function of the downstream sequence of transformer blocks,
wherein the upstream and downstream sequences of transformer blocks are configured to collectively solve a multitask pretraining problem;
an input module configured to convey language input to the language model; and
an output module configured to expose an output of the language model.
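
The gradient-disentangled embedding sharing recited in claim 9 can be illustrated with a minimal sketch, assuming a PyTorch-style framework and an ELECTRA-style two-model setup (for example, a masked-language-model loss for the upstream sequence and a replaced-token-detection loss for the downstream sequence). The class name GDESEmbedding, the residual table delta, and the loss names below are illustrative assumptions, not the claimed implementation; the sketch only shows how a stop-gradient (detach) prevents the downstream loss from adjusting the shared upstream embedding, while the downstream embedding still reflects both losses.

    import torch
    import torch.nn as nn

    class GDESEmbedding(nn.Module):
        """Downstream (discriminator-side) data embedding that shares the
        upstream (generator-side) embedding table without letting the
        downstream loss propagate gradient back into that table."""

        def __init__(self, upstream_embedding: nn.Embedding):
            super().__init__()
            # E_G: trained only by the upstream loss (hypothetical naming).
            self.upstream = upstream_embedding
            # E_Delta: residual table trained only by the downstream loss.
            self.delta = nn.Embedding(upstream_embedding.num_embeddings,
                                      upstream_embedding.embedding_dim)
            nn.init.zeros_(self.delta.weight)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            # detach() disentangles the gradients: the downstream loss
            # cannot reach self.upstream, so that table is adjusted only
            # by the upstream loss, while the downstream embedding
            # (upstream.detach() + delta) still tracks both losses.
            return self.upstream(token_ids).detach() + self.delta(token_ids)

    # Usage sketch (hypothetical names):
    #   shared = nn.Embedding(vocab_size, hidden_dim)   # upstream data embedding
    #   disc_embedding = GDESEmbedding(shared)          # downstream data embedding
    #   total_loss = upstream_loss + downstream_loss
    #   total_loss.backward()
    # Only the upstream loss contributes gradient to `shared`; the downstream
    # loss contributes gradient only to `delta`, yet the downstream embedding
    # as a whole is updated under the influence of both losses.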