US 11,836,438 B2
ML using n-gram induced input representation
Pengcheng He, Beijing (CN); Xiaodong Liu, Beijing (CN); Jianfeng Gao, Woodinville, WA (US); and Weizhu Chen, Kirkland, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Apr. 13, 2021, as Appl. No. 17/229,140.
Claims priority of provisional application 63/142,907, filed on Jan. 28, 2021.
Prior Publication US 2021/0232753 A1, Jul. 29, 2021
Int. Cl. G06F 40/126 (2020.01); G06N 3/08 (2023.01); G06F 40/151 (2020.01)
CPC G06F 40/126 (2020.01) [G06F 40/151 (2020.01); G06N 3/08 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A system comprising:
processing circuitry;
a memory coupled to the processing circuitry, the memory including a program stored thereon that, when executed by the processing circuitry, causes the processing circuitry to perform operations comprising:
converting a string of words to a series of tokens;
generating, using the series of tokens as input, a local string-dependent embedding of each token of the series of tokens;
generating, using the series of tokens as input and independent of the local string-dependent embedding, a global string-dependent embedding of each token of the series of tokens;
combining the local string-dependent embedding and the global string-dependent embedding to generate an n-gram induced embedding of each token of the series of tokens;
generating, by a position embedder, a relative position embedding that includes a vector representation of a position of each token in the series of tokens;
combining the relative position embedding and the n-gram induced embedding, resulting in a disentangled attention embedding for each token of the series of tokens;
obtaining a masked language model (MLM) previously trained to generate a masked word prediction; and
executing the MLM based on the disentangled attention embedding of each token to generate the masked word prediction.
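The claim does not fix concrete implementations for the embedding steps. A minimal PyTorch sketch of one plausible reading, assuming the local string-dependent embedding is produced by a 1-D convolution over n-gram windows and the global string-dependent embedding by a self-attention pass over the full token string, each computed independently and then combined; all class, module, and parameter names here are hypothetical, not drawn from the patent:

import torch
import torch.nn as nn

class NGramInducedEmbedder(nn.Module):
    # Hypothetical module; the claim does not mandate these specific layers.
    def __init__(self, vocab_size=30522, dim=128, ngram=3, heads=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)  # token lookup shared by both paths
        # Local path: each position sees only an n-gram window (odd n keeps length).
        self.local = nn.Conv1d(dim, dim, kernel_size=ngram, padding=ngram // 2)
        # Global path: each position attends to the entire string,
        # independent of the local path, as the claim requires.
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, token_ids):                 # token_ids: (batch, seq)
        e = self.tok(token_ids)                   # (batch, seq, dim)
        local = self.local(e.transpose(1, 2)).transpose(1, 2)
        global_out, _ = self.global_attn(e, e, e)
        # Combining the two paths yields the n-gram induced embedding per token.
        return local + global_out

Summation is only one way to "combine" the two embeddings; concatenation followed by a linear projection would satisfy the claim language equally well.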
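For the remaining steps, a DeBERTa-style reading (assumed here, not recited in the claim) treats relative position embeddings and the n-gram induced embeddings as the two disentangled inputs to attention, with a linear head producing the masked word prediction; rel_emb, q_r, k_r, and mlm_head are illustrative names:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledAttentionMLM(nn.Module):
    # Hypothetical sketch; single head and single layer, for clarity only.
    def __init__(self, vocab_size=30522, dim=128, max_rel=64):
        super().__init__()
        self.max_rel = max_rel
        # Relative position embedder: one vector per clipped distance i - j.
        self.rel_emb = nn.Embedding(2 * max_rel + 1, dim)
        self.q = nn.Linear(dim, dim)    # content projections
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.q_r = nn.Linear(dim, dim)  # relative-position projections
        self.k_r = nn.Linear(dim, dim)
        self.mlm_head = nn.Linear(dim, vocab_size)  # masked word predictor

    def forward(self, x):                         # x: (batch, seq, dim)
        _, n, d = x.shape
        pos = torch.arange(n)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel, self.max_rel)
        r = self.rel_emb(rel + self.max_rel)      # (seq, seq, dim)
        qc, kc, v = self.q(x), self.k(x), self.v(x)
        # Disentangled attention: content-to-content, content-to-position,
        # and position-to-content scores, summed then softmaxed.
        c2c = qc @ kc.transpose(-1, -2)
        c2p = torch.einsum('bid,ijd->bij', qc, self.k_r(r))
        p2c = torch.einsum('bjd,ijd->bij', kc, self.q_r(r))
        attn = F.softmax((c2c + c2p + p2c) / (3 * d) ** 0.5, dim=-1)
        return self.mlm_head(attn @ v)            # per-token logits over the vocabulary

Chaining the two sketches, logits = DisentangledAttentionMLM()(NGramInducedEmbedder()(ids)) maps a (batch, seq) tensor of token ids to (batch, seq, vocab_size) masked-word logits; the argmax at a masked position is the predicted word.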