US 12,260,338 B2
Transformer-based neural network including a mask attention network
Jian Jiao, Bellevue, WA (US); Yeyun Gong, Beijing (CN); Nan Duan, Beijing (CN); Ruofei Zhang, Mountain View, CA (US); and Ming Zhou, Beijing (CN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Aug. 27, 2020, as Appl. No. 17/005,067.
Prior Publication US 2022/0067533 A1, Mar. 3, 2022
Int. Cl. G06N 3/088 (2023.01); G06N 3/045 (2023.01); G06N 3/063 (2023.01)

CPC G06N 3/088 (2013.01) [G06N 3/045 (2023.01); G06N 3/063 (2013.01)]

20 Claims

1. A computer-readable storage medium for storing computer-readable instructions, the computer-readable instructions, when executed by one or more hardware processors, performing a method by a particular transformer at a particular level in a stack of transformers of a transformer-based neural network, the method comprising:

receiving a sequence of data items provided by an application;

in a first neural network in a pipeline of neural networks of the particular transformer, processing the sequence of data items using a mask attention network to produce a first output result,

the mask attention network performing operations of:

generating an original attention data structure that expresses influence between pairs of data items in the sequence of data items;

dynamically generating a mask data structure that is a mask that contains mask values, a particular mask value of the mask values being produced by determining a separation between a particular pair of data items in the sequence of data items at the particular level, selecting and retrieving a distance-related machine-trained value based on the separation between the particular pair of data items that has been determined, and determining the particular mask value based, in part, on the machine-trained value,

the machine-trained value that is selected and retrieved being independent of meanings of the particular pair of data items,

the distance-related machine-trained value being selected and retrieved from a stored set of distance-related machine-trained values associated with different respective separations between pairs of data items and for different levels,

a training system having previously produced the stored set of distance-related machine-trained values for the different respective separations by iteratively operating on a set of training examples to achieve a training objective;

producing a modified attention data structure by modifying the original attention data structure by the mask values provided by the mask data structure;

processing the first output result using another attention network that does not use a mask data structure and the stored set of distance-related machine-trained values associated with different separations, to provide a second output result, said another attention network being a second neural network that follows the first neural network in the pipeline; and

processing the second output result by a feed-forward neural network to produce a third output result, the feed-forward neural network being a third neural network that follows the second neural network in the pipeline,

the mask attention network, said another attention network, and the feed-forward neural network also being implemented by the computer-readable instructions provided by the computer-readable storage medium,

the method having a resource-efficiency that depends on a number of machine-trained values that are used to generate the mask data structure.
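
The pipeline recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the mask value is a sigmoid of a single stored scalar looked up by clipped separation and by level, whereas the claim only requires that the mask value be based "in part" on that scalar. It also assumes the mask modifies the attention weights multiplicatively before normalization. All names (`build_mask`, `attention`, `feed_forward`, `transformer_block`, `distance_table`, `max_sep`) are hypothetical, the random numbers stand in for machine-trained values, and query/key/value projections, residual connections, and layer normalization are omitted for brevity.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def build_mask(n, level, distance_table, max_sep):
    """Build an n x n mask; entry [t][s] depends only on (t - s) and the level."""
    mask = []
    for t in range(n):
        row = []
        for s in range(n):
            # Separation between the pair, clipped so the table stays small.
            sep = max(-max_sep, min(max_sep, t - s))
            # Select the stored value for this level and separation (assumed layout).
            row.append(sigmoid(distance_table[level][sep + max_sep]))
        mask.append(row)
    return mask


def attention(x, mask=None):
    """Scaled dot-product attention with queries = keys = values = x."""
    d = len(x[0])
    out = []
    for t, q in enumerate(x):
        # Original attention scores: influence of every item on item t.
        weights = [math.exp(dot(q, k) / math.sqrt(d)) for k in x]
        if mask is not None:
            # Modified attention: scale each score by its mask value.
            weights = [w * m for w, m in zip(weights, mask[t])]
        total = sum(weights)
        probs = [w / total for w in weights]
        out.append([sum(p * v[j] for p, v in zip(probs, x)) for j in range(d)])
    return out


def feed_forward(x, w1, w2):
    """Position-wise two-layer network with a ReLU in between."""
    out = []
    for row in x:
        hidden = [max(0.0, dot(row, col)) for col in w1]
        out.append([dot(hidden, col) for col in w2])
    return out


def transformer_block(x, level, distance_table, max_sep, w1, w2):
    mask = build_mask(len(x), level, distance_table, max_sep)
    first = attention(x, mask)   # first network: mask attention
    second = attention(first)    # second network: attention without a mask
    return feed_forward(second, w1, w2)  # third network: feed-forward


rng = random.Random(0)
n, d, hidden, levels, max_sep = 5, 4, 8, 2, 3
# Stand-in for the stored set of distance-related values: one row per level.
distance_table = [[rng.uniform(-1, 1) for _ in range(2 * max_sep + 1)]
                  for _ in range(levels)]
x = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
w1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d)]

out = transformer_block(x, 0, distance_table, max_sep, w1, w2)
num_mask_params = levels * (2 * max_sep + 1)
```

In this sketch the mask is generated from `levels * (2 * max_sep + 1)` stored scalars regardless of sequence length, which is one way to read the claim's statement that resource-efficiency depends on the number of machine-trained values used to generate the mask.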