CPC G06N 3/088 (2013.01) [G06N 3/045 (2023.01); G06N 3/063 (2013.01)] | 20 Claims |
1. A computer-readable storage medium for storing computer-readable instructions, the computer-readable instructions, when executed by one or more hardware processors, performing a method by a particular transformer at a particular level in a stack of transformers of a transformer-based neural network, the method comprising:
receiving a sequence of data items provided by an application;
in a first neural network in a pipeline of neural networks of the particular transformer, processing the sequence of data items using a mask attention network to produce a first output result,
the mask attention network performing operations of:
generating an original attention data structure that expresses influence between pairs of data items in the sequence of data items;
dynamically generating a mask data structure that is a mask that contains mask values, a particular mask value of the mask values being produced by determining a separation between a particular pair of data items in the sequence of data items at the particular level, selecting and retrieving a distance-related machine-trained value based on the separation between the particular pair of data items that has been determined, and determining the particular mask value based, in part, on the machine-trained value,
the machine-trained value that is selected and retrieved being independent of meanings of the particular pair of data items,
the distance-related machine-trained value being selected and retrieved from a stored set of distance-related machine-trained values associated with different respective separations between pairs of data items and for different levels,
a training system having previously produced the stored set of distance-related machine-trained values for the different respective separations by iteratively operating on a set of training examples to achieve a training objective;
producing a modified attention data structure by modifying the original attention data structure by the mask values provided by the mask data structure;
processing the first output result using another attention network that does not use a mask data structure and the stored set of distance-related machine-trained values associated with different separations, to provide a second output result, said another attention network being a second neural network that follows the first neural network in the pipeline; and
processing the second output result by a feed-forward neural network to produce a third output result, the feed-forward neural network being a third neural network that follows the second neural network in the pipeline,
the mask attention network, said another attention network, and the feed-forward neural network also being implemented by the computer-readable instructions provided by the computer-readable storage medium,
the method having a resource-efficiency that depends on a number of machine-trained values that are used to generate the mask data structure.
|
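The mask-generation steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that the separation is the signed index difference clipped to a window, that the stored machine-trained values live in a table indexed by level and separation, that the mask value is a sigmoid of the retrieved value, and that the mask modifies the attention weights by element-wise multiplication followed by row renormalization. All names (`distance_table`, `max_sep`, `build_mask`, `block`) are hypothetical, and the feed-forward stage is a ReLU placeholder rather than a trained network.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def original_attention(queries, keys):
    """Scaled dot-product weights expressing influence between pairs of items."""
    d = len(queries[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in keys]
              for qi in queries]
    return [softmax(row) for row in scores]

def build_mask(n, level, distance_table, max_sep):
    """Mask value for (i, j) depends only on the separation j - i and the level,
    never on the content of the items (assumed form: sigmoid of a stored value)."""
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            sep = max(-max_sep, min(max_sep, j - i))        # clipped separation
            trained = distance_table[level][sep + max_sep]  # select and retrieve
            row.append(1.0 / (1.0 + math.exp(-trained)))    # squash into (0, 1)
        mask.append(row)
    return mask

def masked_attention(attn, mask):
    """Modify attention weights by the mask values, then renormalize each row."""
    out = []
    for a_row, m_row in zip(attn, mask):
        prod = [a * m for a, m in zip(a_row, m_row)]
        total = sum(prod)
        out.append([p / total for p in prod])
    return out

def apply(weights, values):
    """Weighted sum of value vectors for each query position."""
    width = len(values[0])
    return [[sum(w * v[c] for w, v in zip(w_row, values)) for c in range(width)]
            for w_row in weights]

def block(items, level, distance_table, max_sep):
    """Three-stage pipeline: mask attention, plain attention, feed-forward."""
    attn = original_attention(items, items)
    mask = build_mask(len(items), level, distance_table, max_sep)
    first = apply(masked_attention(attn, mask), items)       # uses the mask
    second = apply(original_attention(first, first), first)  # no mask, no table
    third = [[max(0.0, x) for x in row] for row in second]   # placeholder FFN
    return third

# One level, separations -2..2: five stored values regardless of sequence length.
distance_table = [[-2.0, -1.0, 0.0, -1.0, -2.0]]
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = block(items, level=0, distance_table=distance_table, max_sep=2)
```

Under these assumptions the table holds (number of levels) × (2 × `max_sep` + 1) values, so the cost of producing the mask is governed by the number of stored values rather than by the content or length of the sequence.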