US 11,928,600 B2
Sequence-to-sequence prediction using a neural network model
Nitish Shirish Keskar, San Bruno, CA (US); Karim Ahmed, Lebanon, NH (US); and Richard Socher, Menlo Park, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on Jan. 30, 2018, as Appl. No. 15/884,125.
Claims priority of provisional application 62/578,374, filed on Oct. 27, 2017.
Prior Publication US 2019/0130273 A1, May 2, 2019
Int. Cl. G06N 3/084 (2023.01); G06F 40/47 (2020.01); G06F 40/58 (2020.01)

CPC G06N 3/084 (2013.01) [G06F 40/47 (2020.01); G06F 40/58 (2020.01)]

20 Claims

12. A system for sequence-to-sequence prediction comprising:

a memory storing a plurality of processor-executable instructions; and

a processor reading and executing the processor-executable instructions from the memory to perform operations comprising:

an encoder stage that generates, by an encoder, an encoded representation based on an input sequence, wherein the encoder includes a set of branched attention layers arranged sequentially, each branched attention layer includes a plurality of branches arranged in parallel, and each branch includes a respective attention sublayer and a respective scaling sublayer, wherein the generating includes generating, at each branched attention encoder layer of the set of branched attention encoder layers, a respective layer encoded representation for a corresponding branched attention decoder layer in a decoder by:

determining, at the respective attention sublayer of a respective branch, a respective learned scaling parameter depending on one or more other learned scaling parameters from one or more other branches;

applying, at the respective scaling sublayer of the respective branch, a respective attention to the input sequence in parallel to other branches, based on scaling a respective intermediate representation of the respective branch by the respective learned scaling parameter;

an aggregation node that aggregates a plurality of branch output representations generated by each of the plurality of branches to generate the respective layer encoded representation; and

a decoder stage that predicts, by a decoder comprising a set of branched attention decoder layers, an output sequence based on a set of respective layer encoded representations sequentially received from the set of branched attention encoder layers.
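
The encoder arrangement recited in claim 12 can be illustrated with a minimal sketch, which is not the patented implementation. It rests on several assumptions that are not in the claim: each branch uses single-head scaled dot-product self-attention, the dependence of each scaling parameter on the others is modeled by softmax-normalizing a set of raw values so that they sum to 1, the aggregation node is a plain sum, and the weights are random rather than trained. All names (`BranchedAttentionLayer`, `scale_logits`, `encode`) are hypothetical. The decoder stage is omitted.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Multiply an (n x k) matrix by a (k x m) matrix, both as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng):
    # Random stand-in for parameters that would normally be learned.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def attention(x, wq, wk, wv):
    # Assumed attention sublayer: scaled dot-product self-attention.
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # q times k transposed
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

class BranchedAttentionLayer:
    """Hypothetical layer: parallel branches, each attention then scaling."""

    def __init__(self, d_model, n_branches, rng):
        self.branches = [
            tuple(rand_matrix(d_model, d_model, rng) for _ in range(3))
            for _ in range(n_branches)
        ]
        # One raw scaling value per branch (assumption: learned in training).
        self.scale_logits = [rng.uniform(-1.0, 1.0) for _ in range(n_branches)]

    def scales(self):
        # Assumption: normalizing across branches makes each scaling
        # parameter depend on the parameters of the other branches.
        return softmax(self.scale_logits)

    def __call__(self, x):
        out = [[0.0] * len(x[0]) for _ in x]
        for (wq, wk, wv), kappa in zip(self.branches, self.scales()):
            intermediate = attention(x, wq, wk, wv)  # branch intermediate representation
            for i, row in enumerate(intermediate):
                for j, value in enumerate(row):
                    out[i][j] += kappa * value  # scale, then aggregate by summation
        return out

def encode(layers, x):
    # Run the layers sequentially and keep every layer's encoded
    # representation, one per corresponding decoder layer.
    layer_reps = []
    for layer in layers:
        x = layer(x)
        layer_reps.append(x)
    return layer_reps

rng = random.Random(0)
layers = [BranchedAttentionLayer(d_model=4, n_branches=3, rng=rng) for _ in range(2)]
inputs = rand_matrix(5, 4, rng)  # 5 positions, 4 features each
layer_reps = encode(layers, inputs)
```

Here `layer_reps` holds one representation per encoder layer, each with the same shape as `inputs`. A decoder built along the lines of the claim would consume these representations layer by layer.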