US 12,456,033 B2
Multi-stream recurrent neural network transducer(s)
Khe Chai Sim, Dublin, CA (US); and Françoise Beaufays, Mountain View, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Appl. No. 17/619,643
Filed by GOOGLE LLC, Mountain View, CA (US)
PCT Filed Dec. 15, 2020, PCT No. PCT/US2020/065065 § 371(c)(1), (2) Date Dec. 16, 2021, PCT Pub. No. WO2021/162779, PCT Pub. Date Aug. 19, 2021.
Claims priority of provisional application 62/976,315, filed on Feb. 13, 2020.
Prior Publication US 2022/0405549 A1, Dec. 22, 2022
Int. Cl. G06N 3/044 (2023.01); G06N 3/08 (2023.01)

CPC G06N 3/044 (2023.01) [G06N 3/08 (2013.01)]

21 Claims

1. A method implemented by one or more processors, the method comprising:

jointly generating a first output stream sequence and a second output stream sequence, using a multi-stream recurrent neural network transducer (MS RNN-T), wherein the MS RNN-T comprises an input stream encoder, a first output stream encoder, a second output stream encoder, and a joint network, wherein jointly generating the first output stream sequence and the second output stream sequence, using the MS RNN-T comprises:

initializing an input stream sequence using an initial segment in a sequence of segments, wherein the input stream sequence is based on user interface input of at least one user of a computing device;

initializing the first output stream sequence as empty;

initializing the second output stream sequence as empty;

for each of the segments, in the sequence, and until one or more conditions are satisfied:

generating an encoded representation of the input stream sequence by processing the input stream sequence using the input stream encoder;

generating an encoded representation of the first output stream sequence by processing the first output stream sequence using the first output stream encoder;

generating an encoded representation of the second output stream sequence by processing the second output stream sequence using the second output stream encoder;

generating predicted output by processing (1) the encoded representation of the input stream sequence, (2) the encoded representation of the first output stream sequence, and (3) the encoded representation of the second output stream sequence, using the joint network;

determining whether the predicted output corresponds to the first output stream sequence or the second output stream sequence;

if the predicted output corresponds to the first output stream sequence, updating the first output stream sequence based on the predicted output;

if the predicted output corresponds to the second output stream sequence, updating the second output stream sequence based on the predicted output; and

updating the input stream sequence based on the next segment in the sequence of the segments;

generating a response to the user interface input based on the first output stream and/or the second output stream; and

causing the computing device to render the response to the at least one user.
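
The control flow recited in claim 1 can be sketched in Python as below. This is only an illustrative sketch under stated assumptions, not the patented implementation: every name (`ms_rnnt_decode`, `toy_encoder`, `toy_joint`, `max_outputs`) is hypothetical, the encoders and joint network are trivial stand-ins for trained neural networks, and the `max_outputs` cap is just one possible example of the "one or more conditions" the claim leaves open.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical types: the claim does not define any concrete API.
Encoding = Tuple[str, ...]
# (stream_id, token): stream_id 1 or 2 selects the output stream to update;
# None means the predicted output belongs to neither stream at this step.
Prediction = Tuple[Optional[int], Optional[str]]


def ms_rnnt_decode(
    segments: Sequence[str],
    input_encoder: Callable[[List[str]], Encoding],
    first_encoder: Callable[[List[str]], Encoding],
    second_encoder: Callable[[List[str]], Encoding],
    joint_network: Callable[[Encoding, Encoding, Encoding], Prediction],
    max_outputs: int = 1000,  # assumed stopping condition, not from the claim
) -> Tuple[List[str], List[str]]:
    """Jointly generate two output stream sequences from one input stream."""
    if not segments:
        return [], []

    input_stream: List[str] = [segments[0]]  # initialized with the initial segment
    first_stream: List[str] = []             # initialized as empty
    second_stream: List[str] = []            # initialized as empty

    for i in range(len(segments)):
        if len(first_stream) + len(second_stream) >= max_outputs:
            break

        # Encode the input stream and both output streams.
        enc_in = input_encoder(input_stream)
        enc_1 = first_encoder(first_stream)
        enc_2 = second_encoder(second_stream)

        # The joint network consumes all three encodings.
        stream_id, token = joint_network(enc_in, enc_1, enc_2)

        # Route the predicted output to the stream it corresponds to.
        if stream_id == 1 and token is not None:
            first_stream.append(token)
        elif stream_id == 2 and token is not None:
            second_stream.append(token)

        # Update the input stream with the next segment, if there is one.
        if i + 1 < len(segments):
            input_stream.append(segments[i + 1])

    return first_stream, second_stream


def toy_encoder(sequence: List[str]) -> Encoding:
    """Stand-in encoder: returns the sequence unchanged as a tuple."""
    return tuple(sequence)


def toy_joint(enc_in: Encoding, enc_1: Encoding, enc_2: Encoding) -> Prediction:
    """Stand-in joint network: digits go to stream 2, other text to stream 1."""
    latest = enc_in[-1]
    if latest.isdigit():
        return 2, latest
    return 1, latest.upper()


if __name__ == "__main__":
    words, numbers = ms_rnnt_decode(
        ["call", "555", "now", "0199"],
        toy_encoder, toy_encoder, toy_encoder, toy_joint,
    )
    # A response could then be built from either or both streams.
    print(" ".join(words), "|", " ".join(numbers))
```

In this sketch the two output streams are kept as separate lists, so each output stream encoder sees only its own history while the joint network sees all three encodings, mirroring the separation between the first and second output stream encoders in the claim.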