US 12,277,927 B2
End-to-end streaming speech translation with neural transducer
Jinyu Li, Bellevue, WA (US); Jian Xue, Bellevue, WA (US); Matthew John Post, Baltimore, MD (US); and Peidong Wang, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Mar. 15, 2022, as Appl. No. 17/695,218.
Prior Publication US 2023/0298566 A1, Sep. 21, 2023
Int. Cl. G06F 40/284 (2020.01); G06F 40/58 (2020.01); G10L 15/00 (2013.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01)
CPC G10L 15/063 (2013.01) [G06F 40/284 (2020.01); G06F 40/58 (2020.01); G10L 15/005 (2013.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method for implementing an end-to-end automatic speech translation (AST) model with a neural transducer, the method comprising:
accessing a training dataset comprising an audio dataset comprising spoken language utterances in a first language and a text dataset comprising transcription labels in a second language, the transcription labels corresponding to the spoken language utterances;
accessing an end-to-end AST model based on a neural transducer comprising at least an acoustic encoder which is configured to receive and encode audio data, and a prediction network which is integrated in a parallel model architecture with the acoustic encoder in the end-to-end AST model and configured to predict a subsequent language token based on a previous transcription label output;
applying the training dataset to the end-to-end AST model;
generating a transcription in the second language of input audio data in the first language based on the trained end-to-end AST model; and
causing the acoustic encoder to learn a plurality of temporal processing paths.
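The claimed architecture can be illustrated with a minimal sketch of a neural-transducer forward pass: an acoustic encoder consumes audio frames in the first language, a prediction network independently consumes previously emitted second-language tokens, and a joint step combines the two parallel branches into per-(frame, token) output logits. All names (`AcousticEncoder`, `PredictionNetwork`, `joint`) and the random projections below are illustrative assumptions, not the patent's implementation; a real system would use trained neural networks (e.g. LSTM or Transformer layers) for each component.

```python
# Hypothetical sketch of a neural-transducer AST forward pass (NumPy only).
# Random linear projections stand in for trained network layers.
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 80   # e.g. log-mel filterbank features per audio frame
ENC_DIM = 32    # acoustic-encoder output dimension
PRED_DIM = 32   # prediction-network output dimension (must match ENC_DIM here)
VOCAB = 100     # target-language (second-language) token vocabulary size


class AcousticEncoder:
    """Receives and encodes streaming audio frames (first language)."""

    def __init__(self):
        self.W = rng.standard_normal((FEAT_DIM, ENC_DIM)) * 0.1

    def __call__(self, frames):          # (T, FEAT_DIM) -> (T, ENC_DIM)
        return np.tanh(frames @ self.W)


class PredictionNetwork:
    """Predicts the subsequent token from previous label outputs only;
    it runs in parallel with the acoustic encoder and sees no audio."""

    def __init__(self):
        self.embed = rng.standard_normal((VOCAB, PRED_DIM)) * 0.1

    def __call__(self, prev_tokens):     # (U,) -> (U, PRED_DIM)
        return self.embed[prev_tokens]


def joint(enc, pred):
    """Combines the two parallel branches into (frame, token, vocab) logits."""
    W_out = rng.standard_normal((ENC_DIM, VOCAB)) * 0.1
    # Broadcast: (T, 1, ENC_DIM) + (1, U, PRED_DIM) -> (T, U, ENC_DIM)
    h = np.tanh(enc[:, None, :] + pred[None, :, :])
    return h @ W_out                     # (T, U, VOCAB)


audio = rng.standard_normal((50, FEAT_DIM))  # 50 frames of first-language audio
prev_labels = np.array([1, 5, 7])            # previously emitted target tokens

logits = joint(AcousticEncoder()(audio), PredictionNetwork()(prev_labels))
print(logits.shape)                          # (50, 3, 100)
```

During training, logits of this shape would feed a transducer (RNN-T) loss over all alignment paths, which is how the acoustic encoder can learn multiple temporal processing paths; at inference, a streaming decoder emits second-language tokens as frames arrive.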