US 11,657,799 B2
Pre-training with alignments for recurrent neural network transducer based end-to-end speech recognition
Rui Zhao, Bellevue, WA (US); Jinyu Li, Redmond, WA (US); Liang Lu, Redmond, WA (US); Yifan Gong, Sammamish, WA (US); and Hu Hu, Atlanta, GA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Apr. 3, 2020, as Appl. No. 16/840,311.
Prior Publication US 2021/0312905 A1, Oct. 7, 2021
Int. Cl. G10L 15/22 (2006.01); G10L 15/26 (2006.01); G10L 15/16 (2006.01); G10L 15/06 (2013.01); G06N 3/04 (2023.01); G06N 3/08 (2023.01)
CPC G10L 15/063 (2013.01) [G06N 3/0445 (2013.01); G06N 3/08 (2013.01)]
17 Claims
 
1. A data processing system comprising:
a processor; and
a computer-readable medium storing executable instructions for causing the processor to perform operations of:
receiving an audio input comprising spoken content;
analyzing the audio input using a Recurrent Neural Network-Transducer (RNN-T) to obtain a first textual output representing the spoken content, the RNN-T being pretrained using whole-network pretraining, wherein the whole-network pretraining pretrains the RNN-T as a whole using a cross-entropy (CE) criterion by pretraining an encoder of the RNN-T using a two-dimensional label matrix for each utterance included in training data to train an encoder of the RNN-T and by pretraining a prediction network of the RNN-T using a three-dimensional label matrix derived from the two-dimensional label matrix, wherein the CE criterion represents a divergence between expected outputs and reference outputs of a model; and
processing the first textual output with an application on the data processing system.
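
The label matrices and CE criterion recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the claim does not specify how the three-dimensional matrix is laid out, so the layout below (one one-hot target per frame, placed at the column equal to the number of tokens already emitted) is an assumption made for illustration only. The blank id `BLANK`, the function names, and the toy alignment are likewise hypothetical.

```python
import math

BLANK = 0  # hypothetical id of the blank symbol (assumption)


def one_hot(index, size):
    """Return a one-hot row of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def label_matrix_2d(alignment, vocab_size):
    """Build a T x V label matrix: one one-hot row per aligned frame."""
    return [one_hot(token, vocab_size) for token in alignment]


def label_matrix_3d(matrix_2d, num_tokens):
    """Derive a T x (U+1) x V label tensor from the 2D matrix.

    Assumed layout: the row for frame t is copied to column u, where u is
    the number of non-blank labels seen in earlier frames. All other cells
    stay all-zero and are skipped by the loss.
    """
    vocab_size = len(matrix_2d[0])
    tensor, emitted = [], 0
    for row in matrix_2d:
        plane = [[0.0] * vocab_size for _ in range(num_tokens + 1)]
        plane[emitted] = list(row)
        tensor.append(plane)
        if row[BLANK] == 0.0:  # a non-blank label advances the token index
            emitted += 1
    return tensor


def cross_entropy(target_rows, prob_rows):
    """Mean of -sum(target * log(prob)) over rows that carry a target."""
    total, count = 0.0, 0
    for target, prob in zip(target_rows, prob_rows):
        if sum(target) == 0.0:
            continue  # cell without a label: ignored
        total -= sum(t * math.log(p) for t, p in zip(target, prob) if t > 0.0)
        count += 1
    return total / count


# Toy utterance: 4 frames, vocabulary {blank, 1, 2}, token sequence [2, 1].
alignment = [BLANK, 2, BLANK, 1]
labels_2d = label_matrix_2d(alignment, vocab_size=3)
labels_3d = label_matrix_3d(labels_2d, num_tokens=2)
```

In this sketch, `labels_2d` would serve as the per-frame targets for the encoder output and `labels_3d` as the targets for the full T x (U+1) output grid. For a model that predicts a uniform distribution over the three symbols, `cross_entropy(labels_2d, ...)` evaluates to ln 3, the expected value for an uninformed model.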