| CPC G10L 15/32 (2013.01) [G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 2015/223 (2013.01)] | 19 Claims |

|
1. An automated speech recognition (ASR) model comprising:
a first encoder configured to:
receive, as input, a sequence of acoustic frames; and
generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames;
a first decoder configured to:
receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses;
a second encoder configured to:
receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame; and
a second decoder configured to:
receive, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps; and
generate, at each of the plurality of output steps, a second probability distribution over possible speech recognition hypotheses,
wherein the first decoder and the second decoder each comprise a respective recurrent neural network-transducer (RNN-T) architecture having a same number of parameters.
|
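The data flow recited in claim 1 can be illustrated with a minimal sketch. This is not the claimed implementation: the `Encoder` and `Decoder` classes, the dimensions, and the `cascaded_asr` function are hypothetical names chosen for illustration. Each decoder is a simple projection-plus-softmax stand-in rather than a real RNN-T (it has no prediction or joint network). The sketch only shows how the first encoder's output feeds both the first decoder and the second encoder, and that the two decoders hold the same number of parameters.

```python
import math
import random


def make_matrix(rows, cols, rng):
    # Small random weights; a trained model would learn these.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class Encoder:
    """Hypothetical stand-in encoder: one tanh projection per input frame."""

    def __init__(self, in_dim, out_dim, rng):
        self.w = make_matrix(out_dim, in_dim, rng)

    def __call__(self, frames):
        # One higher order feature representation per output step.
        return [[math.tanh(x) for x in matvec(self.w, f)] for f in frames]


class Decoder:
    """Hypothetical stand-in for an RNN-T decoder (assumption: no
    prediction or joint network, only a projection and a softmax)."""

    def __init__(self, feat_dim, vocab_size, rng):
        self.w = make_matrix(vocab_size, feat_dim, rng)

    def num_params(self):
        return sum(len(row) for row in self.w)

    def __call__(self, features):
        # One probability distribution over hypotheses per output step.
        return [softmax(matvec(self.w, f)) for f in features]


def cascaded_asr(frames, enc1, dec1, enc2, dec2):
    h1 = enc1(frames)   # first higher order feature representations
    p1 = dec1(h1)       # first probability distributions
    h2 = enc2(h1)       # second encoder consumes the first encoder's output
    p2 = dec2(h2)       # second probability distributions
    return p1, p2


rng = random.Random(0)
FRAME_DIM, FEAT_DIM, VOCAB = 8, 6, 5  # illustrative sizes, not from the claim
enc1 = Encoder(FRAME_DIM, FEAT_DIM, rng)
enc2 = Encoder(FEAT_DIM, FEAT_DIM, rng)
dec1 = Decoder(FEAT_DIM, VOCAB, rng)
dec2 = Decoder(FEAT_DIM, VOCAB, rng)

frames = [[rng.random() for _ in range(FRAME_DIM)] for _ in range(4)]
first_dist, second_dist = cascaded_asr(frames, enc1, dec1, enc2, dec2)
```

In this sketch both decoders are constructed with identical shapes, so `dec1.num_params() == dec2.num_params()` holds by construction, mirroring the "same number of parameters" limitation in the final wherein clause.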