CPC G10L 15/16 (2013.01) [G06N 3/04 (2013.01); G06N 3/088 (2013.01); G10L 15/063 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01)] | 18 Claims |
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving audio data corresponding to a spoken utterance;
encoding, by an initial stack of multi-head attention layers, the audio data to compute shared activations;
while receiving the audio data corresponding to the spoken utterance:
encoding, by a final stack of multi-head attention layers while applying a first look ahead audio context, the shared activations to compute low latency activations; and
decoding the low latency activations into partial speech recognition results for the spoken utterance; and
after the audio data corresponding to the spoken utterance is received:
encoding, by the final stack of multi-head attention layers while applying a second look ahead audio context, the shared activations to compute high latency activations; and
decoding the high latency activations into a final speech recognition result for the spoken utterance,
wherein:
the initial stack of multi-head attention layers is trained with zero look ahead audio context; and
the final stack of multi-head attention layers is trained with variable look ahead audio context.
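
The following is a minimal Python sketch of the cascaded-encoder flow recited in claim 1, assuming a single-head self-attention function as a stand-in for each stack of multi-head attention layers, illustrative look-ahead values, and a placeholder decoder. All names (shared_encoder, final_encoder, decode) and numeric parameters are hypothetical and are not taken from the patent.

    import numpy as np

    def self_attention(x, look_ahead):
        # Single-head self-attention over frames, allowed to peek `look_ahead`
        # future frames (0 = strictly causal). Stands in for a stack of
        # multi-head attention layers.
        t, d = x.shape
        scores = x @ x.T / np.sqrt(d)
        # Mask positions more than `look_ahead` frames into the future.
        mask = np.triu(np.ones((t, t), dtype=bool), k=look_ahead + 1)
        scores[mask] = -np.inf
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ x

    def shared_encoder(audio_frames):
        # Initial stack: zero look ahead audio context, so shared activations
        # can be computed in a streaming fashion and reused by both passes.
        return self_attention(audio_frames, look_ahead=0)

    def final_encoder(shared_activations, look_ahead):
        # Final stack: applied with a small look-ahead for low latency
        # activations and a larger look-ahead for high latency activations.
        return self_attention(shared_activations, look_ahead=look_ahead)

    def decode(activations, vocab_size=32):
        # Placeholder decoder: argmax over a fixed random projection.
        rng = np.random.default_rng(0)
        proj = rng.standard_normal((activations.shape[1], vocab_size))
        return (activations @ proj).argmax(axis=-1)

    # Illustrative usage on a synthetic utterance of 50 frames.
    rng = np.random.default_rng(1)
    utterance = rng.standard_normal((50, 16))

    partial_results = []
    for t in range(10, 51, 10):                       # while audio is arriving
        shared = shared_encoder(utterance[:t])
        low_latency = final_encoder(shared, look_ahead=2)   # first look ahead
        partial_results.append(decode(low_latency))         # partial results

    shared = shared_encoder(utterance)                # after the full utterance
    high_latency = final_encoder(shared, look_ahead=50)     # second look ahead
    final_result = decode(high_latency)

The sketch only illustrates the claimed control flow: the causal initial stack produces shared activations once per frame, and the final stack reuses them twice, first with a short look ahead for streaming partial results and then with a long look ahead for the final result.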