| CPC G10L 15/24 (2013.01) [G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/083 (2013.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01); G10L 15/25 (2013.01); G10L 25/57 (2013.01); G10L 15/30 (2013.01)] | 21 Claims |

|
1. A cascaded audiovisual automated speech recognition (AV-ASR) model comprising:
an audio encoder configured to:
receive, as input, a sequence of acoustic frames; and
generate, at each of a plurality of output steps, a corresponding acoustic higher-order feature representation for a corresponding acoustic frame in the sequence of acoustic frames;
an audiovisual encoder configured to:
receive, as input, a sequence of video frames; and
for each corresponding acoustic frame in the sequence of acoustic frames paired with a corresponding one of the video frames in the sequence of video frames:
receive, as input, the corresponding acoustic higher-order feature representation for the corresponding acoustic frame generated by the audio encoder; and
generate a corresponding audiovisual higher-order feature representation for the corresponding acoustic higher-order feature frame and the corresponding one of the video frames in the sequence of video frames; and
a decoder configured to:
for each corresponding acoustic frame in the sequence of acoustic frames paired with the corresponding one of the video frames in the sequence of video frames, receive, as input, the corresponding audiovisual higher-order feature representation;
for each corresponding acoustic frame in the sequence of acoustic frames that is not paired with any video frame in the sequence of video frames, receive, as input, the corresponding acoustic higher-order feature representation; and
generate, at each of the plurality of output steps, a probability distribution over possible speech recognition hypotheses.
|
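The routing recited in claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation. All function names, the toy arithmetic inside each component, and the convention that a `None` entry marks an acoustic frame with no paired video frame are assumptions made for the example. Only the data flow follows the claim: the audio encoder runs on every acoustic frame, the audiovisual encoder runs only on paired frames, and the decoder receives whichever representation is available.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Turn arbitrary scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_audio_encoder(acoustic_frame: Vector) -> Vector:
    # Hypothetical stand-in for the audio encoder: one acoustic frame in,
    # one acoustic higher-order feature representation out.
    return [2.0 * x for x in acoustic_frame]


def toy_audiovisual_encoder(acoustic_repr: Vector, video_frame: Vector) -> Vector:
    # Hypothetical stand-in for the audiovisual encoder: combines the
    # audio encoder's output with the paired video frame.
    return [a + v for a, v in zip(acoustic_repr, video_frame)]


def toy_decoder(features: Vector) -> Vector:
    # Hypothetical stand-in for the decoder: a distribution over a tiny
    # set of hypotheses, one per feature dimension.
    return softmax(features)


def run_cascade(
    acoustic_frames: Sequence[Vector],
    video_frames: Sequence[Optional[Vector]],
) -> List[Tuple[str, Vector]]:
    """Return (path taken, decoder distribution) for each output step.

    Assumption: video_frames is aligned index-by-index with
    acoustic_frames, and None means "no paired video frame".
    """
    results = []
    for acoustic_frame, video_frame in zip(acoustic_frames, video_frames):
        acoustic_repr = toy_audio_encoder(acoustic_frame)
        if video_frame is not None:
            # Paired frame: the decoder receives the audiovisual representation.
            decoder_input = toy_audiovisual_encoder(acoustic_repr, video_frame)
            path = "audiovisual"
        else:
            # Unpaired frame: the decoder receives the acoustic representation.
            decoder_input = acoustic_repr
            path = "audio_only"
        results.append((path, toy_decoder(decoder_input)))
    return results


acoustic = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
video = [[1.0, 0.0, 0.0], None, [0.0, 0.0, 1.0]]  # second frame has no video
results = run_cascade(acoustic, video)
for step, (path, distribution) in enumerate(results):
    print(step, path, [round(p, 3) for p in distribution])
```

In this sketch the second acoustic frame has no paired video frame, so its decoder input is the audio encoder's output unchanged, while the first and third frames pass through both encoders in cascade.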