US 12,394,417 B2
Cascaded audiovisual automatic speech recognition models
Oscar Chang, New York, NY (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Feb. 2, 2023, as Appl. No. 18/163,836.
Prior Publication US 2024/0265917 A1, Aug. 8, 2024
Int. Cl. G10L 15/24 (2013.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/08 (2006.01); G10L 15/16 (2006.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01); G10L 15/25 (2013.01); G10L 25/57 (2013.01); G10L 15/30 (2013.01)
CPC G10L 15/24 (2013.01) [G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/083 (2013.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01); G10L 15/25 (2013.01); G10L 25/57 (2013.01); G10L 15/30 (2013.01)]
21 Claims
 
1. A cascaded audiovisual automated speech recognition (AV-ASR) model comprising:
an audio encoder configured to:
receive, as input, a sequence of acoustic frames; and
generate, at each of a plurality of output steps, a corresponding acoustic higher-order feature representation for a corresponding acoustic frame in the sequence of acoustic frames;
an audiovisual encoder configured to:
receive, as input, a sequence of video frames; and
for each corresponding acoustic frame in the sequence of acoustic frames paired with a corresponding one of the video frames in the sequence of video frames:
receive, as input, the corresponding acoustic higher-order feature representation for the corresponding acoustic frame generated by the audio encoder; and
generate a corresponding audiovisual higher-order feature representation for the corresponding acoustic higher-order feature frame and the corresponding one of the video frames in the sequence of video frames; and
a decoder configured to:
for each corresponding acoustic frame in the sequence of acoustic frames paired with the corresponding one of the video frames in the sequence of video frames, receive, as input, the corresponding audiovisual higher-order feature representation;
for each corresponding acoustic frame in the sequence of acoustic frames that is not paired with any video frame in the sequence of video frames, receive, as input, the corresponding acoustic higher-order feature representation; and
generate, at each of the plurality of output steps, a probability distribution over possible speech recognition hypotheses.
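
The routing recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (audio_encoder, audiovisual_encoder, decoder, cascaded_av_asr), the toy arithmetic inside each component, the use of None to mark an acoustic frame with no paired video frame, and the assumption that all vectors share one dimension are hypothetical choices made only for illustration. What the sketch shows is the cascade itself: every acoustic frame passes through the audio encoder, paired frames additionally pass through the audiovisual encoder, and the decoder receives whichever representation is available at each output step.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def audio_encoder(acoustic_frames: Sequence[Vector]) -> List[Vector]:
    """Toy stand-in: one acoustic higher-order feature per acoustic frame."""
    # Assumption: a scaled copy stands in for a learned encoder.
    return [[2.0 * x for x in frame] for frame in acoustic_frames]


def audiovisual_encoder(acoustic_feature: Vector, video_frame: Vector) -> Vector:
    """Toy stand-in: fuse an acoustic feature with its paired video frame."""
    # Assumption: element-wise sum stands in for a learned fusion network.
    return [a + v for a, v in zip(acoustic_feature, video_frame)]


def decoder(feature: Vector) -> Vector:
    """Toy stand-in: softmax over the feature, read as a distribution
    over possible speech recognition hypotheses."""
    peak = max(feature)
    exps = [math.exp(x - peak) for x in feature]
    total = sum(exps)
    return [e / total for e in exps]


def cascaded_av_asr(
    acoustic_frames: Sequence[Vector],
    video_frames: Sequence[Optional[Vector]],
) -> Tuple[List[Vector], List[str]]:
    """Run the cascade; video_frames[i] is None when acoustic frame i
    has no paired video frame."""
    acoustic_features = audio_encoder(acoustic_frames)
    distributions: List[Vector] = []
    routes: List[str] = []
    for acoustic_feature, video_frame in zip(acoustic_features, video_frames):
        if video_frame is not None:
            # Paired frame: the decoder receives the audiovisual feature.
            decoder_input = audiovisual_encoder(acoustic_feature, video_frame)
            routes.append("audiovisual")
        else:
            # Unpaired frame: the decoder receives the acoustic feature.
            decoder_input = acoustic_feature
            routes.append("audio")
        distributions.append(decoder(decoder_input))
    return distributions, routes


if __name__ == "__main__":
    acoustic = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
    video = [[1.0, 0.0, 0.0], None, [0.0, 0.0, 1.0]]
    dists, routes = cascaded_av_asr(acoustic, video)
    for step, (route, dist) in enumerate(zip(routes, dists)):
        print(step, route, [round(p, 3) for p in dist])
```

In this sketch the second acoustic frame has no video counterpart, so its acoustic higher-order feature goes to the decoder directly, while the first and third frames are routed through the audiovisual encoder first.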