CPC G10L 15/25 (2013.01) [G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/49 (2022.01); G06V 40/20 (2022.01); G10L 25/30 (2013.01)]

16 Claims

1. An electronic apparatus, comprising:
circuitry configured to:
receive a video that comprises at least one human speaker;
extract a plurality of frames from the video;
generate a prediction corresponding to lip movements of the at least one human speaker in the plurality of frames, wherein
the prediction includes probability values corresponding to a set of class labels for each frame of the plurality of frames,
the set of class labels corresponds to a plurality of word characters and at least one blank space,
the each frame of the plurality of frames corresponds to at least one word character of the plurality of word characters or the at least one blank space,
the prediction is generated without audio information,
the prediction is generated based on a first application of a Deep Neural Network (DNN) on the video, and
the DNN is trained using a connectionist temporal classification (CTC) loss function;
detect, based on the prediction of the at least one blank space, at least one word boundary in a sequence of characters that correspond to the lip movements;
divide the video into a sequence of video clips based on the detected at least one word boundary, wherein
each video clip of the sequence of video clips is without the audio information, and
the each video clip of the sequence of video clips corresponds to a word spoken by the at least one human speaker;
generate a sequence of word predictions based on the sequence of video clips, wherein each word prediction of the sequence of word predictions is generated based on a second application of the DNN on a corresponding video clip of the sequence of video clips; and
generate one of a sentence or a phrase based on the generated sequence of word predictions.
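Claim 1 recites a two-pass, audio-free pipeline: a CTC-trained DNN first labels every frame with a probability distribution over word characters and a blank-space character, frames whose most probable label is the blank space are taken as word boundaries, the video is divided into per-word clips at those boundaries, and the DNN is applied a second time per clip to predict each word, which are then assembled into a sentence or phrase. The NumPy sketch below illustrates that flow under stated assumptions: the label set (a-z plus a space label, plus CTC's separate non-emission blank), greedy argmax decoding, and a synthetic one-hot probability sequence standing in for the DNN's per-frame output are all illustrative choices, not the implementation disclosed in the patent.

```python
# Illustrative sketch only: synthetic one-hot "DNN outputs" stand in for the
# CTC-trained lip-reading network of claim 1. The label set and greedy
# decoding are assumptions, not the patent's disclosed implementation.
import numpy as np

CHARS = list("abcdefghijklmnopqrstuvwxyz")  # the plurality of word characters
SPACE = len(CHARS)        # the "blank space" class label (word delimiter)
CTC_BLANK = SPACE + 1     # CTC's non-emission symbol, distinct from SPACE
NUM_CLASSES = CTC_BLANK + 1

def one_hot_probs(label_seq):
    """Synthetic per-frame probabilities (shape [T, NUM_CLASSES]) standing
    in for the first application of the DNN on the video frames."""
    probs = np.full((len(label_seq), NUM_CLASSES), 1e-4)
    probs[np.arange(len(label_seq)), label_seq] = 1.0
    return probs

def detect_word_boundaries(probs):
    """Frame indices whose most probable label is the blank-space
    character, i.e. the detected word boundaries."""
    labels = probs.argmax(axis=1)
    return [t for t, lab in enumerate(labels) if lab == SPACE]

def split_at_boundaries(per_frame, boundaries):
    """Divide a frame-aligned sequence into per-word segments, dropping
    the boundary frames themselves."""
    segments, start = [], 0
    for b in list(boundaries) + [len(per_frame)]:
        if b > start:
            segments.append(per_frame[start:b])
        start = b + 1
    return segments

def greedy_ctc_decode(probs):
    """Standard greedy CTC decoding: argmax per frame, collapse repeated
    labels, drop CTC blanks."""
    out, prev = [], CTC_BLANK
    for lab in probs.argmax(axis=1):
        if lab != prev and lab not in (CTC_BLANK, SPACE):
            out.append(CHARS[lab])
        prev = lab
    return "".join(out)

if __name__ == "__main__":
    c = {ch: i for i, ch in enumerate(CHARS)}
    # Scripted frame labels for "hello world": a repeated frame and a CTC
    # blank separating the double "l", with a SPACE frame between words.
    frame_labels = [c["h"], c["h"], c["e"], c["l"], CTC_BLANK, c["l"], c["o"],
                    SPACE,
                    c["w"], c["o"], c["r"], c["l"], c["d"]]
    probs = one_hot_probs(frame_labels)                  # first DNN application
    clips = split_at_boundaries(probs, detect_word_boundaries(probs))
    words = [greedy_ctc_decode(clip) for clip in clips]  # second application
    print(" ".join(words))                               # -> "hello world"
```

For brevity the sketch splits the first pass's per-frame outputs; in the claimed apparatus the second application of the DNN operates on the silent video clips themselves, with each clip spanning the frames between detected blank-space boundaries.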