CPC G10L 15/25 (2013.01) [G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/49 (2022.01); G06V 40/20 (2022.01); G10L 25/30 (2013.01)]

16 Claims

1. An electronic apparatus, comprising:
circuitry configured to:
receive a video that comprises at least one human speaker;
extract a plurality of frames from the video;
generate a prediction corresponding to lip movements of the at least one human speaker in the plurality of frames, wherein
the prediction includes probability values corresponding to a set of class labels for each frame of the plurality of frames,
the set of class labels corresponds to a plurality of word characters and at least one blank space,
the each frame of the plurality of frames corresponds to at least one word character of the plurality of word characters or the at least one blank space,
the prediction is generated without audio information,
the prediction is generated based on a first application of a Deep Neural Network (DNN) on the video, and
the DNN is trained using a connectionist temporal classification (CTC) loss function;
detect, based on the prediction of the at least one blank space, at least one word boundary in a sequence of characters that correspond to the lip movements;
divide the video into a sequence of video clips based on the detected at least one word boundary, wherein
each video clip of the sequence of video clips is without the audio information, and
the each video clip of the sequence of video clips corresponds to a word spoken by the at least one human speaker;
generate a sequence of word predictions based on the sequence of video clips, wherein each word prediction of the sequence of word predictions is generated based on a second application of the DNN on a corresponding video clip of the sequence of video clips; and
generate one of a sentence or a phrase based on the generated sequence of word predictions.
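Claim 1 recites a two-pass, audio-free pipeline: a CTC-trained DNN first labels every frame with a probability distribution over word characters and a blank-space character, frames whose most probable label is the blank space are taken as word boundaries, the video is divided into per-word clips at those boundaries, and the DNN is applied a second time per clip to predict each word, which are then assembled into a sentence or phrase. The NumPy sketch below illustrates that flow under stated assumptions: the label set (a-z plus a space label, plus CTC's separate non-emission blank), greedy argmax decoding, and a synthetic one-hot probability sequence standing in for the DNN's per-frame output are all illustrative choices, not the implementation disclosed in the patent.

```python
# Illustrative sketch only: synthetic one-hot "DNN outputs" stand in for the
# CTC-trained lip-reading network of claim 1. The label set and greedy
# decoding are assumptions, not the patent's disclosed implementation.
import numpy as np

CHARS = list("abcdefghijklmnopqrstuvwxyz")  # the plurality of word characters
SPACE = len(CHARS)        # the "blank space" class label (word delimiter)
CTC_BLANK = SPACE + 1     # CTC's non-emission symbol, distinct from SPACE
NUM_CLASSES = CTC_BLANK + 1

def one_hot_probs(label_seq):
    """Synthetic per-frame probabilities (shape [T, NUM_CLASSES]) standing
    in for the first application of the DNN on the video frames."""
    probs = np.full((len(label_seq), NUM_CLASSES), 1e-4)
    probs[np.arange(len(label_seq)), label_seq] = 1.0
    return probs

def detect_word_boundaries(probs):
    """Frame indices whose most probable label is the blank-space
    character, i.e. the detected word boundaries."""
    labels = probs.argmax(axis=1)
    return [t for t, lab in enumerate(labels) if lab == SPACE]

def split_at_boundaries(per_frame, boundaries):
    """Divide a frame-aligned sequence into per-word segments, dropping
    the boundary frames themselves."""
    segments, start = [], 0
    for b in list(boundaries) + [len(per_frame)]:
        if b > start:
            segments.append(per_frame[start:b])
        start = b + 1
    return segments

def greedy_ctc_decode(probs):
    """Standard greedy CTC decoding: argmax per frame, collapse repeated
    labels, drop CTC blanks."""
    out, prev = [], CTC_BLANK
    for lab in probs.argmax(axis=1):
        if lab != prev and lab not in (CTC_BLANK, SPACE):
            out.append(CHARS[lab])
        prev = lab
    return "".join(out)

if __name__ == "__main__":
    c = {ch: i for i, ch in enumerate(CHARS)}
    # Scripted frame labels for "hello world": a repeated frame and a CTC
    # blank separating the double "l", with a SPACE frame between words.
    frame_labels = [c["h"], c["h"], c["e"], c["l"], CTC_BLANK, c["l"], c["o"],
                    SPACE,
                    c["w"], c["o"], c["r"], c["l"], c["d"]]
    probs = one_hot_probs(frame_labels)                  # first DNN application
    clips = split_at_boundaries(probs, detect_word_boundaries(probs))
    words = [greedy_ctc_decode(clip) for clip in clips]  # second application
    print(" ".join(words))                               # -> "hello world"
```

For brevity the sketch splits the first pass's per-frame outputs; in the claimed apparatus the second application of the DNN operates on the silent video clips themselves, with each clip spanning the frames between detected blank-space boundaries.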