CPC G06V 20/47 (2022.01) [G06F 16/345 (2019.01); G06F 16/685 (2019.01); G06V 10/774 (2022.01); G06V 20/49 (2022.01)] | 18 Claims |
1. A method comprising:
receiving a video and a transcript of the video;
generating initial visual features representing frames of the video using an image encoder by performing a convolution process on the frames of the video;
generating initial language features representing the transcript using a text encoder;
performing a first non-linear transformation on the initial language features based on a cross-correlation between the initial language features and the initial visual features using a language feature transformer to obtain correlated language features;
performing a second non-linear transformation on the initial visual features based on a cross-correlation between the initial visual features and the initial language features using a visual feature transformer different from the language feature transformer to obtain correlated visual features; and
segmenting the video into a plurality of video segments based on the correlated visual features and the correlated language features.
|
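As an illustration only, the data flow recited in claim 1 can be sketched as follows. This is not the patented implementation. It assumes the initial visual and language features have already been produced by the image and text encoders, which are not modeled here. It stands in a softmax-weighted cross-correlation followed by a `tanh` for each non-linear transformation, and it segments by thresholding cosine similarity between consecutive fused frame features. All function names, the weight matrices `W_VISUAL` and `W_LANGUAGE`, the `frame_to_sentence` alignment, and the threshold value are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_transform(query, context, weight):
    """Hypothetical feature transformer: transforms `query` non-linearly,
    driven by its cross-correlation with `context`."""
    corr = matmul(query, transpose(context))   # cross-correlation, n_query x n_context
    attn = softmax_rows(corr)                  # normalize correlations per query row
    attended = matmul(attn, context)           # context summary for each query row
    projected = matmul(attended, weight)       # transformer-specific projection
    return [[math.tanh(q + p) for q, p in zip(q_row, p_row)]
            for q_row, p_row in zip(query, projected)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def segment(frame_features, threshold):
    """Start a new segment wherever consecutive frames are dissimilar.
    Returns (start, end) frame index pairs, end exclusive."""
    bounds = [0]
    for i in range(1, len(frame_features)):
        if cosine(frame_features[i - 1], frame_features[i]) < threshold:
            bounds.append(i)
    bounds.append(len(frame_features))
    return list(zip(bounds[:-1], bounds[1:]))

# Toy inputs (assumed): 4 frames and 2 transcript sentences, 2-D features.
initial_visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
initial_language = [[1.0, 0.0], [0.0, 1.0]]
frame_to_sentence = [0, 0, 1, 1]  # assumed alignment from transcript timing

# Two different transformers, modeled here as two different weight matrices.
W_LANGUAGE = [[0.5, 0.0], [0.0, 0.5]]
W_VISUAL = [[1.0, 0.0], [0.0, 1.0]]

correlated_language = cross_transform(initial_language, initial_visual, W_LANGUAGE)
correlated_visual = cross_transform(initial_visual, initial_language, W_VISUAL)

# Fuse both modalities per frame so segmentation depends on both feature sets.
fused = [v + correlated_language[s]
         for v, s in zip(correlated_visual, frame_to_sentence)]

segments = segment(fused, threshold=0.8)
print(segments)
```

In this toy setup the first two frames and the last two frames correlate with different sentences, so the sketch places a single boundary between them. A learned model would replace the fixed weights and the threshold rule.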