US 12,277,767 B2
Multimodal unsupervised video temporal segmentation for summarization
Hailin Jin, San Jose, CA (US); Jielin Qiu, Pittsburgh, PA (US); Zhaowen Wang, San Jose, CA (US); Trung Huu Bui, San Jose, CA (US); and Franck Dernoncourt, San Jose, CA (US)
Assigned to ADOBE INC., San Jose, CA (US)
Filed by ADOBE INC., San Jose, CA (US)
Filed on May 31, 2022, as Appl. No. 17/804,656.
Prior Publication US 2023/0386208 A1, Nov. 30, 2023
Int. Cl. G06V 20/00 (2022.01); G06F 16/34 (2019.01); G06F 16/683 (2019.01); G06V 10/774 (2022.01); G06V 20/40 (2022.01)
CPC G06V 20/47 (2022.01) [G06F 16/345 (2019.01); G06F 16/685 (2019.01); G06V 10/774 (2022.01); G06V 20/49 (2022.01)]
18 Claims
 
1. A method comprising:
receiving a video and a transcript of the video;
generating initial visual features representing frames of the video using an image encoder by performing a convolution process on the frames of the video;
generating initial language features representing the transcript using a text encoder;
performing a first non-linear transformation on the initial language features based on a cross-correlation between the initial language features and the initial visual features using a language feature transformer to obtain correlated language features;
performing a second non-linear transformation on the initial visual features based on a cross-correlation between the initial visual features and the initial language features using a visual feature transformer different from the language feature transformer to obtain correlated visual features; and
segmenting the video into a plurality of video segments based on the correlated visual features and the correlated language features.
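
The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation and rests on several assumptions: the image and text encoders are replaced by precomputed feature arrays with one row per time step, the two learned feature transformers are replaced by an SVD of the cross-correlation matrix followed by tanh, and segmentation is a simple cosine-similarity threshold between neighbouring time steps. The names standardize, correlated_features, segment, k, and threshold are hypothetical and do not appear in the patent.

```python
import numpy as np


def standardize(x):
    """Zero-mean, unit-variance columns (epsilon avoids division by zero)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)


def correlated_features(visual, language, k=2):
    """Return (correlated_visual, correlated_language), each of shape (n, k).

    Assumption: the cross-correlation matrix is factored with an SVD and each
    modality gets its own projection followed by tanh. This stands in for the
    two distinct learned transformers; it is not the patented models.
    """
    vis = standardize(visual)
    lang = standardize(language)
    cross_corr = vis.T @ lang / len(vis)       # (d_visual, d_language)
    u, _, wt = np.linalg.svd(cross_corr, full_matrices=False)
    corr_visual = np.tanh(vis @ u[:, :k])      # visual transformer stand-in
    corr_language = np.tanh(lang @ wt[:k].T)   # language transformer stand-in
    return corr_visual, corr_language


def segment(corr_visual, corr_language, threshold=0.0):
    """Split where neighbouring fused features fall below a cosine threshold.

    Assumption: thresholding is an illustrative boundary rule only.
    Returns a list of half-open (start, end) index pairs.
    """
    fused = np.concatenate([corr_visual, corr_language], axis=1)
    unit = fused / (np.linalg.norm(fused, axis=1, keepdims=True) + 1e-8)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)  # cosine of adjacent rows
    bounds = [0] + [i + 1 for i, s in enumerate(sims) if s < threshold]
    bounds.append(len(fused))
    return list(zip(bounds[:-1], bounds[1:]))
```

Calling correlated_features on two arrays with the same number of rows, then passing both results to segment, yields index ranges that play the role of the "plurality of video segments". In practice the stand-in projections would be replaced by trained networks and the threshold rule by a proper boundary detector.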