US 11,941,872 B2
Progressive localization method for text-to-video clip localization
Xun Wang, Hangzhou (CN); Jianfeng Dong, Hangzhou (CN); Qi Zheng, Hangzhou (CN); and Jingwei Peng, Hangzhou (CN)
Assigned to ZHEJIANG GONGSHANG UNIVERSITY, Hangzhou (CN)
Filed by ZHEJIANG GONGSHANG UNIVERSITY, Zhejiang (CN)
Filed on Apr. 19, 2023, as Appl. No. 18/303,534.
Application 18/303,534 is a continuation of application No. PCT/CN2020/127657, filed on Nov. 10, 2020.
Claims priority of application No. 202011164289.2 (CN), filed on Oct. 27, 2020.
Prior Publication US 2023/0260267 A1, Aug. 17, 2023
Int. Cl. G06V 10/80 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01)
CPC G06V 10/806 (2022.01) [G06V 10/82 (2022.01); G06V 20/46 (2022.01)] 10 Claims
OG exemplary drawing
 
1. A progressive localization method for text-to-video clip localization, comprising:
step 1: extracting a video feature and a text feature, respectively, by using different feature extraction methods;
step 2: coarse-time-granularity localization: sampling the video feature obtained in step 1 with a first step length to generate a candidate clip;
step 3: fusing the candidate clip obtained in step 2 with the text feature obtained in step 1;
step 4: feeding the fused feature to a convolutional neural network to obtain a coarse-grained feature map, and then obtaining a correlation score map via a fully-connected (FC) layer;
step 5: fine-time-granularity localization: sampling the video feature obtained in step 1 with a second step length, updating the sampled features by a conditional feature update module combined with the feature map obtained in step 4, and then generating a candidate clip, wherein the first step length is greater than the second step length;
step 6: fusing the candidate clip obtained in step 5 with the text feature obtained in step 1, and further fusing the candidate clip with the feature map obtained in step 4 through an up-sampling connection;
step 7: feeding the fused features to the convolutional neural network to obtain a fine-grained feature map, and then obtaining a correlation score map via an FC layer;
step 8: calculating loss values for the correlation score maps obtained in step 4 and step 7, respectively, by using binary cross-entropy loss, combining the two loss values with a certain weight, and finally training a model in an end-to-end manner; and
step 9: realizing text-based video clip localization by using the model trained in step 8.
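The coarse-to-fine localization pipeline of steps 2 through 7 can be sketched as below. This is a minimal NumPy illustration, not the patented implementation: mean-pooling as the clip sampler, element-wise product as the clip-text fusion operator, and a tanh-gated nearest-neighbour up-sampling as the conditional feature update are all assumed stand-ins, since the claim does not fix these operators. The shapes (64 frames, 16-dimensional features, step lengths 8 and 4) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_clips(video_feat, step):
    """Pool consecutive frame features with the given step length to
    form candidate-clip features (steps 2 and 5 of the claim)."""
    T, _ = video_feat.shape
    starts = np.arange(0, T - step + 1, step)
    return np.stack([video_feat[s:s + step].mean(axis=0) for s in starts])

def fuse(clip_feats, text_feat):
    """Fuse each candidate clip with the sentence feature; an
    element-wise product is one common choice (steps 3 and 6)."""
    return clip_feats * text_feat[None, :]

T, d = 64, 16
video_feat = rng.standard_normal((T, d))   # step 1: extracted video feature
text_feat = rng.standard_normal(d)         # step 1: extracted text feature

# First (larger) step length -> coarse candidates; second -> fine candidates.
coarse = fuse(sample_clips(video_feat, step=8), text_feat)  # shape (8, 16)
fine = fuse(sample_clips(video_feat, step=4), text_feat)    # shape (16, 16)

# Conditional feature update (step 5): modulate the fine features with the
# up-sampled coarse map -- a hand-written stand-in for the learned module.
upsampled = np.repeat(coarse, 2, axis=0)[: fine.shape[0]]
fine = fine * (1.0 + np.tanh(upsampled))

print(coarse.shape, fine.shape)
```

In the full method, `coarse` and `fine` would each pass through a convolutional network and an FC layer to yield the correlation score maps of steps 4 and 7; here only the candidate generation, fusion, and update flow is shown.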
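Step 8's training objective combines two binary cross-entropy losses, one per granularity. A hedged sketch, assuming sigmoid-activated score maps flattened to vectors, random stand-in labels, and a hypothetical weight `alpha = 0.4` (the claim says only "a certain weight"):

```python
import numpy as np

def bce(scores, labels, eps=1e-7):
    """Binary cross-entropy over a correlation score map, where scores
    are already sigmoid-activated and lie in (0, 1)."""
    p = np.clip(scores, eps, 1 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

rng = np.random.default_rng(1)

# Stand-in correlation scores and ground-truth overlap labels for the
# coarse (8 candidates) and fine (16 candidates) score maps.
coarse_scores = rng.uniform(0.01, 0.99, size=8)
fine_scores = rng.uniform(0.01, 0.99, size=16)
coarse_labels = (rng.uniform(size=8) > 0.5).astype(float)
fine_labels = (rng.uniform(size=16) > 0.5).astype(float)

alpha = 0.4  # hypothetical weighting between the two granularities
total_loss = (alpha * bce(coarse_scores, coarse_labels)
              + (1 - alpha) * bce(fine_scores, fine_labels))
print(total_loss)
```

Minimizing `total_loss` end-to-end trains both branches jointly, so the coarse branch learns score maps that also serve as a useful conditioning signal for the fine branch.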