US 12,339,903 B2
Video retrieval method and apparatus
Kaixiang Ji, Zhejiang (CN); Liguo Feng, Zhejiang (CN); Jian Wang, Zhejiang (CN); Jingdong Chen, Zhejiang (CN); Jiajia Liu, Zhejiang (CN); Siyu Sun, Zhejiang (CN); Weixiang Hong, Zhejiang (CN); Qiqi Hu, Zhejiang (CN); Zhi Qiao, Zhejiang (CN); and Xiaoying Zeng, Zhejiang (CN)
Assigned to Alipay (Hangzhou) Information Technology Co., Ltd., Hangzhou (CN)
Filed by Alipay (Hangzhou) Information Technology Co., Ltd., Zhejiang (CN)
Filed on May 26, 2023, as Appl. No. 18/324,823.
Claims priority of application No. 202210592045.7 (CN), filed on May 27, 2022.
Prior Publication US 2023/0385336 A1, Nov. 30, 2023
Int. Cl. G06F 16/35 (2025.01); G06F 16/75 (2019.01); G06V 10/762 (2022.01); G06V 10/80 (2022.01); G06V 20/40 (2022.01); G06V 20/62 (2022.01)
CPC G06F 16/75 (2019.01) [G06V 10/762 (2022.01); G06V 10/806 (2022.01); G06V 20/46 (2022.01); G06V 20/63 (2022.01)]
20 Claims
 
1. A method, comprising:
obtaining a video frame in a video to be matched;
automatically extracting, by a machine learning model, an image feature and a text feature from the video frame;
automatically fusing, by a machine learning model, the image feature and the text feature based on a same center variable that represents a cluster center to obtain a fused feature, the center variable configured to associate features of different modes of a same video; and
performing video retrieval in a video database based on the fused feature to determine a video in the video database that matches the video to be matched,
wherein the fusing the image feature and the text feature includes, for each one of the image feature or the text feature:
determining a distance between the one of the image feature or the text feature to the same center variable;
determining a weight of the one of the image feature or the text feature based on the distance;
generating an aligned feature of the one of the image feature or the text feature based on the one of the image feature or the text feature and the weight; and
generating the fused feature using the aligned feature.
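
The fusing steps recited in claim 1 (distance to a shared center variable, a distance-based weight, an aligned feature, and a fused feature) can be illustrated with a minimal sketch. The claim does not fix a distance metric, a weighting function, or a way to combine aligned features, so everything concrete here is an assumption made for illustration: Euclidean distance, a weight of exp(-distance), alignment by scaling the feature with its weight, fusion by concatenation, and retrieval by cosine similarity. The function names (`align`, `fuse`, `retrieve`) are hypothetical, and the image and text features are assumed to have already been extracted and projected to the same dimension as the center variable.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def align(feature, center):
    """Scale a feature by a weight derived from its distance to the center.

    Assumption: weight = exp(-distance), so features closer to the shared
    center variable contribute more. The claim only requires that the
    weight be based on the distance.
    """
    distance = euclidean(feature, center)
    weight = math.exp(-distance)
    return [weight * x for x in feature]


def fuse(image_feature, text_feature, center):
    """Align both modalities against the same center, then concatenate.

    Concatenation is an assumed way of "generating the fused feature
    using the aligned feature".
    """
    aligned_image = align(image_feature, center)
    aligned_text = align(text_feature, center)
    return aligned_image + aligned_text


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_fused, database):
    """Return the id of the database video whose fused feature is most similar.

    `database` is a hypothetical dict mapping video id -> fused feature.
    """
    return max(database, key=lambda vid: cosine(query_fused, database[vid]))


# Toy example with 2-dimensional features and a center at the origin.
center = [0.0, 0.0]
database = {
    "video_a": fuse([1.0, 0.0], [0.0, 1.0], center),
    "video_b": fuse([0.0, 1.0], [1.0, 0.0], center),
}
query = fuse([1.0, 0.0], [0.0, 1.0], center)
best_match = retrieve(query, database)
```

In this toy setup the query frame has the same image and text features as `video_a`, so `best_match` is `"video_a"`. A practical system would likely learn the center variable and use several cluster centers rather than one fixed vector; the sketch keeps a single center only to make the four claimed steps easy to follow.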