CPC G06F 16/75 (2019.01) [G06V 10/762 (2022.01); G06V 10/806 (2022.01); G06V 20/46 (2022.01); G06V 20/63 (2022.01)] | 20 Claims |
1. A method, comprising:
obtaining a video frame in a video to be matched;
automatically extracting, by a machine learning model, an image feature and a text feature from the video frame;
automatically fusing, by a machine learning model, the image feature and the text feature based on a same center variable that represents a cluster center to obtain a fused feature, the center variable configured to associate features of different modes of a same video; and
performing video retrieval in a video database based on the fused feature to determine a video in the video database that matches the video to be matched,
wherein the fusing the image feature and the text feature includes, for each one of the image feature or the text feature:
determining a distance between the one of the image feature or the text feature to the same center variable;
determining a weight of the one of the image feature or the text feature based on the distance;
generating an aligned feature of the one of the image feature or the text feature based on the one of the image feature or the text feature and the weight; and
generating the fused feature using the aligned feature.
|
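The fusing and retrieval steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and the claim does not fix any of the concrete choices made here. The following are assumptions for illustration only: Euclidean distance to the center variable, a weight of exp(-scale * distance), an aligned feature formed by scaling the feature by its weight, fusion by element-wise sum of equal-length features, and retrieval by cosine similarity. All function and parameter names are hypothetical.

```python
import math


def distance(feature, center):
    """Euclidean distance from a feature to the shared center variable (assumed metric)."""
    return math.sqrt(sum((f - c) ** 2 for f, c in zip(feature, center)))


def weight_from_distance(d, scale=1.0):
    """Assumed weighting: features closer to the center get a larger weight."""
    return math.exp(-scale * d)


def align(feature, center, scale=1.0):
    """Aligned feature from the feature and its weight (here: simple scaling)."""
    w = weight_from_distance(distance(feature, center), scale)
    return [w * f for f in feature]


def fuse(image_feature, text_feature, center, scale=1.0):
    """Align each modality against the same center, then fuse (here: element-wise sum)."""
    aligned_image = align(image_feature, center, scale)
    aligned_text = align(text_feature, center, scale)
    return [a + b for a, b in zip(aligned_image, aligned_text)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def retrieve(fused_feature, database):
    """Return the key of the database entry most similar to the fused feature.

    `database` is a hypothetical dict mapping a video id to its stored fused feature.
    """
    return max(database, key=lambda vid: cosine(fused_feature, database[vid]))


# Toy usage with made-up 2-D features and a center at the origin.
center = [0.0, 0.0]
fused = fuse([1.0, 0.0], [0.0, 1.0], center)
match = retrieve(fused, {"video_a": [1.0, 1.0], "video_b": [1.0, -1.0]})
```

In this toy setup both features lie at the same distance from the center, so they receive equal weights and contribute equally to the fused feature. A learned model would typically produce the features, the center variable, and possibly the weighting itself.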