US 12,347,158 B2
Pre-training method, image and text retrieval method for a vision and scene text aggregation model, electronic device, and storage medium
Yipeng Sun, Beijing (CN); Mengjun Cheng, Beijing (CN); Longchao Wang, Beijing (CN); Xiongwei Zhu, Beijing (CN); Kun Yao, Beijing (CN); Junyu Han, Beijing (CN); Jingtuo Liu, Beijing (CN); Errui Ding, Beijing (CN); Jingdong Wang, Beijing (CN); and Haifeng Wang, Beijing (CN)
Assigned to Beijing Baidu Netcom Science Technology Co., Ltd. China, Beijing (CN)
Filed by Beijing Baidu Netcom Science Technology Co., Ltd., Beijing (CN)
Filed on Mar. 29, 2023, as Appl. No. 18/192,393.
Claims priority of application No. 202210590151.1 (CN), filed on May 26, 2022.
Prior Publication US 2023/0386168 A1, Nov. 30, 2023
Int. Cl. G06V 10/42 (2022.01); G06F 16/332 (2025.01); G06F 16/532 (2019.01); G06F 16/583 (2019.01); G06F 18/25 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01); G06V 10/774 (2022.01); G06V 10/80 (2022.01); G06V 10/82 (2022.01); G06F 40/30 (2020.01)

CPC G06V 10/42 (2022.01) [G06F 16/332 (2019.01); G06F 16/532 (2019.01); G06F 16/5846 (2019.01); G06F 18/253 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06V 10/774 (2022.01); G06V 10/806 (2022.01); G06V 10/82 (2022.01); G06F 40/30 (2020.01)]

19 Claims

1. A pre-training method for a Vision and Scene Text Aggregation model, wherein the Vision and Scene Text Aggregation model comprises a text encoding network and a visual scene encoding network, the visual scene encoding network comprises a visual encoding subnetwork and a scene encoding subnetwork, and the method comprises:

acquiring a sample image-text pair, wherein the sample image-text pair comprises a sample image and a sample text;

extracting a sample scene text from the sample image;

inputting the sample text into the text encoding network to obtain a sample text feature;

inputting the sample image and an initial sample aggregation feature into the visual encoding subnetwork and inputting the initial sample aggregation feature and the sample scene text into the scene encoding subnetwork to obtain a global image feature of the sample image and a learned sample aggregation feature; and

pre-training the Vision and Scene Text Aggregation model according to the sample text feature, the global image feature of the sample image, and the learned sample aggregation feature.
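
The data flow of claim 1 can be illustrated with a minimal sketch. Everything in it beyond the claimed steps is an assumption made for illustration, not the patented implementation: the toy vocabulary VOCAB, the bag-of-words embed function, the dictionary representation of an image (visual_tags, scene_text), the stand-in extract_scene_text (a real system would run OCR), the rule that averages the two subnetworks' aggregation outputs, and the InfoNCE-style contrastive loss. The claim itself does not state which loss is used.

```python
import math

# Hypothetical toy vocabulary; a real model would use learned embeddings.
VOCAB = ["cafe", "sign", "street", "dog", "park", "open"]

def embed(tokens):
    """Stand-in encoder: bag-of-words counts over VOCAB."""
    return [float(sum(1 for t in tokens if t == w)) for w in VOCAB]

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def extract_scene_text(image):
    """Stand-in for OCR: the toy image already carries its scene text."""
    return image["scene_text"]

def text_encoder(text):
    return embed(text.lower().split())

def visual_scene_encoder(image, scene_text, init_agg, num_layers=2):
    """Returns (global image feature, learned aggregation feature)."""
    img = embed(image["visual_tags"])
    scene = embed(scene_text.lower().split())
    agg = list(init_agg)
    for _ in range(num_layers):
        # Visual subnetwork: consumes the image state and the aggregation feature.
        new_img = [i + 0.5 * a for i, a in zip(img, agg)]
        agg_v = [0.5 * (i + a) for i, a in zip(img, agg)]
        # Scene subnetwork: consumes the aggregation feature and the scene text.
        agg_s = [0.5 * (s + a) for s, a in zip(scene, agg)]
        # Assumed fusion rule: average the two aggregation outputs.
        agg = [0.5 * (v + s) for v, s in zip(agg_v, agg_s)]
        img = new_img
    return img, agg

def contrastive_loss(text_feats, other_feats, temperature=0.1):
    """Assumed InfoNCE-style loss; matching pairs share a batch index."""
    total = 0.0
    for i, t in enumerate(text_feats):
        logits = [cosine(t, o) / temperature for o in other_feats]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(text_feats)

def pretraining_loss(batch):
    text_feats, image_feats, agg_feats = [], [], []
    for image, text in batch:
        scene_text = extract_scene_text(image)
        text_feats.append(text_encoder(text))
        init_agg = [0.0] * len(VOCAB)  # initial sample aggregation feature
        img, agg = visual_scene_encoder(image, scene_text, init_agg)
        image_feats.append(img)
        agg_feats.append(agg)
    # The loss depends on all three features named in the claim.
    return (contrastive_loss(text_feats, image_feats)
            + contrastive_loss(text_feats, agg_feats))

batch = [
    ({"visual_tags": ["street", "sign"], "scene_text": "cafe open"},
     "a street sign for an open cafe"),
    ({"visual_tags": ["dog", "park"], "scene_text": ""},
     "a dog in a park"),
]
loss = pretraining_loss(batch)
print(loss)
```

In this sketch the aggregation feature is the only channel through which scene text reaches the visual side, which is why it is passed to both subnetworks at every layer. An actual pre-training step would backpropagate the loss through learned encoders; the sketch only evaluates it.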

13. An image and text retrieval method for a Vision and Scene Text Aggregation model, wherein the Vision and Scene Text Aggregation model comprises a text encoding network and a visual scene encoding network, the visual scene encoding network comprises a visual encoding subnetwork and a scene encoding subnetwork, and the method comprises:

acquiring a target text for retrieval;

extracting a candidate scene text from a candidate image of candidate images;

inputting the target text into the text encoding network to obtain a target text feature;

inputting the candidate image and an initial candidate aggregation feature into the visual encoding subnetwork and inputting the initial candidate aggregation feature and the candidate scene text into the scene encoding subnetwork to obtain a global image feature of the candidate image; and

determining a target image from the candidate images according to the target text feature and the global image feature of the candidate image.
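
The retrieval steps of claim 13 can be sketched in the same spirit. The sketch below is an illustration under stated assumptions, not the patented implementation: VOCAB, embed, the dictionary image representation, the layer update rules inside encode_image, and the use of cosine similarity for ranking are all hypothetical choices. The claim only requires that the target image be determined from the target text feature and the global image features.

```python
import math

# Hypothetical toy vocabulary; a real model would use learned embeddings.
VOCAB = ["cafe", "sign", "street", "dog", "park", "open"]

def embed(tokens):
    """Stand-in encoder: bag-of-words counts over VOCAB."""
    return [float(sum(1 for t in tokens if t == w)) for w in VOCAB]

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def encode_image(image, scene_text, init_agg, num_layers=2):
    """Returns the global image feature of one candidate image."""
    img = embed(image["visual_tags"])
    scene = embed(scene_text.lower().split())
    agg = list(init_agg)
    for _ in range(num_layers):
        # Visual subnetwork: consumes the image state and the aggregation feature.
        new_img = [i + 0.5 * a for i, a in zip(img, agg)]
        agg_v = [0.5 * (i + a) for i, a in zip(img, agg)]
        # Scene subnetwork: consumes the aggregation feature and the scene text.
        agg_s = [0.5 * (s + a) for s, a in zip(scene, agg)]
        # Assumed fusion rule: average the two aggregation outputs.
        agg = [0.5 * (v + s) for v, s in zip(agg_v, agg_s)]
        img = new_img
    return img

def retrieve(target_text, candidate_images):
    """Ranks candidates by cosine similarity (an assumed scoring rule)."""
    target_feat = embed(target_text.lower().split())
    scores = []
    for image in candidate_images:
        scene_text = image["scene_text"]  # stand-in for OCR on the candidate
        init_agg = [0.0] * len(VOCAB)     # initial candidate aggregation feature
        feat = encode_image(image, scene_text, init_agg)
        scores.append(cosine(target_feat, feat))
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index, scores

candidates = [
    {"visual_tags": ["dog", "park"], "scene_text": ""},
    {"visual_tags": ["street", "sign"], "scene_text": "cafe open"},
]
best_index, scores = retrieve("cafe open", candidates)
print(best_index)  # 1: only the second candidate's scene text overlaps the query
```

The query shares no words with either candidate's visual tags, so the second candidate is selected only because its scene text reached the global image feature through the aggregation feature.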