US 12,424,010 B2
Character recognition model training method and apparatus, character recognition method and apparatus, device and storage medium
Pengyuan Lv, Beijing (CN); Chengquan Zhang, Beijing (CN); Shanshan Liu, Beijing (CN); Meina Qiao, Beijing (CN); Yangliu Xu, Beijing (CN); Liang Wu, Beijing (CN); Xiaoyan Wang, Beijing (CN); Kun Yao, Beijing (CN); Junyu Han, Beijing (CN); Errui Ding, Beijing (CN); Jingdong Wang, Beijing (CN); Tian Wu, Beijing (CN); and Haifeng Wang, Beijing (CN)
Assigned to BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed by BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed on Feb. 14, 2023, as Appl. No. 18/168,759.
Claims priority of application No. 202210983230.9 (CN), filed on Aug. 16, 2022.
Prior Publication US 2023/0215203 A1, Jul. 6, 2023
Int. Cl. G06V 30/19 (2022.01); G06F 7/76 (2006.01); G06F 16/24 (2019.01); G06F 16/33 (2025.01); G06F 16/43 (2019.01); G06F 16/53 (2019.01); G06F 16/583 (2019.01); G06F 16/73 (2019.01); G06F 16/83 (2019.01); G06F 18/21 (2023.01); G06F 18/213 (2023.01); G06F 18/214 (2023.01); G06F 18/25 (2023.01); G06F 40/258 (2020.01); G06F 40/279 (2020.01); G06N 3/045 (2023.01); G06N 3/0455 (2023.01); G06N 3/0895 (2023.01); G06N 3/09 (2023.01); G06V 10/26 (2022.01); G06V 10/40 (2022.01); G06V 10/42 (2022.01); G06V 10/44 (2022.01); G06V 10/62 (2022.01); G06V 10/70 (2022.01); G06V 10/77 (2022.01); G06V 10/774 (2022.01); G06V 10/778 (2022.01); G06V 10/80 (2022.01); G06V 20/40 (2022.01); G06V 20/69 (2022.01); G06V 20/70 (2022.01); G06V 30/00 (2022.01); G06V 30/10 (2022.01); G06V 30/148 (2022.01); G06V 30/18 (2022.01); G06V 30/24 (2022.01); G06V 30/242 (2022.01); G06V 30/244 (2022.01); G06V 30/32 (2022.01); G06V 40/12 (2022.01); G06V 40/16 (2022.01); G06V 40/18 (2022.01); G06V 40/30 (2022.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01)
CPC G06V 30/19147 (2022.01) [G06F 7/764 (2013.01); G06F 16/24 (2019.01); G06F 16/33 (2019.01); G06F 16/43 (2019.01); G06F 16/53 (2019.01); G06F 16/5846 (2019.01); G06F 16/73 (2019.01); G06F 16/83 (2019.01); G06F 18/21 (2023.01); G06F 18/213 (2023.01); G06F 18/2155 (2023.01); G06F 18/253 (2023.01); G06F 40/258 (2020.01); G06F 40/279 (2020.01); G06N 3/045 (2023.01); G06N 3/0455 (2023.01); G06N 3/0895 (2023.01); G06N 3/09 (2023.01); G06V 10/26 (2022.01); G06V 10/40 (2022.01); G06V 10/42 (2022.01); G06V 10/44 (2022.01); G06V 10/62 (2022.01); G06V 10/70 (2022.01); G06V 10/7715 (2022.01); G06V 10/7753 (2022.01); G06V 10/7784 (2022.01); G06V 10/7788 (2022.01); G06V 10/7792 (2022.01); G06V 10/80 (2022.01); G06V 10/806 (2022.01); G06V 20/46 (2022.01); G06V 20/695 (2022.01); G06V 20/70 (2022.01); G06V 30/00 (2022.01); G06V 30/10 (2022.01); G06V 30/148 (2022.01); G06V 30/15 (2022.01); G06V 30/18 (2022.01); G06V 30/18143 (2022.01); G06V 30/18152 (2022.01); G06V 30/19127 (2022.01); G06V 30/19167 (2022.01); G06V 30/24 (2022.01); G06V 30/242 (2022.01); G06V 30/245 (2022.01); G06V 30/333 (2022.01); G06V 40/1347 (2022.01); G06V 40/1353 (2022.01); G06V 40/1359 (2022.01); G06V 40/168 (2022.01); G06V 40/193 (2022.01); G06V 40/382 (2022.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G06F 18/2163 (2023.01); G06F 18/2178 (2023.01); G06F 2218/08 (2023.01); G06T 2207/20021 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20112 (2013.01); G06V 2201/09 (2022.01)] 17 Claims
OG exemplary drawing
 
1. A character recognition method, applied to a server and comprising:
partitioning an untagged training sample into at least two sub-sample images;
dividing the at least two sub-sample images into a first training set and a second training set, wherein the first training set comprises a first sub-sample image with a visible attribute, and the second training set comprises a second sub-sample image with an invisible attribute; and
performing self-supervised training on a to-be-trained encoder by taking the second training set as a tag of the first training set, to obtain a target encoder;
wherein the performing the self-supervised training on the to-be-trained encoder by taking the second training set as the tag of the first training set, to obtain the target encoder comprises:
initializing the to-be-trained encoder to obtain a first encoder;
extracting, based on the first encoder, a first visual feature of the first sub-sample image in the first training set and a second visual feature of the second sub-sample image in the second training set;
performing mask query calculation on the first visual feature, to obtain a third visual feature; and
updating the first encoder according to a feature error between the third visual feature and the second visual feature until the feature error satisfies a first error condition, and determining a latest updated first encoder as the target encoder;
wherein the updating the first encoder according to the feature error between the third visual feature and the second visual feature until the feature error satisfies the first error condition, and the determining the latest updated first encoder as the target encoder comprise:
initializing a to-be-trained decoder to obtain a first decoder;
determining, based on the first decoder, an image error generated when image reconstruction is performed on the third visual feature;
determining the feature error between the third visual feature and the second visual feature; and
updating the first encoder based on the feature error and the image error and updating the first decoder based on the image error until the feature error satisfies the first error condition and the image error satisfies a second error condition, and determining a latest obtained first encoder as the target encoder;
receiving a to-be-recognized image sent by a terminal device, and performing, based on the target encoder and the updated first decoder, image feature extraction on the to-be-recognized image to obtain a target text; and
sending the target text to the terminal device.
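The training procedure recited in the claim can be sketched in miniature: partition an untagged image into sub-sample patches, split them into a visible (first) set and an invisible (second) set, encode both, predict the invisible features from the visible ones via a mask-query calculation, and update the encoder and decoder on the feature error and image-reconstruction error. The sketch below is illustrative only, not the patented implementation: the mean-pool-plus-query step, the linear encoder/decoder, and the finite-difference gradient descent are all simplifying assumptions chosen to keep the example dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)

# Untagged training sample: an 8x8 "image", partitioned into four 4x4 sub-sample images.
image = rng.standard_normal((8, 8))
patches = np.stack([image[r:r + 4, c:c + 4].reshape(-1)
                    for r in (0, 4) for c in (0, 4)])        # (4, 16)

# First training set: visible sub-samples; second training set: invisible
# (masked) sub-samples, used as the self-supervision tag.
visible, masked = patches[[0, 2]], patches[[1, 3]]

D_FEAT = 8
n_enc, n_q = 16 * D_FEAT, D_FEAT

def unpack(theta):
    W_e = theta[:n_enc].reshape(16, D_FEAT)                  # first encoder
    q = theta[n_enc:n_enc + n_q]                             # learnable mask query
    W_d = theta[n_enc + n_q:].reshape(D_FEAT, 16)            # first decoder
    return W_e, q, W_d

def losses(theta):
    W_e, q, W_d = unpack(theta)
    first_feat = visible @ W_e                               # first visual feature
    second_feat = masked @ W_e                               # second visual feature
    # Mask-query calculation (toy stand-in): pool visible features, add the query.
    third_feat = first_feat.mean(axis=0) + q                 # third visual feature
    feat_err = np.mean((third_feat - second_feat) ** 2)      # feature error
    recon = third_feat @ W_d                                 # decoder reconstruction
    img_err = np.mean((recon - masked) ** 2)                 # image error
    return feat_err, img_err

theta = rng.standard_normal(n_enc + n_q + D_FEAT * 16) * 0.1
before = sum(losses(theta))

# Update the encoder on (feature error + image error) and the decoder on the
# image error until the errors shrink; finite-difference SGD is used here only
# to avoid an autodiff dependency.
eps, lr = 1e-4, 0.02
for _ in range(40):
    base = sum(losses(theta))
    grad = np.empty_like(theta)
    for i in range(theta.size):
        theta[i] += eps
        grad[i] = (sum(losses(theta)) - base) / eps
        theta[i] -= eps
    theta -= lr * grad

after = sum(losses(theta))
print(f"total loss before {before:.4f} -> after {after:.4f}")
```

In a realistic system the linear maps would be transformer blocks, the mask query a learned embedding attended over the visible tokens, and the two error conditions would be convergence thresholds on the feature and reconstruction losses; the data flow, however, matches the claim's steps.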