US 12,080,084 B2
Scene text detection method and system based on sequential deformation
Liangrui Peng, Beijing (CN); Shanyu Xiao, Beijing (CN); Ruijie Yan, Beijing (CN); Gang Yao, Beijing (CN); Shengjin Wang, Beijing (CN); Jaesik Min, Gyeonggi (KR); and Jong Ub Suk, Seoul (KR)
Assigned to TSINGHUA UNIVERSITY, Beijing (CN); HYUNDAI MOTOR COMPANY, Seoul (KR); and KIA CORPORATION, Seoul (KR)
Filed by Tsinghua University, Beijing (CN); HYUNDAI MOTOR COMPANY, Seoul (KR); and Kia Corporation, Seoul (KR)
Filed on Aug. 20, 2021, as Appl. No. 17/407,549.
Claims priority of application No. 202010853196.4 (CN), filed on Aug. 22, 2020.
Prior Publication US 2022/0058420 A1, Feb. 24, 2022
Int. Cl. G06V 20/62 (2022.01); G06F 18/214 (2023.01); G06N 3/04 (2023.01); G06N 3/084 (2023.01); G06V 10/22 (2022.01); G06V 10/40 (2022.01); G06V 30/10 (2022.01)
CPC G06V 20/63 (2022.01) [G06F 18/214 (2023.01); G06N 3/04 (2013.01); G06N 3/084 (2013.01); G06V 10/225 (2022.01); G06V 10/40 (2022.01); G06V 30/10 (2022.01)]
18 Claims
 
1. A method for detecting a scene text based on a feature extraction module, a sequential deformation module, an auxiliary character counting network, and an object detection baseline network, the method comprising:
extracting, by the feature extraction module, a first feature map for a scene image input based on a convolutional neural network, and delivering the first feature map to a sequential deformation module;
obtaining, by the sequential deformation module, sampled feature maps corresponding to sampling positions by performing iterative sampling through predicting an offset for each pixel of the first feature map, obtaining a second feature map by performing a concatenation operation in deep learning according to a channel dimension for the first feature map and the sampled feature maps obtained by the iterative sampling, and delivering the second feature map to an auxiliary character counting network;
obtaining, by the sequential deformation module, a third feature map by performing a feature aggregation operation for the second feature map in the channel dimension, and delivering the third feature map to the object detection baseline network; and
performing, by the object detection baseline network, text area candidate box extraction for the third feature map and obtaining a text area prediction result as a scene text detection result through regression fitting.
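The data flow recited in claim 1 (iterative sampling, concatenation along the channel dimension, then aggregation in the channel dimension) can be illustrated with a minimal sketch. It is not the patented implementation. The names `sequential_deformation`, `aggregate_channels`, and `predict_offset` are hypothetical, and the sketch assumes integer offsets, nearest-neighbor sampling with border clamping, a fixed offset rule in place of a learned predictor, and a per-pixel weighted channel sum (in the manner of a 1x1 convolution) as the aggregation. The feature extraction network, the auxiliary character counting network, and the detection baseline are omitted.

```python
from typing import Callable, List, Sequence, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]
Offset = Tuple[int, int]              # (dy, dx); integer offsets are an assumption


def clamp(value: int, low: int, high: int) -> int:
    """Keep a coordinate inside the feature map."""
    return min(max(value, low), high)


def sequential_deformation(
    first: FeatureMap,
    predict_offset: Callable[[List[float]], Offset],  # hypothetical stand-in for a learned predictor
    steps: int,
) -> FeatureMap:
    """Sample `first` iteratively and concatenate all maps along the channel axis."""
    channels, height, width = len(first), len(first[0]), len(first[0][0])
    # Every pixel starts sampling at its own location.
    positions = [[(y, x) for x in range(width)] for y in range(height)]
    maps = [first]
    for _ in range(steps):
        sampled = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
        for y in range(height):
            for x in range(width):
                py, px = positions[y][x]
                # Predict an offset from the features at the current sampling position.
                dy, dx = predict_offset([first[c][py][px] for c in range(channels)])
                ny = clamp(py + dy, 0, height - 1)
                nx = clamp(px + dx, 0, width - 1)
                positions[y][x] = (ny, nx)  # offsets accumulate across iterations
                for c in range(channels):
                    sampled[c][y][x] = first[c][ny][nx]
        maps.append(sampled)
    # Concatenation along the channel dimension: (steps + 1) * channels output channels.
    return [channel for fmap in maps for channel in fmap]


def aggregate_channels(second: FeatureMap, weights: Sequence[Sequence[float]]) -> FeatureMap:
    """Mix channels per pixel; `weights` is indexed as [out_channel][in_channel]."""
    height, width = len(second[0]), len(second[0][0])
    return [
        [[sum(w * second[k][y][x] for k, w in enumerate(row)) for x in range(width)]
         for y in range(height)]
        for row in weights
    ]


# Toy input: one channel, one row, four columns.
first = [[[1.0, 2.0, 3.0, 4.0]]]
# Fixed rule: always step one pixel to the right.
second = sequential_deformation(first, lambda features: (0, 1), steps=2)
# One output channel that sums the three concatenated channels.
third = aggregate_channels(second, [[1.0, 1.0, 1.0]])
print(second)  # [[[1.0, 2.0, 3.0, 4.0]], [[2.0, 3.0, 4.0, 4.0]], [[3.0, 4.0, 4.0, 4.0]]]
print(third)   # [[[6.0, 9.0, 11.0, 12.0]]]
```

In this toy run each pixel walks rightward along the row, so the concatenated map holds the original channel plus two progressively shifted copies. In the claimed method, the concatenated map would go to the auxiliary character counting network and the aggregated map to the object detection baseline network.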