US 12,462,589 B2
Text line detection
Lei Sun, Beijing (CN); Qiang Huo, Beijing (CN); Chixiang Ma, Beijing (CN); and Zhuoyao Zhong, Beijing (CN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 17/783,250
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Jan. 17, 2020, PCT No. PCT/CN2020/072700 § 371(c)(1), (2) Date Jun. 7, 2022, PCT Pub. No. WO2021/142765, PCT Pub. Date Jul. 22, 2021.
Prior Publication US 2023/0036812 A1, Feb. 2, 2023
Int. Cl. G06V 20/62 (2022.01); G06F 40/30 (2020.01); G06V 30/148 (2022.01); G06V 30/18 (2022.01); G06V 30/262 (2022.01)

CPC G06V 20/63 (2022.01) [G06F 40/30 (2020.01); G06V 30/153 (2022.01); G06V 30/18133 (2022.01); G06V 30/274 (2022.01)]

20 Claims

1. A computer-implemented method, comprising:

determining, using a scene text detection model trained on a plurality of training images comprising text of multiple orientations, fonts, and scales, a first text region and a second text region in an image, the first text region comprising a first portion of at least a first text element and the second text region comprising a second portion of at least a second text element;

extracting, using a convolutional neural network configured to process multiple color spaces, a first feature representation from the first text region and a second feature representation from the second text region, the first and second feature representations comprising at least one of an image feature representation or a semantic feature representation;

determining, using a semantic information extraction model, a semantic relationship between the first text region and the second text region by performing semantic information extraction on content of the first text element and the second text element using the semantic information extraction model to obtain semantic features representing textual meaning of the first and second text elements;

extracting a third feature representation using the semantic relationship by performing multimodal stitching and fusion on the semantic features of the first and second text elements;

determining, based at least in part on an evaluation of the first feature representation, the second feature representation, and the third feature representation using a relation prediction machine learning model, a link relationship between the first and second text regions, the link relationship indicating whether the first portion of the first text element and the second portion of the second text element are located in a same text line, wherein the relation prediction model is trained using training samples comprising pairs of text regions with known linking relationships;

according to the link relationship indicating that at least a portion of the first text element in the first text region and at least a portion of the second text element in the second text region are located in the same text line, determining a first text line region that at least defines the first text region and the second text region in the image; and

outputting, via a display device, the first text line region.
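
The final steps of claim 1 (deciding pairwise link relationships and then determining a text line region that bounds the linked text regions) can be illustrated with a minimal sketch. This is not the patented implementation: the function names `group_text_lines` and `toy_link`, the axis-aligned box format, and the 20-pixel gap threshold are all assumptions introduced here for illustration. In particular, `toy_link` is a hypothetical geometric stand-in for the trained relation prediction model, which in the claim evaluates image and semantic feature representations rather than box geometry alone.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def group_text_lines(boxes: List[Box],
                     is_linked: Callable[[Box, Box], bool]) -> List[Box]:
    """Merge text regions whose pairwise link relationship is positive.

    Linked regions are grouped with a union-find structure, and each group
    is reduced to the smallest axis-aligned box enclosing its members.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent pointers to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Evaluate the link relationship for every pair of text regions.
    for i, j in combinations(range(len(boxes)), 2):
        if is_linked(boxes[i], boxes[j]):
            parent[find(i)] = find(j)

    # Collect the members of each group.
    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    # One enclosing text line region per group.
    lines = [
        (min(b[0] for b in members), min(b[1] for b in members),
         max(b[2] for b in members), max(b[3] for b in members))
        for members in groups.values()
    ]
    return sorted(lines)


def toy_link(a: Box, b: Box) -> bool:
    """Hypothetical stand-in for the relation prediction model.

    Links two regions when they overlap vertically and the horizontal gap
    between them is under 20 pixels (an arbitrary illustrative threshold).
    """
    vertical_overlap = min(a[3], b[3]) - max(a[1], b[1])
    horizontal_gap = max(a[0], b[0]) - min(a[2], b[2])
    return vertical_overlap > 0 and horizontal_gap < 20


regions = [(0, 0, 10, 10), (15, 1, 30, 11), (0, 50, 12, 60)]
text_lines = group_text_lines(regions, toy_link)
print(text_lines)
```

In this example the first two regions sit side by side and are merged into one text line region, while the third region, lower in the image, remains its own line. Replacing `toy_link` with a learned pairwise classifier would leave the grouping step unchanged.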