US 12,067,351 B2
Systems and methods for extracting text from portable document format data
Stephan Schwiebert, Sydney (AU); Velislava Yanchina, Byron Bay (AU); and Henrry Eduardo Iguaro Jaramillo, Sydney (AU)
Assigned to CANVA PTY LTD, Surry Hills (AU)
Filed by Canva Pty Ltd, Surry Hills (AU)
Filed on Feb. 28, 2022, as Appl. No. 17/682,718.
Claims priority of application No. 2021201352 (AU), filed on Mar. 2, 2021.
Prior Publication US 2022/0284175 A1, Sep. 8, 2022
Int. Cl. G06F 17/00 (2019.01); G06F 40/166 (2020.01); G06F 40/205 (2020.01); G06F 40/279 (2020.01); G06V 30/10 (2022.01); G06F 40/10 (2020.01)
CPC G06F 40/166 (2020.01) [G06F 40/205 (2020.01); G06F 40/279 (2020.01); G06V 30/10 (2022.01); G06F 40/10 (2020.01)]
19 Claims
 
1. A computer implemented method including:
accessing, by a computer system including a processing unit, portable document format (PDF) text data, the PDF text data defining a plurality of glyphs by way of data specifying glyph attribute values;
sorting, by the processing unit, the plurality of glyphs into a plurality of glyph sets, wherein sorting the plurality of glyphs into the plurality of glyph sets includes:
associating a first subset of glyphs from the plurality of glyphs with a first glyph set, wherein the first glyph set has an associated first set of grouping attribute values and wherein the association of the first subset of glyphs with the first glyph set is based on the glyph attribute values of the first subset of glyphs corresponding to the first set of grouping attribute values; and
associating a second subset of glyphs from the plurality of glyphs with a second glyph set, wherein the second glyph set has an associated second set of grouping attribute values and wherein the association of the second subset of glyphs with the second glyph set is based on a determination by the processing unit that the glyph attribute values in the second subset of glyphs correspond to the second set of grouping attribute values, the second set of grouping attribute values being different to the first set of grouping attribute values;
calculating, for each glyph, an expanded bounding box, wherein a different expanded bounding box is calculated for each glyph by manipulating an initial bounding box of each glyph;
processing the first glyph set to determine one or more text areas, each text area being associated with one or more glyphs, and wherein processing the first glyph set to determine the one or more text areas includes:
identifying a first distinct group of glyphs from the first glyph set, the first distinct group of glyphs including one or more glyphs from the first glyph set which have collectively overlapping expanded bounding boxes;
associating each glyph in the first distinct group of glyphs with a first text area; and
identifying a second distinct group of glyphs from the first glyph set, the second distinct group of glyphs including one or more glyphs from the first glyph set which are not in the first distinct group of glyphs and which have collectively overlapping expanded bounding boxes; and
associating each glyph in the second distinct group of glyphs with a second text area.
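The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made for illustration only: the `Glyph` class, the choice of font and size as the grouping attribute values, the rule of expanding each initial bounding box by a margin proportional to font size (the `factor` parameter), and the reading of "collectively overlapping" as membership in a chain of pairwise-overlapping expanded boxes, found here with a union-find structure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    # Hypothetical glyph record; the attribute names are assumptions.
    char: str
    font: str     # assumed grouping attribute
    size: float   # assumed grouping attribute
    box: tuple    # initial bounding box as (x0, y0, x1, y1)


def sort_into_glyph_sets(glyphs):
    """Associate glyphs with glyph sets keyed by their grouping attribute values."""
    sets = defaultdict(list)
    for g in glyphs:
        sets[(g.font, g.size)].append(g)
    return dict(sets)


def expand_box(glyph, factor=0.5):
    """Expand the initial box by a margin of size * factor (an assumed rule)."""
    x0, y0, x1, y1 = glyph.box
    m = glyph.size * factor
    return (x0 - m, y0 - m, x1 + m, y1 + m)


def overlaps(a, b):
    """True if two axis-aligned boxes intersect or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def text_areas(glyph_set, factor=0.5):
    """Split one glyph set into distinct groups of chained overlapping boxes."""
    boxes = [expand_box(g, factor) for g in glyph_set]
    parent = list(range(len(glyph_set)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    areas = defaultdict(list)
    for i, g in enumerate(glyph_set):
        areas[find(i)].append(g)
    return list(areas.values())


# Illustrative data only: three glyphs share one font and size, one differs.
glyphs = [
    Glyph("H", "Arial", 10.0, (0, 0, 6, 10)),
    Glyph("i", "Arial", 10.0, (7, 0, 9, 10)),
    Glyph("X", "Arial", 10.0, (200, 0, 206, 10)),
    Glyph("T", "Times", 24.0, (0, 50, 14, 74)),
]
glyph_sets = sort_into_glyph_sets(glyphs)
first_set_areas = text_areas(glyph_sets[("Arial", 10.0)])
```

In this example the initial boxes of "H" and "i" do not touch, but their expanded boxes do, so both glyphs fall into one text area, while the distant "X" forms a second text area within the same glyph set. The pairwise comparison is quadratic in the number of glyphs; a spatial index could replace it for large pages.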