US 12,242,806 B2
Systems and methods for structure and header extraction
Richard Anthony Pito, Toronto (CA)
Assigned to Thomson Reuters Enterprise Centre GmbH, Zug (CH)
Filed by Thomson Reuters Enterprise Centre GmbH, Zug (CH)
Filed on Aug. 17, 2023, as Appl. No. 18/451,153.
Application 18/451,153 is a continuation of application No. 17/156,546, filed on Jan. 23, 2021, granted, now 11,763,079.
Claims priority of provisional application 62/975,514, filed on Feb. 12, 2020.
Claims priority of provisional application 62/965,516, filed on Jan. 24, 2020.
Claims priority of provisional application 62/965,523, filed on Jan. 24, 2020.
Claims priority of provisional application 62/965,520, filed on Jan. 24, 2020.
Prior Publication US 2024/0054286 A1, Feb. 15, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 40/279 (2020.01); G06F 3/0481 (2022.01); G06F 40/109 (2020.01); G06F 40/137 (2020.01); G06F 40/166 (2020.01); G06F 40/232 (2020.01); G06F 40/242 (2020.01); G06F 40/258 (2020.01); G06F 40/284 (2020.01); G06F 40/289 (2020.01); G06V 30/416 (2022.01)
CPC G06F 40/279 (2020.01) [G06F 3/0481 (2013.01); G06F 40/109 (2020.01); G06F 40/137 (2020.01); G06F 40/166 (2020.01); G06F 40/232 (2020.01); G06F 40/242 (2020.01); G06F 40/258 (2020.01); G06F 40/284 (2020.01); G06F 40/289 (2020.01); G06V 30/416 (2022.01)]
17 Claims
1. A system comprising memory and one or more processors communicatively coupled to the memory, the one or more processors configured to:
identify an orthography feature related to orthography characteristics for a respective header from a plurality of headers included in a document, wherein the orthography feature is determined based on capitalization formatting for a string of characters in a data chunk that corresponds to the respective header included in the document;
identify two or more typography features related to typography characteristics for the respective header from the plurality of headers included in the document, wherein the two or more typography features comprise at least (a) a first typography feature configured with a first binary value or a second binary value based on a font setting for at least one character in the data chunk and (b) a second typography feature configured based on a page layout setting for the document;
generate a graph representation of the plurality of headers based at least in part on the orthography feature and the two or more typography features, wherein respective vertices of the graph representation correspond to the respective headers;
compare the graph representation of the plurality of headers to a predefined graph representation to determine a performance metric for the graph representation; and
perform one or more computer-implemented processing tasks with respect to the document based at least in part on the graph representation and the performance metric.
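
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. Every name in it (HeaderChunk, orthography_feature, typography_features, build_header_graph, edge_f1) is hypothetical, and three choices are assumptions made only for illustration: using bold as the font setting and left indent relative to the page margin as the page layout setting, assigning a hierarchy level to each distinct feature signature in order of first appearance, and using edge-level F1 as the performance metric against a predefined graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderChunk:
    """Hypothetical data chunk for one header (all fields are assumptions)."""
    text: str
    bold: bool          # font setting for the characters in the chunk
    left_indent: float  # distance from the page's left edge, in points


def orthography_feature(text: str) -> str:
    """Classify the capitalization formatting of the header string."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "none"
    if all(c.isupper() for c in letters):
        return "upper"
    words = [w for w in text.split() if w[0].isalpha()]
    if words and all(w[0].isupper() for w in words):
        return "title"
    return "other"


def typography_features(chunk: HeaderChunk, page_left_margin: float):
    """Return (binary font feature, layout feature relative to the margin)."""
    is_bold = 1 if chunk.bold else 0
    indent = round(chunk.left_indent - page_left_margin, 1)
    return is_bold, indent


def build_header_graph(chunks, page_left_margin: float = 72.0):
    """Build parent -> child edges; vertices are header indices.

    Assumed heuristic: headers with the same feature signature sit at the
    same level, and a new signature nests one level below the previous header.
    """
    level_of_signature = {}
    stack = []   # (level, header index) for the currently open ancestors
    edges = set()
    for i, chunk in enumerate(chunks):
        signature = (orthography_feature(chunk.text),
                     *typography_features(chunk, page_left_margin))
        if signature not in level_of_signature:
            level_of_signature[signature] = stack[-1][0] + 1 if stack else 0
        level = level_of_signature[signature]
        # Close any open headers at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            edges.add((stack[-1][1], i))
        stack.append((level, i))
    return edges


def edge_f1(predicted: set, reference: set) -> float:
    """Compare a generated graph to a predefined one by edge-level F1."""
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Illustrative input; the header strings and measurements are made up.
chunks = [
    HeaderChunk("ARTICLE I", True, 72.0),
    HeaderChunk("Section 1.1 Definitions", False, 90.0),
    HeaderChunk("Section 1.2 Term", False, 90.0),
    HeaderChunk("ARTICLE II", True, 72.0),
    HeaderChunk("Section 2.1 Fees", False, 90.0),
]
edges = build_header_graph(chunks)
predefined = {(0, 1), (0, 2), (3, 4)}
score = edge_f1(edges, predefined)
print(sorted(edges), score)
```

A downstream task in the sense of the final claim element could then be gated on the score, for example by using the extracted hierarchy only when the score exceeds a chosen threshold; the threshold and the task are likewise assumptions, not details taken from the claim.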