US 11,727,215 B2
Searchable data structure for electronic documents
Jaidev Amrite, Austin, TX (US); Erik Skiles, Manor, TX (US); and Jashmi Lagisetty, Katy, TX (US)
Assigned to SPARKCOGNITION, INC., Austin, TX (US)
Filed by SparkCognition, Inc., Austin, TX (US)
Filed on Nov. 16, 2020, as Appl. No. 17/099,349.
Prior Publication US 2022/0156463 A1, May 19, 2022
Int. Cl. G06F 17/00 (2019.01); G06F 40/30 (2020.01); G06F 16/901 (2019.01); G06F 40/106 (2020.01); G06F 40/205 (2020.01)
CPC G06F 40/30 (2020.01) [G06F 16/9027 (2019.01); G06F 40/106 (2020.01); G06F 40/205 (2020.01)] 31 Claims
 
1. A method of generating a searchable representation of an electronic document, the method comprising:
obtaining an electronic document that includes format data specifying a graphical layout of content items, the content items including at least unstructured text and structured text;
determining pixel data representing the graphical layout of the content items;
providing input data based, at least in part, on the pixel data to a document parsing model that is trained to:
detect, within the graphical layout based on the input data, functional regions, the functional regions including first functional regions corresponding to the unstructured text and second functional regions corresponding to the structured text;
assign, based on the input data, first boundaries to the first functional regions and second boundaries to the second functional regions; and
assign a first category label to each first functional region and a second category label to each second functional region;
matching first portions of the unstructured text to corresponding first category labels of first functional regions based on the first boundaries and locations associated with the first portions of the unstructured text;
matching second portions of structured text to corresponding second category labels of second functional regions based on the second boundaries and locations associated with the second portions of structured text; and
storing each first category label and corresponding first portions of unstructured text and each second category label and corresponding second portions of structured text as document data representing the content items in a searchable data structure, wherein the searchable data structure includes node elements for the first category labels and the second category labels.
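The matching and storing steps recited above can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the document parsing model has already returned labeled regions with rectangular boundaries, and it matches a text portion to a region by testing whether the center of the portion's location falls inside the region's boundary. The names `Region`, `TextPortion`, `Node`, `build_tree`, and `search` are hypothetical, as is the choice of one node per detected region under a single root.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Region:
    """A functional region as a parsing model might report it (assumed shape)."""
    label: str        # category label, e.g. "paragraph" or "table"
    box: Box          # boundary assigned to the region
    structured: bool  # True for structured text such as a table


@dataclass
class TextPortion:
    """A portion of text plus its location taken from the format data."""
    text: str
    box: Box


@dataclass
class Node:
    """Node element of the searchable structure, carrying a category label."""
    label: str
    texts: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def match_region(portion: TextPortion, regions: List[Region]) -> Optional[int]:
    """Return the index of the first region whose boundary contains the
    center of the portion's location, or None if no region contains it."""
    px, py = center(portion.box)
    for i, region in enumerate(regions):
        x0, y0, x1, y1 = region.box
        if x0 <= px <= x1 and y0 <= py <= y1:
            return i
    return None


def build_tree(regions: List[Region], portions: List[TextPortion]) -> Node:
    """Store each category label with its matched text under a root node."""
    root = Node("document")
    root.children = [Node(region.label) for region in regions]
    for portion in portions:
        i = match_region(portion, regions)
        if i is not None:  # unmatched portions are dropped in this sketch
            root.children[i].texts.append(portion.text)
    return root


def search(root: Node, label: str) -> List[str]:
    """Collect the text stored under every node carrying the given label."""
    found: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.label == label:
            found.extend(node.texts)
        stack.extend(node.children)
    return found
```

Center-point containment is only one way to use boundaries and locations; an overlap-ratio test would tolerate text that straddles a boundary. Searching by label then reduces to a tree traversal, which is the sense in which the stored structure is searchable here.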