US 12,032,605 B2
Searchable data structure for electronic documents
William McNeill, Austin, TX (US)
Assigned to SPARKCOGNITION, INC., Austin, TX (US)
Filed by SparkCognition, Inc., Austin, TX (US)
Filed on Nov. 11, 2022, as Appl. No. 18/054,787.
Claims priority of provisional application 63/279,394, filed on Nov. 15, 2021.
Prior Publication US 2023/0153335 A1, May 18, 2023
Int. Cl. G06F 16/31 (2019.01); G06F 40/117 (2020.01); G06F 40/137 (2020.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06V 30/412 (2022.01); G06V 30/414 (2022.01)
CPC G06F 16/316 (2019.01) [G06F 40/117 (2020.01); G06F 40/137 (2020.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06V 30/412 (2022.01); G06V 30/414 (2022.01)]
20 Claims
 
1. A method comprising:
obtaining, at a device, a hierarchical structure representing a graphical layout of content items of an electronic document, the content items including at least text;
generating a word embedding representing a word of the electronic document;
determining position information of a location of the word in the electronic document;
determining a descriptor that indicates a relationship of the location to the hierarchical structure;
providing input data to a machine learning model to generate a semantic region category label of a semantic region of the electronic document, the semantic region including the word, wherein the input data includes the word embedding, the position information, and the descriptor; and
generating a character index selector indicating characters of the electronic document that are associated with the semantic region, the character index selector indicating one or more ranges of character indices in a character listing for the electronic document, wherein the character index selector indicates multiple ranges of character indices in the character listing, and wherein a gap between a first range of the multiple ranges and each remaining range of the multiple ranges indicates that the semantic region includes discontinuous text.
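
The character index selector recited in the final step of claim 1 can be illustrated with a minimal sketch. The class name `CharIndexSelector`, its fields, the half-open `[start, end)` range convention, the sample text, and the label `payment_terms` are all assumptions made for illustration and do not come from the patent. The sketch shows only how multiple ranges over a character listing can pick out a semantic region whose text is discontinuous.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CharIndexSelector:
    """Hypothetical selector: a category label plus sorted, non-overlapping
    half-open [start, end) ranges into a document's character listing."""
    label: str
    ranges: Tuple[Tuple[int, int], ...]

    def select(self, char_listing: str) -> List[str]:
        # Slice the character listing once per range.
        return [char_listing[start:end] for start, end in self.ranges]

    def is_discontinuous(self) -> bool:
        # True when every remaining range starts after the first range ends,
        # i.e. a gap separates the first range from each of the others.
        if len(self.ranges) < 2:
            return False
        first_end = self.ranges[0][1]
        return all(start > first_end for start, _ in self.ranges[1:])


# Illustrative character listing (assumed, not taken from the patent).
char_listing = "Total due: $40. See page 2. Pay by May 1."

# One semantic region made of two separated spans of text.
selector = CharIndexSelector(label="payment_terms", ranges=((0, 15), (28, 41)))
pieces = selector.select(char_listing)

# A single-range selector covers continuous text only.
contiguous = CharIndexSelector(label="note", ranges=((16, 27),))
```

Here `pieces` holds the two spans of the region, and the unselected characters between index 15 and index 28 form the gap that marks the region as discontinuous. Storing ranges instead of copied text keeps the selector small and lets several regions refer to the same character listing.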