CPC G06F 16/316 (2019.01) [G06F 40/117 (2020.01); G06F 40/137 (2020.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06V 30/412 (2022.01); G06V 30/414 (2022.01)]
20 Claims
1. A method comprising:
obtaining, at a device, a hierarchical structure representing a graphical layout of content items of an electronic document, the content items including at least text;
generating a word embedding representing a word of the electronic document;
determining position information of a location of the word in the electronic document;
determining a descriptor that indicates a relationship of the location to the hierarchical structure;
providing input data to a machine learning model to generate a semantic region category label of a semantic region of the electronic document, the semantic region including the word, wherein the input data includes the word embedding, the position information, and the descriptor; and
generating a character index selector indicating characters of the electronic document that are associated with the semantic region, the character index selector indicating one or more ranges of character indices in a character listing for the electronic document, wherein the character index selector indicates multiple ranges of character indices in the character listing, and wherein a gap between a first range of the multiple ranges and each remaining range of the multiple ranges indicates that the semantic region includes discontinuous text.
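The character index selector recited in the final step of claim 1 can be illustrated with a minimal sketch. The claim does not specify how ranges are encoded, so the half-open `(start, end)` tuples, the function names `build_selector`, `is_discontinuous`, and `select_text`, and the sample character listing are all hypothetical choices made for illustration only.

```python
from typing import Iterable, List, Tuple

# Assumption: a range is a half-open pair (start, end) over the character listing.
Range = Tuple[int, int]


def build_selector(indices: Iterable[int]) -> List[Range]:
    """Collapse character indices into sorted, non-adjacent half-open ranges."""
    ranges: List[Range] = []
    for i in sorted(set(indices)):
        if ranges and ranges[-1][1] == i:
            # Index is adjacent to the last range, so extend that range.
            ranges[-1] = (ranges[-1][0], i + 1)
        else:
            # A gap was found, so start a new range.
            ranges.append((i, i + 1))
    return ranges


def is_discontinuous(selector: List[Range]) -> bool:
    """More than one range means a gap separates parts of the region's text."""
    return len(selector) > 1


def select_text(listing: str, selector: List[Range]) -> List[str]:
    """Return the text fragments of the listing covered by each range."""
    return [listing[start:end] for start, end in selector]


# Hypothetical character listing: a paragraph interrupted by a figure marker.
listing = "Intro text. [Figure 1] More intro."
second_start = listing.find("More")

# Hypothetical region membership: the two paragraph fragments, not the marker.
region_indices = list(range(0, 11)) + list(range(second_start, len(listing)))

selector = build_selector(region_indices)
fragments = select_text(listing, selector)
```

In this sketch the selector holds two ranges, and the gap between them corresponds to the figure marker that splits the paragraph, which is one way a semantic region could include discontinuous text.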