US 11,893,817 B2
Method and system for generating document field predictions
Paulo Abelha Ferreira, Rio de Janeiro (BR); Pablo Nascimento da Silva, Niterói (BR); Rômulo Teixeira de Abreu Pinho, Niterói (BR); Tiago Salviano Calmon, London (GB); and Vinicius Michel Gottin, Rio de Janeiro (BR)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Jul. 27, 2021, as Appl. No. 17/386,386.
Prior Publication US 2023/0031202 A1, Feb. 2, 2023
Int. Cl. G06F 30/00 (2020.01); G06V 30/412 (2022.01); G06F 16/35 (2019.01); G06V 30/413 (2022.01); G06V 30/414 (2022.01); G06F 18/214 (2023.01)
CPC G06V 30/412 (2022.01) [G06F 16/35 (2019.01); G06F 18/214 (2023.01); G06V 30/413 (2022.01); G06V 30/414 (2022.01)]
20 Claims
 
1. A method for predicting field values of documents, the method comprising:
identifying, by a document annotator, a field prediction model generation request;
in response to identifying the field prediction model generation request:
    obtaining, by the document annotator, training documents from a document manager;
    selecting, by the document annotator, a first training document of the training documents;
    making a first determination, by the document annotator, that the first training document is a text-based document; and
    in response to the first determination:
        performing, by the document annotator, text-based data extraction to identify first words and first boxes included in the first training document;
        identifying, by the document annotator, first keywords and first candidate words included in the first training document based on the first words and first boxes, wherein the first keywords specify words associated with a field, and wherein the first candidate words specify potential field values of the field; and
        generating, by the document annotator, a first annotated training document using the first keywords and the first candidate words, wherein the first annotated training document comprises color-based representation masks for the first keywords, the first candidate words, and first general words included in the first training document.
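
The annotation steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every specific in it is an assumption made for illustration: the `Word` record, the `classify` and `render_mask` helpers, the keyword set, the digit-only candidate pattern, and the three colors are all hypothetical, since the claim names none of them. The sketch takes words and boxes as already extracted, labels each word as a keyword, a candidate word, or a general word, and paints each box with the color of its label.

```python
from dataclasses import dataclass
import re

# Hypothetical color scheme; the claim does not name specific colors.
COLORS = {
    "keyword": (255, 0, 0),
    "candidate": (0, 255, 0),
    "general": (0, 0, 255),
}
BACKGROUND = (255, 255, 255)


@dataclass(frozen=True)
class Word:
    text: str
    box: tuple  # (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive


def classify(words, field_keywords, candidate_pattern):
    """Label each word as 'keyword', 'candidate', or 'general'.

    The matching rules are assumptions: a keyword is a case-insensitive
    match against a supplied set, a candidate fully matches a regex.
    """
    labels = []
    for w in words:
        if w.text.lower().strip(":") in field_keywords:
            labels.append("keyword")
        elif re.fullmatch(candidate_pattern, w.text):
            labels.append("candidate")
        else:
            labels.append("general")
    return labels


def render_mask(words, labels, width, height):
    """Return a height x width grid of RGB tuples with each box filled."""
    image = [[BACKGROUND] * width for _ in range(height)]
    for w, label in zip(words, labels):
        x0, y0, x1, y1 = w.box
        # Clip the box to the image bounds before painting.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                image[y][x] = COLORS[label]
    return image


# Words and boxes as they might come out of text-based extraction.
words = [
    Word("Invoice", (0, 0, 4, 2)),
    Word("Number:", (5, 0, 9, 2)),
    Word("12345", (10, 0, 14, 2)),
]
labels = classify(words, field_keywords={"number"}, candidate_pattern=r"\d+")
mask = render_mask(words, labels, width=16, height=3)
```

Here the three word classes share one image with one color each, which is only one possible reading of "color-based representation masks"; separate per-class masks would fit the claim language equally well.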