CPC G06F 40/16 (2020.01) [G06F 40/154 (2020.01); G06N 20/00 (2019.01); G06V 30/413 (2022.01)] | 20 Claims |
1. A method comprising:
obtaining a set of tagged PDF documents, each having a corresponding document object model (DOM) structure;
determining, for each tagged PDF document, relationships between graphical objects in the tagged PDF document and corresponding elements of the DOM structure of the tagged PDF document;
generating, for each tagged PDF document, a corresponding training record identifying the determined relationships;
training, using the training records, a machine learning model to determine DOM structure elements that are associated with graphical objects;
obtaining a set of untagged PDF documents which do not contain corresponding DOM structures; and
for each untagged PDF document, automatically generating, using the trained machine learning model, a corresponding tagged PDF document having one or more DOM structure elements corresponding to one or more graphical objects contained in the untagged PDF document.
|
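The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the claim does not specify a model, features, or a PDF library, so everything below is an assumption made for illustration. `GraphicalObject`, its two features (`font_size`, `y_position`), and the helper names are hypothetical, documents are plain in-memory lists rather than parsed PDF files, and a nearest-centroid classifier stands in for the machine learning model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraphicalObject:
    """Hypothetical stand-in for a text run or image drawn on a PDF page."""
    font_size: float
    y_position: float               # assumed normalised: 0.0 = page bottom, 1.0 = page top
    dom_tag: Optional[str] = None   # e.g. "H1" or "P"; None when the PDF is untagged


def build_training_records(tagged_docs):
    """One record per tagged document: (features, DOM tag) pairs for its objects."""
    return [
        [((obj.font_size, obj.y_position), obj.dom_tag) for obj in doc]
        for doc in tagged_docs
    ]


def train(records):
    """Toy model (assumption): the mean feature vector (centroid) of each DOM tag."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for record in records:
        for (size, y), tag in record:
            acc = sums[tag]
            acc[0] += size
            acc[1] += y
            acc[2] += 1
    return {tag: (s / n, y / n) for tag, (s, y, n) in sums.items()}


def predict_tag(model, obj):
    """Return the DOM tag whose centroid is closest to the object's features."""
    def sq_dist(centroid):
        return (centroid[0] - obj.font_size) ** 2 + (centroid[1] - obj.y_position) ** 2
    return min(model, key=lambda tag: sq_dist(model[tag]))


def tag_document(model, untagged_doc):
    """Produce a tagged copy of an untagged document; the input is left unchanged."""
    return [
        GraphicalObject(o.font_size, o.y_position, predict_tag(model, o))
        for o in untagged_doc
    ]


# Usage with made-up data: two tagged documents, then one untagged document.
tagged_docs = [
    [GraphicalObject(24.0, 0.95, "H1"), GraphicalObject(11.0, 0.60, "P")],
    [GraphicalObject(22.0, 0.90, "H1"), GraphicalObject(10.0, 0.40, "P")],
]
model = train(build_training_records(tagged_docs))
untagged = [GraphicalObject(23.0, 0.92), GraphicalObject(10.5, 0.50)]
result = tag_document(model, untagged)
```

In this sketch each training record pairs an object's features with the DOM element it belongs to, which mirrors the "determined relationships" of the claim. A practical system would likely need richer features and a stronger model than a per-tag centroid.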