| CPC G06V 30/412 (2022.01) [G06V 30/186 (2022.01); G06V 30/19153 (2022.01); G06V 30/19167 (2022.01)] | 20 Claims |

|
13. A method of data processing comprising:
accessing an image of a document including a plurality of data units;
obtaining a document image by converting the document into an image format;
implementing a connected components process that analyzes the document image as a series of sub-graphs;
determining that the plurality of data units includes at least one floating image based on the connected components process;
disregarding the at least one floating image from further processing;
identifying, serially, one of a structured data unit and unstructured floating text from a first masked image and a second masked image generated from the first masked image;
identifying corresponding regions of the document image including one or more of the structured data unit and the unstructured floating text;
obtaining optical character recognition (OCR) input from the corresponding document image regions including the one or more of the structured data unit and the unstructured floating text,
wherein the OCR input includes textual data obtained based on a semantic context derived from logical boundaries defined by the corresponding document image regions; and
generating a machine-consumable data set including entities extracted from the OCR input.
|
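
The connected components and floating-image steps recited above can be illustrated with a minimal sketch. It treats a binarized page as a pixel adjacency graph, labels each 4-connected group of foreground pixels as one sub-graph, and masks out any group that looks like a floating image. The claim does not state how a floating image is recognized, so the size and fill-ratio test below is an assumption, and every name here (`connected_components`, `is_floating_image`, `mask_floating_images`, `min_area`, `min_fill`) is hypothetical rather than taken from the patent.

```python
from collections import deque


def connected_components(grid):
    """Return one list of (row, col) pixels per 4-connected foreground group."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over the pixel adjacency graph.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right), inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def is_floating_image(pixels, min_area=9, min_fill=0.9):
    """Assumed heuristic: a large, densely filled blob is an image, not text."""
    top, left, bottom, right = bounding_box(pixels)
    area = (bottom - top + 1) * (right - left + 1)
    return area >= min_area and len(pixels) / area >= min_fill


def mask_floating_images(grid):
    """Return a copy of grid with floating images zeroed, plus their boxes."""
    masked = [row[:] for row in grid]
    dropped = []
    for pixels in connected_components(grid):
        if is_floating_image(pixels):
            dropped.append(bounding_box(pixels))
            for y, x in pixels:
                masked[y][x] = 0
    return masked, dropped


# Toy page: a solid 3x3 block (image-like) plus a few small text-like marks.
page = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
]
first_masked, dropped_boxes = mask_floating_images(page)
```

In this toy input, only the solid block passes the assumed heuristic and is removed, so `first_masked` keeps the small marks for later region identification and OCR. A real pipeline would more likely run a library labeling routine on a scanned page and tune the criteria to the documents at hand.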