US 11,837,005 B2
Machine learning based end-to-end extraction of tables from electronic documents
Sunil Reddy Tiyyagura, Bengaluru (IN); and Amani Kongara, Bengaluru (IN)
Assigned to EYGS LLP, London (GB)
Filed by EYGS LLP, London (GB)
Filed on Feb. 22, 2023, as Appl. No. 18/172,461.
Application 18/172,461 is a continuation of application No. 16/781,195, filed on Feb. 4, 2020, granted, now 11,625,934.
Prior Publication US 2023/0237828 A1, Jul. 27, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06V 30/414 (2022.01); G06F 40/295 (2020.01); G06V 30/10 (2022.01)
CPC G06V 30/414 (2022.01) [G06F 40/295 (2020.01); G06V 30/10 (2022.01)]
20 Claims
 
1. A processor-readable non-transitory medium storing code representing instructions to be executed by a processor, the code comprising code to cause the processor to:
identify a plurality of word bounding boxes in a first electronic document, each word bounding box from the plurality of word bounding boxes associated with a word from a plurality of words in the first electronic document, the first electronic document including a table having a set of table cells positioned in a set of rows and a set of columns;
identify, based on a horizontal position and a vertical position of at least two corners of each word bounding box from the plurality of word bounding boxes, locations of horizontal gaps between two adjacent rows from the set of rows;
determine, using a machine learning algorithm and based on (1) the locations of horizontal gaps and (2) a plurality of entity names from the table, a class from a set of classes for each row from the set of rows in the table;
extract a set of table cell values associated with the set of table cells based on, for each row from the set of rows, the class from the set of classes for that row; and
generate a second electronic document including the set of table cell values arranged in the set of rows and the set of columns and based on (1) the locations of horizontal gaps, (2) locations of vertical gaps, such that the plurality of words in the table are computer-readable in the second electronic document.
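
The gap-finding and arrangement steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names WordBox, find_gaps, horizontal_gaps, vertical_gaps, to_grid and the min_gap parameter are hypothetical, each word bounding box is assumed to be axis-aligned and described by its top-left and bottom-right corners, and the machine-learning row classification step is omitted entirely. Under those assumptions, a horizontal gap is a band of y-coordinates that no word box covers, and a vertical gap is the same idea along x.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WordBox:
    """Hypothetical word bounding box: two corners, (x0, y0) and (x1, y1)."""
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def find_gaps(intervals: List[Tuple[float, float]],
              min_gap: float = 0.0) -> List[Tuple[float, float]]:
    """Return the empty spans wider than min_gap between the given intervals."""
    gaps = []
    covered_end = None
    for start, end in sorted(intervals):
        if covered_end is None:
            covered_end = end
        elif start - covered_end > min_gap:
            gaps.append((covered_end, start))  # nothing covers this span
            covered_end = end
        else:
            covered_end = max(covered_end, end)
    return gaps


def horizontal_gaps(boxes: List[WordBox], min_gap: float = 0.0):
    """Gaps between adjacent rows: uncovered bands along the y-axis."""
    return find_gaps([(b.y0, b.y1) for b in boxes], min_gap)


def vertical_gaps(boxes: List[WordBox], min_gap: float = 0.0):
    """Gaps between adjacent columns: uncovered bands along the x-axis."""
    return find_gaps([(b.x0, b.x1) for b in boxes], min_gap)


def to_grid(boxes: List[WordBox], min_col_gap: float = 0.0) -> List[List[str]]:
    """Arrange words into rows and columns using the midpoints of the gaps."""
    row_cuts = [(a + b) / 2 for a, b in horizontal_gaps(boxes)]
    col_cuts = [(a + b) / 2 for a, b in vertical_gaps(boxes, min_col_gap)]
    cells = [[[] for _ in range(len(col_cuts) + 1)]
             for _ in range(len(row_cuts) + 1)]
    # Visit words left to right so multi-word cells keep reading order.
    for box in sorted(boxes, key=lambda b: b.x0):
        row = bisect_right(row_cuts, (box.y0 + box.y1) / 2)
        col = bisect_right(col_cuts, (box.x0 + box.x1) / 2)
        cells[row][col].append(box.text)
    return [[" ".join(words) for words in row] for row in cells]


words = [
    WordBox("Item", 0, 0, 30, 10), WordBox("Qty", 50, 0, 70, 10),
    WordBox("Apple", 0, 20, 35, 30), WordBox("3", 50, 20, 55, 30),
]
print(horizontal_gaps(words))  # [(10, 20)]
print(vertical_gaps(words))    # [(35, 50)]
print(to_grid(words))          # [['Item', 'Qty'], ['Apple', '3']]
```

In this sketch a nonzero min_col_gap would keep ordinary inter-word spaces from being mistaken for column boundaries. A real pipeline would likely need further handling for skewed scans, multi-line cells, and cells that span columns, none of which is attempted here.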