US 12,217,523 B2
List and tabular data extraction system and method
Andre Chatzistamatiou, Essen (DE); Florin Cremenescu, Biot (FR); Yizhen Dai, The Hague (NL); and Ludo Gerardus Wilhelmus van Alst, Nijmegen (NL)
Assigned to Accenture Global Solutions Limited, Dublin (IE)
Filed by Accenture Global Solutions Limited, Dublin (IE)
Filed on Aug. 29, 2022, as Appl. No. 17/898,193.
Claims priority of application No. 22305866 (EP), filed on Jun. 14, 2022.
Prior Publication US 2023/0410543 A1, Dec. 21, 2023
Int. Cl. G06V 30/413 (2022.01); G06F 16/906 (2019.01); G06F 16/93 (2019.01); G06F 18/20 (2023.01); G06F 18/24 (2023.01); G06F 40/106 (2020.01); G06F 40/177 (2020.01); G06F 40/258 (2020.01); G06F 40/279 (2020.01); G06F 40/30 (2020.01); G06V 30/10 (2022.01); G06V 30/146 (2022.01); G06V 30/148 (2022.01); G06V 30/162 (2022.01); G06V 30/164 (2022.01); G06V 30/412 (2022.01); G06V 30/414 (2022.01)
CPC G06V 30/413 (2022.01) [G06F 16/906 (2019.01); G06F 16/93 (2019.01); G06F 18/20 (2023.01); G06F 18/24 (2023.01); G06F 40/106 (2020.01); G06F 40/177 (2020.01); G06F 40/258 (2020.01); G06F 40/279 (2020.01); G06F 40/30 (2020.01); G06V 30/10 (2022.01); G06V 30/1463 (2022.01); G06V 30/1475 (2022.01); G06V 30/153 (2022.01); G06V 30/162 (2022.01); G06V 30/164 (2022.01); G06V 30/412 (2022.01); G06V 30/414 (2022.01)]
20 Claims
1. A computer-implemented method of tabular or list-based data extraction from document images, the method comprising:
receiving, at a server and from a first data source, a first document including a first page;
performing a column-wise pixel analysis of the first page, thereby determining that the first page includes a first table;
performing column segmentation based on signal analysis of column-wise mean pixel values of the first page, thereby identifying a set of columns;
performing row segmentation using optical character recognition (OCR)-generated bounding boxes, thereby identifying a set of rows;
selecting which rows of the set of rows belong to the first table using a first Conditional Random Fields (CRF) model, thereby localizing the first table on the first page;
selecting, for each column in the set of columns, a header name from a pre-defined set of header names, the selection being based on a classification performed by a second CRF model that evaluates at least the entire contents of that column;
mapping each item of data extracted from a cell in the first table to a field using semantic data understanding; and
generating a first digital table representing data extracted from the first table for presentation in a user interface.
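
The two segmentation steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: a simple whiteness threshold stands in for the claimed "signal analysis", vertical-overlap grouping stands in for the row logic, and the CRF-based row selection, header classification, and field mapping are omitted. The names `segment_columns`, `segment_rows`, `white_threshold`, `min_gap`, and the `(x0, y0, x1, y1)` box format are assumptions introduced for illustration only.

```python
from typing import List, Tuple

# Hypothetical OCR bounding-box format: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def column_means(page: List[List[int]]) -> List[float]:
    """Mean grayscale value of each pixel column (0 = black, 255 = white)."""
    height = len(page)
    return [sum(row[x] for row in page) / height for x in range(len(page[0]))]


def segment_columns(page: List[List[int]],
                    white_threshold: float = 250.0,
                    min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges, end exclusive, for runs of inked columns.

    A run is closed once at least `min_gap` consecutive near-white pixel
    columns (a gutter) follow it. Both parameters are assumed values.
    """
    means = column_means(page)
    spans: List[Tuple[int, int]] = []
    start = None
    gap = 0
    for x, mean in enumerate(means):
        if mean < white_threshold:      # some ink in this pixel column
            if start is None:
                start = x
            gap = 0
        elif start is not None:         # white column after an open run
            gap += 1
            if gap >= min_gap:
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:               # run still open at the page edge
        spans.append((start, len(means) - gap))
    return spans


def segment_rows(boxes: List[Box]) -> List[List[Box]]:
    """Group OCR boxes into rows: a box joins the current row if it starts
    above that row's bottom edge, otherwise it opens a new row."""
    rows: List[List[Box]] = []
    bottom = 0
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and box[1] < bottom:
            rows[-1].append(box)
            bottom = max(bottom, box[3])
        else:
            rows.append([box])
            bottom = box[3]
    return [sorted(row) for row in rows]  # left-to-right within each row


# Toy page: 6 x 12 pixels, ink in x = 1..3 and x = 7..9.
page = [[0 if 1 <= x <= 3 or 7 <= x <= 9 else 255 for x in range(12)]
        for _ in range(6)]
columns = segment_columns(page)

# Toy OCR output: two words on each of two text lines.
boxes = [(7, 0, 10, 2), (1, 0, 4, 2), (1, 3, 4, 5), (7, 3, 10, 5)]
rows = segment_rows(boxes)
```

On the toy page, `columns` holds one span per inked run and `rows` holds one list of boxes per text line; intersecting the two would give candidate cells. A real page would need noise handling (for example, smoothing the column means before thresholding), which this sketch leaves out.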