CPC G06V 30/413 (2022.01) [G06F 16/906 (2019.01); G06F 16/93 (2019.01); G06F 18/20 (2023.01); G06F 18/24 (2023.01); G06F 40/106 (2020.01); G06F 40/177 (2020.01); G06F 40/258 (2020.01); G06F 40/279 (2020.01); G06F 40/30 (2020.01); G06V 30/10 (2022.01); G06V 30/1463 (2022.01); G06V 30/1475 (2022.01); G06V 30/153 (2022.01); G06V 30/162 (2022.01); G06V 30/164 (2022.01); G06V 30/412 (2022.01); G06V 30/414 (2022.01)] | 20 Claims |
1. A computer-implemented method of tabular or list-based data extraction from document images, the method comprising:
receiving, at a server and from a first data source, a first document including a first page;
performing a column-wise pixel analysis of the first page, thereby determining that the first page includes a first table;
performing column segmentation based on signal analysis of column-wise mean pixel values of the first page, thereby identifying a set of columns;
performing row segmentation using optical character recognition (OCR)-generated bounding boxes, thereby identifying a set of rows;
selecting which rows of the set of rows belong to the first table using a first Conditional Random Fields (CRF) model, thereby localizing the first table on the first page;
selecting, for each column in the set of columns, a header name from a pre-defined set of header names, the selection being based on a classification performed by a second CRF model that evaluates at least the entire contents of that column;
mapping each item of data extracted from a cell in the first table to a field using semantic data understanding; and
generating a first digital table representing data extracted from the first table for presentation in a user interface.
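The column and row segmentation steps of claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the claim does not specify the signal analysis, so a simple search for runs of near-white pixel columns stands in for it here. The function names, the `white_threshold`, `min_gap`, and `min_overlap` parameters, and the `(x0, y0, x1, y1)` bounding-box layout are all assumptions made for illustration. The page is assumed to be a grayscale image given as a list of pixel rows, with 0 for black and 255 for white. The two CRF-based steps (row selection and header classification) are not shown.

```python
from typing import List, Tuple

# Hypothetical OCR box layout: (x0, y0, x1, y1) in pixels, origin at top-left.
Box = Tuple[int, int, int, int]


def column_means(page: List[List[int]]) -> List[float]:
    """Mean pixel value of every pixel column of a grayscale page."""
    height = len(page)
    width = len(page[0])
    return [sum(row[x] for row in page) / height for x in range(width)]


def segment_columns(
    page: List[List[int]],
    white_threshold: float = 250.0,  # assumed: means at or above this count as blank
    min_gap: int = 2,                # assumed: minimum blank run that separates columns
) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of table columns, end exclusive.

    A column ends when the column-wise mean stays near white for at
    least `min_gap` consecutive pixel columns.
    """
    means = column_means(page)
    columns: List[Tuple[int, int]] = []
    start = None  # x where the current column began, if any
    gap = 0       # length of the current blank run inside a column
    for x, mean in enumerate(means):
        if mean < white_threshold:       # ink present in this pixel column
            if start is None:
                start = x
            gap = 0
        elif start is not None:          # blank pixel column after some ink
            gap += 1
            if gap >= min_gap:
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:                # column running to the page edge
        columns.append((start, len(means) - gap))
    return columns


def segment_rows(boxes: List[Box], min_overlap: float = 0.5) -> List[List[Box]]:
    """Group OCR bounding boxes into rows by vertical overlap.

    A box joins the current row when it overlaps the row's vertical
    extent by at least `min_overlap` of the box's own height.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows:
            top = min(b[1] for b in rows[-1])
            bottom = max(b[3] for b in rows[-1])
            overlap = min(bottom, box[3]) - max(top, box[1])
            if overlap >= min_overlap * (box[3] - box[1]):
                rows[-1].append(box)
                continue
        rows.append([box])
    # Order the boxes of each row from left to right.
    return [sorted(row, key=lambda b: b[0]) for row in rows]
```

Calling `segment_columns` on a page image yields the horizontal extents of the candidate columns, and `segment_rows` on the OCR output yields candidate rows. In the claimed method, a first CRF model would then select which of those rows belong to the table, and a second CRF model would assign a header name to each column.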