US 11,810,383 B2
System and method for determination of label values in unstructured documents
Devang Jagdishchandra Patel, Mumbai (IN); Prabhat Ranjan Mishra, Mumbai (IN); Ketkee Pandit, Mumbai (IN); Ankita Gupta, Mumbai (IN); Chirabrata Bhaumik, Kolkata (IN); Dinesh Yadav, Mumbai (IN); and Amit Kumar Agrawal, Kolkata (IN)
Assigned to TATA CONSULTANCY SERVICES LIMITED, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Nov. 20, 2020, as Appl. No. 17/100,205.
Claims priority of application No. 201921047655 (IN), filed on Nov. 21, 2019.
Prior Publication US 2021/0201018 A1, Jul. 1, 2021
Int. Cl. G06V 30/414 (2022.01); G06F 40/279 (2020.01); G06V 30/416 (2022.01); G06F 18/22 (2023.01); G06V 30/10 (2022.01); G06F 16/958 (2019.01); G06N 20/00 (2019.01); G06F 40/30 (2020.01); G06N 5/04 (2023.01); G06Q 30/0201 (2023.01); G06F 18/214 (2023.01); G06Q 30/016 (2023.01)
CPC G06V 30/414 (2022.01) [G06F 16/958 (2019.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06F 40/279 (2020.01); G06F 40/30 (2020.01); G06N 5/04 (2013.01); G06N 20/00 (2019.01); G06Q 30/0201 (2013.01); G06V 30/416 (2022.01); G06Q 30/016 (2013.01); G06V 30/10 (2022.01)] 19 Claims
OG exemplary drawing
 
1. A processor-implemented method for determining label values for labels in unstructured documents, the method comprising:
defining, via one or more hardware processors, an extraction profile comprising a plurality of labels for which label values are to be extracted from an unstructured document;
identifying, via the one or more hardware processors, a plurality of sections in one or more page images of the unstructured document, each section of the plurality of sections identified based on one or more image processing techniques;
generating, via the one or more hardware processors, a plurality of bounding boxes in the one or more page images, each of the plurality of bounding boxes enclosing a section of the plurality of sections;
obtaining, via the one or more hardware processors, a label value for each label from amongst the plurality of labels stored in the extraction profile, wherein obtaining the label value for each label comprises:
extracting the plurality of labels, via the one or more hardware processors, wherein extracting a label comprises performing, for each bounding box of the plurality of bounding boxes:
extracting text comprised in the bounding box, features of the bounding box, and an optical character recognition (OCR) confidence score (C_OCR) associated with the text, based on a confidence score associated with each word of the text, using an OCR technique;
determining whether a label text for a label from amongst the plurality of labels is present in the bounding box, the label text for the label comprising one of a label name and one or more synonyms for the label name;
on determination of absence of the label text in the bounding box, applying an OCR error correction model and a partial matching model, wherein the OCR error correction model utilizes a minimum distance technique to identify inaccuracies in the text identified through the OCR technique, and the partial matching model computes a level of matching between the text identified using the OCR error correction model and the label from amongst the plurality of labels; and
extracting the label from the bounding box on determination of the level of matching between the text identified using the OCR error correction model and the label from amongst the plurality of labels being greater than or equal to a predefined threshold;
identifying, from amongst the plurality of bounding boxes, a bounding box comprising a label value corresponding to the label, the bounding box being one of (i) a bounding box comprising the label text and a value matching a data type criteria for the label and (ii) a neighboring bounding box, in a vicinity of the bounding box comprising the label text, containing a value matching the data type criteria, the neighboring bounding box identified using a nearest proximity neighbor criteria;
predicting, via the one or more hardware processors, a bounding box comprising the label value associated with the label text using a deep learning model, the deep learning model trained with location information and data type criteria associated with the label values of the labels; and
obtaining, via the one or more hardware processors, an aggregate confidence score for the text in the bounding box indicative of the text being a label value for the label in the bounding box, the aggregate confidence score obtained as a weighted sum of: a confidence score (C_POS) of identification of a position of the bounding box comprising the label value in comparison with the position of the label value obtained from the deep learning model, a confidence score (C_PROXIMITY) of the extracted value of the label in a neighboring bounding box, a confidence score (C_SIZE) associated with a size of the bounding box, and the C_OCR associated with the OCR technique.
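The section identification and bounding-box generation steps recited above can be illustrated with a minimal sketch, assuming an OpenCV-style pipeline of binarization, dilation, and contour detection; the function name, kernel size, and iteration count are illustrative assumptions rather than the claimed image processing techniques.

```python
# Minimal sketch of section detection on a page image (assumes OpenCV 4.x);
# parameters are illustrative assumptions, not the patented configuration.
import cv2


def detect_section_bounding_boxes(page_image_path, kernel_size=(25, 5)):
    """Return (x, y, w, h) bounding boxes, one per detected text section."""
    gray = cv2.imread(page_image_path, cv2.IMREAD_GRAYSCALE)
    # Binarize with Otsu thresholding; invert so text pixels become foreground.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Dilate with a wide kernel so words belonging to one section merge.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernel_size)
    dilated = cv2.dilate(binary, kernel, iterations=2)
    # Each external contour of the dilated image approximates one section.
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]
```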
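The minimum distance technique and partial matching recited for OCR error correction can be sketched as follows; the edit-distance routine, the similarity normalization, and the 0.8 threshold are assumptions chosen for illustration, not the claimed models.

```python
# Illustrative label matching: a Levenshtein (minimum edit distance) step to
# tolerate OCR inaccuracies, followed by a partial-matching similarity level.
# The threshold and scoring are assumptions, not the patented models.
def levenshtein(a, b):
    """Minimum number of edits (insert, delete, substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def match_label(ocr_text, label_texts, threshold=0.8):
    """Return (label, level of matching), or (None, score) below the threshold.

    label_texts maps each label to its label name and synonyms.
    """
    best_label, best_score = None, 0.0
    for label, names in label_texts.items():
        for name in names:
            distance = levenshtein(ocr_text.lower(), name.lower())
            # Partial matching: turn the distance into a similarity in [0, 1].
            score = 1.0 - distance / max(len(name), len(ocr_text), 1)
            if score > best_score:
                best_label, best_score = label, score
    return (best_label, best_score) if best_score >= threshold else (None, best_score)


# Example: OCR confusions such as l/I and rn/m still resolve to the right label.
labels = {"invoice_number": ["Invoice Number", "Invoice No."],
          "due_date": ["Due Date", "Payment Due"]}
print(match_label("lnvoice Nurnber", labels))   # ('invoice_number', 0.8)
```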
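The nearest proximity neighbor criteria used to pick the neighboring bounding box holding the value can be reduced to a simple geometric rule; the centre-to-centre Euclidean distance below is an assumed metric, and boxes are assumed to be (x, y, w, h) tuples.

```python
# Illustrative nearest-proximity selection; the distance metric and box format
# are assumptions, not the claimed criteria.
def nearest_value_box(label_box, candidate_boxes):
    """Pick the candidate box closest to the label box.

    candidate_boxes should already contain only boxes whose text matches the
    data type criteria for the label (e.g. date-shaped text for a date label).
    """
    lx, ly, lw, lh = label_box
    label_centre = (lx + lw / 2.0, ly + lh / 2.0)

    def centre_distance(box):
        x, y, w, h = box
        cx, cy = x + w / 2.0, y + h / 2.0
        return ((cx - label_centre[0]) ** 2 + (cy - label_centre[1]) ** 2) ** 0.5

    return min(candidate_boxes, key=centre_distance) if candidate_boxes else None
```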
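The aggregate confidence score is recited as a weighted sum of C_POS, C_PROXIMITY, C_SIZE, and C_OCR; the claim does not state the weights, so the values below are assumptions used only to show the arithmetic.

```python
# Illustrative weighted sum; the weight values are assumptions, not from the claim.
def aggregate_confidence(c_pos, c_proximity, c_size, c_ocr,
                         weights=(0.4, 0.3, 0.1, 0.2)):
    """Combine the component confidence scores (each in [0, 1]) into one score."""
    w_pos, w_prox, w_size, w_ocr = weights
    return (w_pos * c_pos + w_prox * c_proximity
            + w_size * c_size + w_ocr * c_ocr)


# Example: position agrees with the deep learning prediction (0.9), the value
# sits in the nearest neighboring box (0.8), the box size is typical (0.7),
# and the OCR engine reported an average word confidence of 0.95.
print(aggregate_confidence(0.9, 0.8, 0.7, 0.95))   # ~0.86
```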