US 11,886,820 B2
System and method for machine-learning based extraction of information from documents
Sreekanth Menon, Bangalore (IN); Prakash Selvakumar, Bangalore (IN); and Sudheesh Sudevan, Thalassery (IN)
Assigned to Genpact Luxembourg S.à r.l. II, Luxembourg (LU)
Filed by Genpact Luxembourg S.à r.l. II, Luxembourg (LU)
Filed on Oct. 6, 2020, as Appl. No. 17/064,028.
Prior Publication US 2022/0108073 A1, Apr. 7, 2022
Int. Cl. G06F 40/295 (2020.01); G06N 20/00 (2019.01); G06F 18/214 (2023.01); G06F 18/21 (2023.01); G06F 18/23213 (2023.01)
CPC G06F 40/295 (2020.01) [G06F 18/2148 (2023.01); G06F 18/2155 (2023.01); G06F 18/2193 (2023.01); G06F 18/23213 (2023.01); G06N 20/00 (2019.01)]
20 Claims
 
1. A method for training a machine-learning (ML) system, the method comprising:
(a) providing a seed set of labeled entities as a labeled entities set based on a first cluster of a plurality of clusters of documents and using the labeled entities set to train the ML system, to obtain an ML model;
(b) using the trained ML system to predict labels for entities in an unlabeled entities set, yielding a machine-labeled entities set, the prediction providing a respective confidence score for each machine-labeled entity;
(c) selecting from the machine-labeled entities set, a subset of machine-labeled entities having a respective confidence score at least equal to a threshold confidence score;
(d) updating the labeled entities set by adding thereto the selected subset of machine-labeled entities;
(e) removing from the machine-labeled entities set the selected subset of machine-labeled entities and deleting labels assigned to the entities in the updated machine-labeled entities set to provide the unlabeled entities set for a next iteration;
(f) if a termination condition is not reached, repeating steps (a) through (e), and, otherwise, storing the ML model;
(g) selecting a second cluster from the plurality of clusters; and
(h) repeating the steps (a) through (f) for the second cluster to store a different ML model for the second cluster, wherein providing the seed set in step (a) is based on the second cluster.
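
Steps (a) through (h) describe a per-cluster self-training loop, and a minimal sketch may make the control flow easier to follow. Everything named below is hypothetical and is not taken from the patent: `CentroidModel` is a toy nearest-centroid classifier standing in for the ML system, entities are represented as numeric feature tuples, the confidence score is an inverse-distance weight, and the termination condition (empty unlabeled set, no entity reaching the threshold, or an iteration cap) is an assumption, since the claim leaves it open.

```python
from collections import defaultdict

class CentroidModel:
    """Toy stand-in for the ML model: a nearest-centroid classifier (assumption)."""

    def __init__(self, labeled):
        sums, counts = {}, defaultdict(int)
        for features, label in labeled:
            if label not in sums:
                sums[label] = [0.0] * len(features)
            for i, value in enumerate(features):
                sums[label][i] += value
            counts[label] += 1
        self.centroids = {
            label: tuple(s / counts[label] for s in total)
            for label, total in sums.items()
        }

    def predict(self, features):
        """Return (label, confidence); confidence is a normalized inverse distance."""
        dists = {
            label: sum((a - b) ** 2 for a, b in zip(features, centroid)) ** 0.5
            for label, centroid in self.centroids.items()
        }
        best = min(dists, key=dists.get)
        weights = {label: 1.0 / (d + 1e-9) for label, d in dists.items()}
        return best, weights[best] / sum(weights.values())


def self_train(seed, unlabeled, threshold=0.8, max_iters=10):
    """Steps (a)-(f) for one cluster. Returns (model, labeled set, leftover pool)."""
    labeled = list(seed)
    pool = list(unlabeled)
    model = CentroidModel(labeled)                      # (a) train on the seed set
    for _ in range(max_iters):                          # (f) assumed iteration cap
        if not pool:
            break                                       # (f) nothing left to label
        scored = [(x, *model.predict(x)) for x in pool]               # (b)
        selected = [(x, y) for x, y, conf in scored if conf >= threshold]  # (c)
        if not selected:
            break                                       # (f) no confident predictions
        labeled.extend(selected)                        # (d) grow the labeled set
        # (e) keep only the unselected entities, with their machine labels dropped
        pool = [x for x, _, conf in scored if conf < threshold]
        model = CentroidModel(labeled)                  # (a) retrain on the updated set
    return model, labeled, pool


def train_per_cluster(clusters, threshold=0.8):
    """Steps (g)-(h): store a separate model for each document cluster."""
    return {
        cluster_id: self_train(seed, unlabeled, threshold)[0]
        for cluster_id, (seed, unlabeled) in clusters.items()
    }


# Hypothetical data: two document clusters, each with its own seed set.
clusters = {
    "cluster_1": ([((0.0,), "A"), ((10.0,), "B")], [(1.0,), (9.0,), (5.0,)]),
    "cluster_2": ([((100.0,), "X"), ((200.0,), "Y")], [(110.0,)]),
}
models = train_per_cluster(clusters)
```

In this sketch the entity at 5.0 in the first cluster is equidistant from both centroids, so it never reaches the threshold and stays in the unlabeled pool when the loop ends. Each cluster gets its own model, reflecting the claim's requirement that a different model be stored for the second cluster.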