US 12,242,982 B1
Method of using clusters to train supervised entity resolution in big data
George Anwar Dany Beskales, Waltham, MA (US); Pedro Giesemann Cattori, Cambridge, MA (US); Alexandra V. Batchelor, Cambridge, MA (US); Brian A. Long, Somerville, MA (US); and Nikolaus Bates-Haus, Littleton, MA (US)
Assigned to TAMR, INC., Cambridge, MA (US)
Filed by Tamr, Inc., Cambridge, MA (US)
Filed on Jun. 25, 2021, as Appl. No. 17/358,766.
Application 17/358,766 is a continuation of application No. 17/196,558, filed on Mar. 9, 2021, granted, now 11,049,028.
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 20/00 (2019.01); G06F 16/28 (2019.01); G06N 5/04 (2023.01)
CPC G06N 5/04 (2013.01) [G06F 16/285 (2019.01); G06N 20/00 (2019.01)]
20 Claims
 
1. A method of record clustering comprising:
(a) providing a collection of records, where each record in the collection has
(i) a current cluster membership, and
(ii) a proposed cluster membership,
in which some of the current cluster memberships may also be verified cluster memberships;
(b) requesting, via an interface, a subset of the collection of records and their respective current, proposed, and verified cluster memberships;
(c) indicating, via the interface, for a record in the subset of the collection of records, whether the record is a member of a cluster, thereby resulting in a revised current cluster membership and a revised verified cluster membership for the record;
(d) storing in memory for each record in the collection of records, the revised current cluster membership and the revised verified cluster membership;
(e) creating, using software code executing in a processor, inferred match training labels from the revised current cluster memberships and the revised verified cluster memberships, the inferred match training labels including pairs of records from the collection of records, each pair of records having a match label;
(f) training, using the software code executing in the processor, a pair-wise classifier using the inferred match training labels; and
(g) generating, using the software code executing in the processor, a new proposed cluster membership for each record in the collection of records using the trained pair-wise classifier and the revised verified cluster memberships, thereby producing a record clustering.
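Steps (e) through (g) can be illustrated with a minimal sketch. It is not the patented implementation, and the claim does not specify any of the rules used here, so the following are all assumptions made for illustration: labels are inferred only from pairs in which both records are verified (current memberships are ignored), the pair-wise classifier is a single threshold on token Jaccard similarity, and new clusters are formed by union-find over predicted matches while never merging two different verified clusters. All function names and sample records are hypothetical.

```python
from itertools import combinations


def infer_match_labels(verified):
    """Step (e), assumed rule: a pair of verified records is a match
    exactly when both share the same verified cluster id."""
    return {
        (a, b): verified[a] == verified[b]
        for a, b in combinations(sorted(verified), 2)
    }


def similarity(records, a, b):
    """Toy pair feature: Jaccard similarity of lowercase word tokens."""
    ta, tb = set(records[a].lower().split()), set(records[b].lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)


def train_threshold(records, labels):
    """Step (f), stand-in classifier: choose the similarity threshold
    that classifies the most labeled pairs correctly."""
    scores = {pair: similarity(records, *pair) for pair in labels}

    def correct(t):
        return sum((scores[p] >= t) == labels[p] for p in labels)

    return max(sorted(set(scores.values())), key=correct)


def propose_clusters(records, verified, threshold):
    """Step (g): union-find over predicted matches, constrained so that
    two different verified clusters are never merged."""
    parent = {r: r for r in records}
    anchor = dict(verified)  # root -> verified cluster id held by that component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        va, vb = anchor.get(ra), anchor.get(rb)
        if va is not None and vb is not None and va != vb:
            return  # would merge two distinct verified clusters
        parent[rb] = ra
        if va is None and vb is not None:
            anchor[ra] = vb

    # Verified records of the same cluster always stay together.
    first_seen = {}
    for r, c in verified.items():
        if c in first_seen:
            union(first_seen[c], r)
        else:
            first_seen[c] = r

    # Merge every pair the classifier predicts as a match.
    for a, b in combinations(sorted(records), 2):
        if similarity(records, a, b) >= threshold:
            union(a, b)

    # Components holding verified records keep the verified cluster id;
    # other components are named after their root record id.
    return {r: anchor.get(find(r), find(r)) for r in records}


# Hypothetical sample data: record id -> name, and the verified subset.
records = {
    "r1": "Acme Corp Boston",
    "r2": "Acme Corp",
    "r3": "Globex Inc Boston",
    "r4": "Globex Inc",
    "r5": "Acme Corp Cambridge",
    "r6": "Initech",
}
verified = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c2"}

labels = infer_match_labels(verified)
threshold = train_threshold(records, labels)
clusters = propose_clusters(records, verified, threshold)
```

In this sample, the unverified record r5 joins verified cluster c1 through its predicted match with r2, while r6 remains a singleton. A practical system would likely use richer pair features and a learned model in place of the single threshold.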