US 12,299,410 B2
String similarity based weighted min-hashing
Allon Adir, Kiryat Tivon (IL); Ehud Aharoni, Kfar Saba (IL); Omri Soceanu, Haifa (IL); and Michael Mirkin, Tivon (IL)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Jun. 30, 2022, as Appl. No. 17/853,996.
Prior Publication US 2024/0004610 A1, Jan. 4, 2024
Int. Cl. G06F 16/90 (2019.01); G06F 7/02 (2006.01); G06F 16/14 (2019.01); G06F 16/31 (2019.01); G06F 16/903 (2019.01); G06F 16/2458 (2019.01)
CPC G06F 7/02 (2013.01) [G06F 16/152 (2019.01); G06F 16/313 (2019.01); G06F 16/90344 (2019.01); G06F 16/2458 (2019.01)]
20 Claims
 
1. A computer-implemented method for generating hash values to determine string similarity, the computer-implemented method comprising:
converting a first text string of a first data set into a first set of shingles;
determining a weight associated with each shingle in the first set of shingles, based, at least in part, on a particular record field associated with the shingle;
generating, based on a hash function, a hash value for each shingle in the first set of shingles;
reducing the hash value generated for each shingle in the first set of shingles, based, at least in part, on the weight associated with the shingle;
computing a hash signature value using the reduced hash value;
determining that two data records intersect according to the hash signature value; and
storing information about the intersection over a network, in a record database.
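
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the field names and weights in FIELD_WEIGHTS, the character 3-gram shingling, the use of SHA-256 as the hash function, the reduction rule -ln(u)/w (a standard weighted-sampling technique chosen here so that a larger weight yields a smaller value; the claim does not specify how the hash value is reduced), and the 0.5 match threshold. The final storing step is omitted.

```python
import hashlib
import math

# Hypothetical per-field weights: shingles from "name" count more than "city".
FIELD_WEIGHTS = {"name": 3.0, "city": 1.0}


def shingles(text, k=3):
    """Convert a text string into its set of character k-gram shingles."""
    text = text.lower()
    if not text:
        return set()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def unit_hash(seed, field, shingle):
    """Hash a (seed, field, shingle) triple to a float in (0, 1]."""
    data = f"{seed}|{field}|{shingle}".encode("utf-8")
    digest = hashlib.sha256(data).digest()
    n = int.from_bytes(digest[:8], "big") >> 11  # keep 53 bits
    return (n + 1) / 2 ** 53


def reduce_hash(u, weight):
    """Reduce a hash value by its weight (assumed rule: -ln(u) / weight).

    A higher weight gives a smaller value, so heavily weighted shingles
    are more likely to be selected as the minimum.
    """
    return -math.log(u) / weight


def signature(record, num_hashes=128, k=3):
    """Compute a signature: one minimum reduced hash per seed."""
    sig = []
    for seed in range(num_hashes):
        best = math.inf
        for field, text in record.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for s in shingles(text, k):
                best = min(best, reduce_hash(unit_hash(seed, field, s), weight))
        sig.append(best)
    return tuple(sig)


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two records agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def records_intersect(rec_a, rec_b, threshold=0.5):
    """Decide whether two records match (the threshold is an assumption)."""
    return estimated_similarity(signature(rec_a), signature(rec_b)) >= threshold
```

As a usage example, calling records_intersect on two records that differ by one letter in the "name" field and share the same "city" compares their 128-value signatures. Because the same shingle in the same field always receives the same reduced hash value, two records agree at a signature position exactly when the minimizing shingle is common to both, which under this reduction rule happens with a probability that grows with the total weight of the shingles they share.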