US 12,222,815 B2
Efficient dictionary data structure to find similar backup clients
Smriti Thakkar, San Jose, CA (US); Tony T. Wong, Milpitas, CA (US); and Abhinav Duggal, Jersey City, NJ (US)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Dec. 17, 2020, as Appl. No. 17/125,536.
Prior Publication US 2022/0197755 A1, Jun. 23, 2022
Int. Cl. G06F 11/14 (2006.01); G06F 16/174 (2019.01); G06F 18/2113 (2023.01); G06F 18/22 (2023.01); G06F 18/23 (2023.01)

CPC G06F 11/1453 (2013.01) [G06F 11/1435 (2013.01); G06F 11/1464 (2013.01); G06F 16/174 (2019.01); G06F 18/2113 (2023.01); G06F 18/22 (2023.01); G06F 18/23 (2023.01)]

20 Claims

1. A method, comprising:

generating a fingerprint:tag dictionary that comprises a plurality of pairs, wherein each pair includes a fingerprint and a list of tags, which include the fingerprint, wherein each tag is assigned to one or more fingerprints;

computing one or more similarity matrixes based on every pair of two tags in the fingerprint:tag dictionary, wherein each similarity matrix identifies a relative similarity between a first list of fingerprints assigned to one of the two tags and a second list of fingerprints assigned to the other one of the two tags;

running a clustering algorithm to identify groups of similar tags based on the one or more similarity matrixes; and

deduplicating, based on the groups of similar tags, respective data associated with the fingerprints,

wherein at least one of the tags includes 10,000 fingerprints, which are generated by a hashing process.
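
The first three steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the use of SHA-1 as the hashing process, Jaccard similarity as the "relative similarity" measure, and threshold-based single-linkage grouping as the clustering algorithm are all assumptions chosen for illustration; the claim does not name any of them. The final deduplication step is omitted.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def fingerprint(chunk: bytes) -> str:
    """Hash a data chunk into a fingerprint (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


def build_fingerprint_tag_dict(tagged_chunks):
    """Map each fingerprint to the sorted list of tags whose data contains it."""
    fp_tags = defaultdict(set)
    for tag, chunks in tagged_chunks.items():
        for chunk in chunks:
            fp_tags[fingerprint(chunk)].add(tag)
    return {fp: sorted(tags) for fp, tags in fp_tags.items()}


def similarity_matrix(fp_tags):
    """Jaccard similarity for every pair of tags, derived from the dictionary."""
    per_tag = defaultdict(int)   # number of fingerprints assigned to each tag
    shared = defaultdict(int)    # number of fingerprints shared by each tag pair
    for tags in fp_tags.values():
        for t in tags:
            per_tag[t] += 1
        # tags is sorted, so each pair (a, b) is emitted with a < b
        for a, b in combinations(tags, 2):
            shared[(a, b)] += 1
    sim = {}
    for a, b in combinations(sorted(per_tag), 2):
        inter = shared[(a, b)]
        union = per_tag[a] + per_tag[b] - inter
        sim[(a, b)] = inter / union
    return sim


def cluster_tags(sim, tags, threshold=0.5):
    """Group tags whose pairwise similarity meets the threshold (union-find)."""
    parent = {t: t for t in tags}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), s in sim.items():
        if s >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for t in sorted(tags):
        groups[find(t)].append(t)
    return sorted(groups.values())


# Hypothetical backup clients (tags) and their data chunks.
tagged_chunks = {
    "client_a": [b"c1", b"c2", b"c3"],
    "client_b": [b"c1", b"c2", b"c4"],
    "client_c": [b"c9"],
}
fp_tags = build_fingerprint_tag_dict(tagged_chunks)
sim = similarity_matrix(fp_tags)
groups = cluster_tags(sim, list(tagged_chunks), threshold=0.5)
```

In this sketch the similarity values are computed in a single pass over the dictionary rather than by intersecting per-tag fingerprint lists pair by pair, which is one plausible reason to key the structure by fingerprint. Here `client_a` and `client_b` share two of four distinct fingerprints, so they fall into one group, while `client_c` remains alone.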