| CPC G06F 11/1453 (2013.01) [G06F 11/1435 (2013.01); G06F 11/1464 (2013.01); G06F 16/174 (2019.01); G06F 18/2113 (2023.01); G06F 18/22 (2023.01); G06F 18/23 (2023.01)] | 20 Claims |

|
1. A method, comprising:
generating a fingerprint:tag dictionary that comprises a plurality of pairs, wherein each pair includes a fingerprint and a list of tags, which include the fingerprint, wherein each tag is assigned to one or more fingerprints;
computing one or more similarity matrixes based on every pair of two tags in the fingerprint:tag dictionary, wherein each similarity matrix identifies a relative similarity between a first list of fingerprints assigned to one of the two tags and a second list of fingerprints assigned to the other one of the two tags;
running a clustering algorithm to identify groups of similar tags based on the one or more similarity matrixes; and
deduplicating, based on the groups of similar tags, respective data associated with the fingerprints,
wherein at least one of the tags includes 10,000 fingerprints, which are generated by a hashing process.
|
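
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not name a hash function, a similarity measure, or a clustering algorithm, so SHA-256, Jaccard similarity, and threshold-based connected components are stand-ins chosen for brevity. All function names, the `threshold` parameter, and the sample tags (`vm-a`, `vm-b`, `db-x`) are hypothetical.

```python
import hashlib
from itertools import combinations


def fingerprint(chunk: bytes) -> str:
    # Assumption: SHA-256 stands in for the claim's unspecified hashing process.
    return hashlib.sha256(chunk).hexdigest()


def build_fp_tag_dict(tagged_chunks):
    """Map each fingerprint to the sorted list of tags that include it."""
    fp_tags = {}
    for tag, chunks in tagged_chunks.items():
        for chunk in chunks:
            fp_tags.setdefault(fingerprint(chunk), set()).add(tag)
    return {fp: sorted(tags) for fp, tags in fp_tags.items()}


def similarity_matrix(fp_tags):
    """Compare the fingerprint sets of every pair of tags.

    Assumption: Jaccard similarity is used as the relative similarity.
    """
    tag_fps = {}
    for fp, tags in fp_tags.items():
        for tag in tags:
            tag_fps.setdefault(tag, set()).add(fp)
    tags = sorted(tag_fps)
    n = len(tags)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = tag_fps[tags[i]], tag_fps[tags[j]]
        sim[i][j] = sim[j][i] = len(a & b) / len(a | b)
    return tags, sim


def cluster_tags(tags, sim, threshold=0.5):
    """Group tags linked by similarity >= threshold.

    Assumption: connected components (union-find) stand in for the
    claim's unspecified clustering algorithm.
    """
    parent = list(range(len(tags)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(tags)), 2):
        if sim[i][j] >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i, tag in enumerate(tags):
        groups.setdefault(find(i), []).append(tag)
    return sorted(groups.values())


def dedup_savings(fp_tags, groups):
    """Per group: (fingerprint references held, unique fingerprints to keep)."""
    out = {}
    for group in groups:
        members = set(group)
        refs = unique = 0
        for tags in fp_tags.values():
            hits = len(members.intersection(tags))
            refs += hits
            unique += 1 if hits else 0
        out[tuple(group)] = (refs, unique)
    return out


# Hypothetical input: each tag labels a list of raw data chunks.
tagged_chunks = {
    "vm-a": [b"c1", b"c2", b"c3"],
    "vm-b": [b"c2", b"c3", b"c4"],
    "db-x": [b"c9"],
}
fp_tags = build_fp_tag_dict(tagged_chunks)
tags, sim = similarity_matrix(fp_tags)
groups = cluster_tags(tags, sim)
savings = dedup_savings(fp_tags, groups)
```

In this toy input, `vm-a` and `vm-b` share two of four distinct fingerprints, so they fall into one group, and deduplicating that group would store four unique chunks instead of six references. A tag holding 10,000 fingerprints, as the final limitation recites, would go through the same steps; only the set sizes change.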