CPC G06F 16/93 (2019.01) [G06F 16/325 (2019.01); G06F 16/35 (2019.01); G06F 40/284 (2020.01)] | 20 Claims |
1. A method comprising:
generating, by processing hardware, an ordered token list comprising tokens representing a plurality of character strings from a digital document ordered based on a frequency of occurrence of the tokens in connection with the digital document;
generating, by the processing hardware, a document signature for the digital document by:
selecting a first subset of tokens from the ordered token list, the first subset of tokens corresponding to character strings having frequencies of occurrence in the digital document above a frequency threshold, wherein the frequency threshold indicates a specific number of occurrences of a lemmatized word in the digital document;
selecting a second subset of tokens from the ordered token list, the second subset of tokens corresponding to character strings having frequencies of occurrence in the digital document below the frequency threshold; and
concatenating the first subset of tokens and the second subset of tokens into a token sequence;
generating a hash value for the digital document from an additional subset of tokens selected from the first subset of tokens corresponding to the character strings having frequencies of occurrence in the digital document above the frequency threshold;
determining, by the processing hardware accessing a plurality of digital documents stored at a digital content database, a cluster of similar digital documents by utilizing the hash value of the digital document to compare the digital document to the plurality of digital documents according to hash values of the plurality of digital documents; and
determining a similarity of the digital document to one or more additional digital documents in the cluster of similar digital documents by comparing tokens of the document signature of the digital document and tokens of one or more document signatures of the one or more additional digital documents.
|
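The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. Everything below is an assumption made for illustration: the function names, the parameter values (`threshold`, `n_high`, `n_low`, `n_hash`), the use of lowercased regex word matches in place of true lemmatization, SHA-1 as the hash function, exact hash equality as the clustering rule, and Jaccard overlap as the signature comparison. The claim itself does not specify any of these choices.

```python
import hashlib
import re
from collections import Counter, defaultdict


def ordered_tokens(text):
    """Return (token, count) pairs, most frequent first; ties broken alphabetically."""
    # Assumption: lowercased word matches stand in for lemmatized words.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def signature_and_hash(text, threshold=2, n_high=4, n_low=4, n_hash=2):
    """Build a document signature and a coarse hash from the ordered token list."""
    ranked = ordered_tokens(text)
    # First subset: tokens occurring more often than the threshold.
    high = [tok for tok, count in ranked if count > threshold][:n_high]
    # Second subset: tokens occurring less often than the threshold.
    low = [tok for tok, count in ranked if count < threshold][:n_low]
    # Signature: concatenation of the two subsets into one token sequence.
    signature = tuple(high + low)
    # Hash: computed from a further subset of the high-frequency tokens.
    # Sorting the subset first is an illustrative choice, not taken from the claim.
    key = " ".join(sorted(high[:n_hash]))
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return signature, digest


def cluster_by_hash(documents):
    """Group document ids by hash value; also return each document's signature."""
    clusters = defaultdict(list)
    signatures = {}
    for doc_id, text in documents.items():
        signature, digest = signature_and_hash(text)
        signatures[doc_id] = signature
        clusters[digest].append(doc_id)
    return dict(clusters), signatures


def signature_similarity(sig_a, sig_b):
    """Jaccard overlap of two signatures (an assumed similarity measure)."""
    a, b = set(sig_a), set(sig_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


# Hypothetical toy corpus, used only to exercise the functions above.
documents = {
    "a": "cat cat cat dog dog dog bird fish",
    "b": "cat cat cat dog dog dog bird snake",
    "c": "tax tax tax law law law court judge",
}
clusters, signatures = cluster_by_hash(documents)
```

In this toy corpus, documents `a` and `b` share their most frequent tokens and therefore land in the same hash bucket, while `c` does not. The finer signature comparison then only needs to run within a bucket, which reflects the two-stage structure of the claim: a cheap hash lookup to find candidate documents, followed by a token-level comparison of signatures.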