| CPC G06F 16/215 (2019.01) [G06F 16/9024 (2019.01); G06V 30/19093 (2022.01); G06V 30/412 (2022.01)] | 20 Claims |

|
1. A computer-implemented method comprising:
obtaining, by a de-duplication server, a candidate pair of a plurality of digitally stored documents from a document database, identifying text elements from each digitally stored document in the candidate pair in response, and storing the text elements as document extraction attributes;
automatically computing and storing, by the de-duplication server, relative positional differences of the text elements between each digitally stored document of the candidate pair and a document similarity score based on the relative positional differences;
comparing, by the de-duplication server, the relative positional differences with a similarity function to form a difference similarity vector for the candidate pair, wherein the difference similarity vector comprises components corresponding to each relative positional difference;
aggregating the components of the difference similarity vector to determine a final score for the candidate pair;
determining a document-level similarity metric from the final score;
determining, by the de-duplication server, whether the final score is above a cutoff value, and in response to determining that the final score for the candidate pair is above the cutoff value, comparing the document extraction attributes with the final score;
determining whether the document-level similarity metric is above a threshold value by the de-duplication server;
classifying the candidate pair based on determining that the document-level similarity metric is above the threshold value to de-duplicate the plurality of digitally stored documents in the candidate pair;
based on classifying, removing duplicate transaction documents from the document database by any of deleting records, marking records, updating column attributes, or writing records to a different table.
|
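The scoring steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation; it is one possible reading under stated assumptions. The `TextElement` type, the use of normalized page coordinates, the exponential similarity function, the mean as aggregation, the exact-match attribute comparison, and the `scale`, `cutoff`, and `threshold` values are all hypothetical choices that the claim does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TextElement:
    """Hypothetical extraction attribute: a labeled text value and its position."""
    label: str   # e.g. "total"
    text: str    # extracted text value
    x: float     # assumed normalized horizontal position in [0, 1]
    y: float     # assumed normalized vertical position in [0, 1]


def positional_differences(doc_a, doc_b):
    """Euclidean offset of each label that appears in both documents."""
    by_label_b = {e.label: e for e in doc_b}
    diffs = {}
    for e in doc_a:
        other = by_label_b.get(e.label)
        if other is not None:
            diffs[e.label] = math.hypot(e.x - other.x, e.y - other.y)
    return diffs


def similarity_vector(diffs, scale=0.05):
    """One component per difference, in (0, 1]; 1.0 means identical position."""
    return [math.exp(-d / scale) for d in diffs.values()]


def final_score(vector):
    """Aggregate the components by arithmetic mean; an empty vector scores 0."""
    return sum(vector) / len(vector) if vector else 0.0


def attribute_agreement(doc_a, doc_b):
    """Fraction of shared labels whose extracted text matches exactly."""
    text_b = {e.label: e.text for e in doc_b}
    shared = [e for e in doc_a if e.label in text_b]
    if not shared:
        return 0.0
    return sum(e.text == text_b[e.label] for e in shared) / len(shared)


def classify_pair(doc_a, doc_b, cutoff=0.5, threshold=0.8):
    """Return True when the candidate pair is classified as a duplicate."""
    score = final_score(similarity_vector(positional_differences(doc_a, doc_b)))
    if score <= cutoff:
        return False  # layout too different; attributes are not compared
    # Assumed document-level metric: layout score weighted by attribute agreement.
    metric = score * attribute_agreement(doc_a, doc_b)
    return metric > threshold


invoice = [TextElement("vendor", "Acme", 0.10, 0.05),
           TextElement("total", "42.00", 0.80, 0.90)]
rescan = [TextElement("vendor", "Acme", 0.11, 0.05),
          TextElement("total", "42.00", 0.80, 0.91)]
other = [TextElement("vendor", "Acme", 0.10, 0.05),
         TextElement("total", "17.50", 0.80, 0.90)]

is_rescan_duplicate = classify_pair(invoice, rescan)
is_other_duplicate = classify_pair(invoice, other)
```

In this sketch the slightly shifted rescan is classified as a duplicate of the invoice, while the document with the same layout but a different total is not, because the cheap positional score only gates the attribute comparison and does not decide the outcome alone.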