US 12,287,767 B2
De-duplicating transaction records using targeted fuzzy matching
Jyotirmaya Mahanta, Pune (IN); Ankit Narang, Pune (IN); Shoan Jain, Berkeley, CA (US); and Prasanna Kumar, Hyderabad (IN)
Assigned to Coupa Software Incorporated, San Mateo, CA (US)
Filed by Coupa Software Incorporated, San Mateo, CA (US)
Filed on Jan. 30, 2024, as Appl. No. 18/427,309.
Claims priority of provisional application 63/483,357, filed on Feb. 6, 2023.
Prior Publication US 2024/0264989 A1, Aug. 8, 2024
Int. Cl. G06F 16/215 (2019.01); G06F 16/901 (2019.01); G06V 30/19 (2022.01); G06V 30/412 (2022.01)
CPC G06F 16/215 (2019.01) [G06F 16/9024 (2019.01); G06V 30/19093 (2022.01); G06V 30/412 (2022.01)]
20 Claims
 
1. A computer-implemented method comprising:
obtaining, by a de-duplication server, a candidate pair of a plurality of digitally stored documents from a document database, identifying text elements from each digitally stored document in the candidate pair in response, and storing the text elements as document extraction attributes;
automatically computing and storing, by the de-duplication server, relative positional differences of the text elements between each digitally stored document of the candidate pair and a document similarity score based on the relative positional differences;
comparing, by the de-duplication server, the relative positional differences with a similarity function to form a difference similarity vector for the candidate pair, wherein the difference similarity vector comprises components corresponding to each relative positional difference;
aggregating the components of the difference similarity vector to determine a final score for the candidate pair;
determining a document-level similarity metric from the final score;
determining, by the de-duplication server, whether the final score is above a cutoff value, and in response to determining that the final score for the candidate pair is above the cutoff value, comparing the document extraction attributes with the final score;
determining whether the document-level similarity metric is above a threshold value by the de-duplication server;
classifying the candidate pair based on determining that the document-level similarity metric is above the threshold value to de-duplicate the plurality of digitally stored documents in the candidate pair;
based on classifying, removing duplicate transaction documents from the document database by any of deleting records, marking records, updating column attributes, or writing records to a different table.
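
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation; it only shows one way the recited flow could fit together. Everything named here is an assumption made for illustration: the `TextElement` type, normalized page coordinates, Euclidean distance as the relative positional difference, an exponential decay as the similarity function, the mean as the aggregation, the product of final score and attribute agreement as the document-level similarity metric, the `cutoff`, `threshold`, and `scale` values, and the in-memory `records` dictionary standing in for the document database.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TextElement:
    """Hypothetical document extraction attribute: a labeled text element."""
    label: str  # e.g. "invoice_number"
    text: str   # extracted text
    x: float    # assumed normalized page coordinates in [0, 1]
    y: float


def positional_differences(doc_a, doc_b):
    """Distance between same-label elements of the two documents (assumed Euclidean)."""
    b_by_label = {e.label: e for e in doc_b}
    diffs = {}
    for e in doc_a:
        other = b_by_label.get(e.label)
        if other is not None:
            diffs[e.label] = math.hypot(e.x - other.x, e.y - other.y)
    return diffs


def similarity_vector(diffs, scale=0.05):
    """One component per positional difference; exponential decay is an assumption."""
    return {label: math.exp(-d / scale) for label, d in diffs.items()}


def final_score(vector):
    """Aggregate the components; the mean is an assumed aggregation."""
    return sum(vector.values()) / len(vector) if vector else 0.0


def attribute_agreement(doc_a, doc_b):
    """Fraction of shared labels whose extracted text matches exactly."""
    b_text = {e.label: e.text for e in doc_b}
    shared = [e for e in doc_a if e.label in b_text]
    if not shared:
        return 0.0
    return sum(e.text == b_text[e.label] for e in shared) / len(shared)


def classify_pair(doc_a, doc_b, cutoff=0.8, threshold=0.7):
    """Return True when the candidate pair is classified as a duplicate."""
    score = final_score(similarity_vector(positional_differences(doc_a, doc_b)))
    if score <= cutoff:
        return False  # layouts differ too much; attributes are not compared
    # Assumed document-level metric: layout score weighted by attribute agreement.
    metric = score * attribute_agreement(doc_a, doc_b)
    return metric > threshold


def mark_duplicates(records, candidate_pairs):
    """Mark the second record of each duplicate pair (one of the recited removal options)."""
    for first_id, second_id in candidate_pairs:
        if classify_pair(records[first_id]["elements"], records[second_id]["elements"]):
            records[second_id]["is_duplicate"] = True
    return records


invoice_a = [
    TextElement("invoice_number", "INV-1", 0.10, 0.10),
    TextElement("total", "100.00", 0.80, 0.90),
]
invoice_b = list(invoice_a)  # a re-submitted copy of the same invoice
records = {
    1: {"elements": invoice_a, "is_duplicate": False},
    2: {"elements": invoice_b, "is_duplicate": False},
}
mark_duplicates(records, [(1, 2)])
print(records[2]["is_duplicate"])
```

Marking a record rather than deleting it is only one of the four removal options the claim lists; deleting the record, updating a column attribute, or writing it to a different table would replace the body of `mark_duplicates` in the same position.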