US 12,216,689 B2
System and method for generating and implementing context weighted words
Robert J. Rappa, Manalapan, NJ (US); Pedro A. Reyes, Bronx, NY (US); Erica J. Kim, Brooklyn, NY (US); Chen Trilnik, New York, NY (US); Michael Harmon, New York, NY (US); and Stephen Farrell, Lincoln University, PA (US)
Assigned to JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed by JPMorgan Chase Bank, N.A., New York, NY (US)
Filed on Sep. 21, 2020, as Appl. No. 17/026,805.
Claims priority of provisional application 62/903,033, filed on Sep. 20, 2019.
Prior Publication US 2021/0089562 A1, Mar. 25, 2021
Int. Cl. G06F 16/31 (2019.01); G06F 9/46 (2006.01); G06F 16/35 (2019.01); G06F 16/36 (2019.01); G06F 40/289 (2020.01)
CPC G06F 16/313 (2019.01) [G06F 9/466 (2013.01); G06F 16/3334 (2019.01); G06F 16/35 (2019.01); G06F 16/367 (2019.01); G06F 40/289 (2020.01)]
18 Claims
1. A system for improving data transaction integrity by utilizing context weighted words to improve an accuracy of an identification of parties within transactions, the system comprising:
an input configured to receive input data relating to the transactions;
a memory configured to store and manage context data and the transaction data; and
a computer processor coupled to the input and the memory and further configured to:
receive the input data comprising a set of transaction data for a plurality of transactions made at a point-of-sale device, the transaction data identifying a merchant and further including electronic funds transfer data;
apply a pre-processing process and a string cleaning process to create a list of unique company names from the set of transaction data, wherein the string cleaning process includes string normalization;
identify a context relating to the set of transaction data, the context comprising a determination of to what the transaction data pertains based on one or more characteristics that are common to the transaction data;
calculate word frequencies for each unique company name from the list of unique company names based on the context, wherein calculating the word frequencies further comprises calculating, in terms of standard deviation, a distance of each of the word frequencies from a mean of the word frequencies;
list, within a table, each of the unique company names by rank according to the word frequencies for each unique company name from the list of unique company names, wherein the table comprises: each of the unique company names, the word frequencies for each unique company name from the list of unique company names, a rank for each of the unique company names, and the distance of each of the word frequencies from the mean of the word frequencies;
utilize the table to eliminate an artificial inflation of a similarity between different unique company names having matching words by:
creating weights for each of the unique company names in the list from the set of transaction data based on a corresponding calculated word frequency in light of the context, wherein the weighting is inverse to a defined corpus of words, and wherein no weight is given to each of the word frequencies whose distance from the mean is less than three times the standard deviation;
applying a string similarity algorithm for each transaction against candidates, using the weights for each unique company name, in order to identify a best matching record from among each of the unique company names, wherein the string similarity algorithm comprises at least one from among a Jaccard algorithm and an Edit-Distance based string similarity algorithm; and
identifying the best matching record;
generate, by applying the best matching record to at least one from among at least one merchant tag of the input data and at least one counterparty tag of the input data, enriched input data that identifies at least a corporate hierarchy;
create a repository of corresponding counterparties by leveraging internal reference data and the input data relating to the transactions; and
provide, to at least one from among the merchant and at least one corresponding counterparty, the enriched input data that identifies at least the corporate hierarchy.
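
The string cleaning, word frequency, ranking, weighting, and Jaccard matching steps recited above can be illustrated with a minimal sketch. It is not the patented implementation. All function names and sample merchant names are hypothetical, and several points are assumptions: word frequency is taken as the number of unique company names containing a word, the standard deviation is the population form, tied frequencies receive consecutive ranks, and the weight is simply the reciprocal of the frequency. The three-standard-deviation cutoff on weights is deliberately left out; the distance from the mean is only computed and reported in the table. The Edit-Distance alternative is not shown.

```python
import math
import re
from collections import Counter


def normalize(name):
    """String cleaning: lowercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def unique_names(raw_names):
    """Normalized names, de-duplicated, kept in first-seen order."""
    return list(dict.fromkeys(normalize(n) for n in raw_names))


def frequency_table(names):
    """Rows of (word, frequency, rank, distance), most frequent word first.

    Assumption: frequency = number of unique names containing the word;
    distance = (frequency - mean) / population standard deviation.
    """
    counts = Counter(w for n in names for w in set(n.split()))
    if not counts:
        return []
    freqs = list(counts.values())
    mean = sum(freqs) / len(freqs)
    std = math.sqrt(sum((f - mean) ** 2 for f in freqs) / len(freqs))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (word, freq, rank, (freq - mean) / std if std else 0.0)
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]


def inverse_weights(table):
    """Assumed weighting: reciprocal of frequency, so common words count less."""
    return {word: 1.0 / freq for word, freq, _, _ in table}


def weighted_jaccard(a, b, weights, default=1.0):
    """Weighted Jaccard: shared word weight divided by total word weight."""
    words_a, words_b = set(a.split()), set(b.split())
    union = sum(weights.get(w, default) for w in words_a | words_b)
    shared = sum(weights.get(w, default) for w in words_a & words_b)
    return shared / union if union else 0.0


def best_match(query, candidates, weights):
    """Return the candidate name with the highest weighted similarity."""
    q = normalize(query)
    return max(candidates, key=lambda c: weighted_jaccard(q, c, weights))


# Hypothetical merchant strings, as they might appear in transaction data.
raw_names = [
    "ACME  Bank Inc.",
    "Acme Bank, Inc",
    "Zenith Bank Inc",
    "Harbor Bank",
    "Acme Hardware Inc",
]
names = unique_names(raw_names)
table = frequency_table(names)
weights = inverse_weights(table)

print(best_match("ZENITH BANK", names, weights))  # zenith bank inc
```

In this sample, "acme bank inc" and "zenith bank inc" share two of four distinct words, so an unweighted Jaccard score is 0.5. Because "bank" and "inc" are the most frequent words in the list, their reduced weights pull the weighted score below 0.5, which is the kind of artificial inflation the claim aims to eliminate.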