US 11,657,071 B1
Mapping disparate datasets
Kamila Rywelska, San Francisco, CA (US); Carleton J. Lindgren, San Francisco, CA (US); Manesh Saini, New York, NY (US); and Hasan Adem Yilmaz, San Diego, CA (US)
Assigned to Wells Fargo Bank N.A., San Francisco, CA (US)
Filed by Wells Fargo Bank N.A., San Francisco, CA (US)
Filed on Dec. 1, 2020, as Appl. No. 17/109,027.
Claims priority of provisional application 62/989,479, filed on Mar. 13, 2020.
Int. Cl. G06F 16/28 (2019.01); G06F 16/2457 (2019.01)
CPC G06F 16/288 (2019.01) [G06F 16/24578 (2019.01); G06F 16/285 (2019.01)]
20 Claims
 
1. A computer system comprising:
a memory storing instructions; and
a processor configured to execute the instructions to perform operations comprising:
retrieving, from a first database of a first system, a plurality of first data entries, each of the first data entries comprising a first plurality of data fields;
retrieving, from a second database of a second system, a plurality of second data entries, each of the second data entries comprising a second plurality of data fields;
performing a machine learning operation on the first and second pluralities of data entries, the machine learning operation comprising:
selecting a plurality of text analytics operations from a set of operations consisting of latent semantic index (LSI), cosine similarity, FuzzyWuzzy operations, topic modeling with network regularization, word2vec, soft cosine similarity, doc2vec, latent Dirichlet allocation (LDA), Jensen-Shannon divergence, and Word Mover's Distance (WMD);
generating, by applying the set of selected text analytics operations to the first and second pluralities of data fields, a first number of similarity scores for the first and second pluralities of data fields; and
determining that a second number of the similarity scores exceed one or more thresholds;
verifying a number of topics in a set of topics based on a comparison of the number of topics to a result of an n-gram analysis of the machine learning operation, wherein the number of topics indicates how many topics are in the set of topics;
modifying, based on the verification, at least one of (i) the set of topics, or (ii) the number of topics;
generating, based on (i) the second number of the similarity scores satisfying the one or more thresholds, and (ii) at least one of the set of topics or the number of topics, a mapping report connecting a first subset of the first data entries in the first database to a second subset of the second data entries in the second database; and
displaying, on a display device, the mapping report indicating which ones of the first data entries are connected to which ones of the second data entries;
wherein the mapping report is updated based on a user feedback, wherein the user feedback comprises one or more additions to data field descriptions by one or more users, and wherein one or more connections corresponding to the mapping report are determined to be incorrect due to missing data field entries.
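
The scoring and thresholding steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it uses only one of the listed text analytics operations (plain cosine similarity over bag-of-words token counts), and it omits the topic-count verification and the user-feedback update. The field names, the descriptions, the 0.5 threshold, and the helper names `tokenize`, `cosine_similarity`, and `build_mapping_report` are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a field description and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description matches nothing
    return dot / (norm_a * norm_b)


def build_mapping_report(first_fields, second_fields, threshold=0.5):
    """Score every field pair and keep pairs whose score exceeds the threshold.

    Both arguments map a field name to its free-text description.
    The threshold value is an assumption, not taken from the patent.
    """
    first_vecs = {n: Counter(tokenize(d)) for n, d in first_fields.items()}
    second_vecs = {n: Counter(tokenize(d)) for n, d in second_fields.items()}
    report = []
    for f_name, f_vec in first_vecs.items():
        for s_name, s_vec in second_vecs.items():
            score = cosine_similarity(f_vec, s_vec)
            if score > threshold:
                report.append((f_name, s_name, round(score, 3)))
    # Highest-scoring connections first
    return sorted(report, key=lambda row: row[2], reverse=True)


# Hypothetical data fields from two systems
first_fields = {
    "cust_nm": "customer full name",
    "acct_open_dt": "date the account was opened",
}
second_fields = {
    "client_name": "full name of the customer",
    "opening_date": "account opened date",
    "branch_id": "branch identifier",
}

report = build_mapping_report(first_fields, second_fields)
for first_name, second_name, score in report:
    print(f"{first_name} -> {second_name} (score {score})")
```

In this toy data, `cust_nm` connects to `client_name` and `acct_open_dt` connects to `opening_date`, while `branch_id` stays unmapped because no pair involving it clears the threshold. A fuller sketch in the spirit of the claim would combine scores from several of the listed operations before thresholding.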