US 12,229,780 B2
Embedding service for unstructured data
Runhua Zhao, Milpitas, CA (US); Vinay Patlolla, Daly City, CA (US); Nikolas Terani, Los Angeles, CA (US); Taylor J. Cressy, Los Angeles, CA (US); and Henry Venturelli, Los Angeles, CA (US)
Assigned to Intuit Inc., Mountain View, CA (US)
Filed by Intuit Inc., Mountain View, CA (US)
Filed on Jul. 30, 2021, as Appl. No. 17/389,532.
Prior Publication US 2023/0035639 A1, Feb. 2, 2023
Int. Cl. G06Q 40/00 (2023.01); G06F 16/338 (2019.01); G06Q 20/40 (2012.01); G06N 20/00 (2019.01)

CPC G06Q 20/4016 (2013.01) [G06F 16/3347 (2019.01); G06F 16/338 (2019.01); G06F 16/353 (2019.01); G06N 20/00 (2019.01)]

12 Claims

1. A method for detecting fraud in a transaction record, comprising:

receiving, from a software application, by an online fraud determination service, the transaction record, wherein the record comprises an untransformed transaction including a first unstructured data, wherein the online fraud determination service includes an embedding model, a cluster model, a query generator, a transaction transformer, and a fraud determination model;

generating, by the embedding model corresponding to the first unstructured data, a first vector from the first unstructured data included in the untransformed transaction, wherein the embedding model is trained to generate vectors from the first unstructured data;

receiving, by the cluster model, from the embedding model, the first vector corresponding to the first unstructured data;

assigning, by the cluster model, for the first vector, a first cluster ID by matching the first vector with a first matching cluster vector,

wherein the first cluster ID is based on a cluster of vectors within a threshold distance of a centroid of the cluster of vectors,

the centroid represents an average of the vectors in that cluster,

wherein the first cluster ID is one of a first set of cluster ID's in which each cluster ID in the first set is expressed in a fixed format comprising an integer or alphanumeric string, and

the cluster model is trained to cluster vectors from the first unstructured data;

generating, by a query generator, a first query using the first cluster ID and the untransformed transaction;

generating, using the first query, a query result from a plurality of features of prior transactions stored in a feature store, wherein the features are generated from a plurality of prior transformed transactions, wherein each transformed transaction comprises one or more cluster IDs;

transforming, by a transaction transformer, using a plurality of cluster IDs generated by the cluster model, wherein the transformed transactions are generated from a plurality of untransformed transactions by transforming them to a plurality of transformed transactions,

wherein the plurality of transformed transactions each comprise a cluster ID, and

wherein the plurality of untransformed transactions comprises the untransformed transaction, and

wherein transforming each of the untransformed transactions comprises replacing the first unstructured data from the untransformed transaction with the first cluster ID, assigned to the first vector generated from the first unstructured data, in a transformed transaction of the plurality of transformed transactions;

generating, by a feature generator, a plurality of features from the plurality of transformed transactions and storing, by the feature generator, the plurality of features in a database,

wherein the plurality of features comprise cluster-derived features including cluster ID's expressed in the fixed format;

applying the fraud determination model to the query result to generate a fraud score for the transformed transaction,

wherein the fraud determination model has been trained on the cluster-derived features expressed in the fixed format and on other non-cluster derived features,

wherein the fraud score is based on the query result and indicates a probability that the transformed transaction is fraudulent, and

wherein the fraud determination model is trained on untransformed transactions and a combination of features derived from training transactions labeled as fraudulent or valid;

presenting the fraud score and the first cluster ID to a user of the software application; and

updating the cluster model to add or delete or modify the clusters to generate a second set of cluster ID's, wherein the second set of cluster ID's is expressed in the fixed format, whereby generating the second set of cluster ID's does not affect the input or output of the fraud determination model.
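
The clustering and transformation steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not name an embedding technique, a distance metric, a concrete ID format, or the behavior when no cluster matches, so the following are all assumptions made for illustration: the hypothetical `toy_embed` function (a character-count stand-in for the trained embedding model), Euclidean distance, the `C0001`-style ID format, and returning `None` when no centroid lies within the threshold. All function names are hypothetical.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def toy_embed(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for the trained embedding model.

    Counts alphanumeric characters into `dim` buckets. A real system
    would use a learned model; this only yields a deterministic vector.
    """
    vec = [0.0] * dim
    for ch in text.lower():
        if ch.isalnum():
            vec[ord(ch) % dim] += 1.0
    return vec


def centroid(vectors: List[Vector]) -> Vector:
    """Centroid as the element-wise average of the vectors in a cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def format_cluster_id(index: int) -> str:
    """Assumed fixed format: the letter 'C' plus a zero-padded integer."""
    return f"C{index:04d}"


def assign_cluster_id(
    vec: Vector, centroids: Dict[str, Vector], threshold: float
) -> Optional[str]:
    """Return the ID of the nearest centroid if it lies within `threshold`.

    Returning None for an unmatched vector is an assumption; the claim
    does not specify that case.
    """
    best_id, best_dist = None, float("inf")
    for cluster_id, center in centroids.items():
        dist = math.dist(vec, center)  # Euclidean distance (assumed metric)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id if best_dist <= threshold else None


def transform_transaction(
    txn: dict, field: str, centroids: Dict[str, Vector], threshold: float
) -> dict:
    """Copy `txn`, replacing the unstructured `field` with its cluster ID."""
    transformed = dict(txn)
    transformed[field] = assign_cluster_id(
        toy_embed(txn[field]), centroids, threshold
    )
    return transformed


if __name__ == "__main__":
    # Build one cluster from two similar memo strings (illustrative data).
    centroids = {
        format_cluster_id(1): centroid(
            [toy_embed("office supplies"), toy_embed("office supply")]
        )
    }
    txn = {"amount": 42.50, "memo": "office supplies"}
    print(transform_transaction(txn, "memo", centroids, threshold=3.0))
```

Because downstream features would see only the fixed-format ID string and never the raw text or the vector, clusters could in principle be added, removed, or re-fit without changing the shape of the input to the fraud determination model, which is the property the final limitation of the claim describes.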