US 12,141,806 B2
Clustering-based data selection for optimization of risk predictive machine learning models
Danny Butvinik, Haifa (IL); Maria Zatsepin, Qyriat Ono (IL); and Yoav Avneon, Ness-Zyiona (IL)
Assigned to ACTIMIZE LTD., Ra'anana (IL)
Filed by Actimize LTD., Ra'anana (IL)
Filed on May 30, 2021, as Appl. No. 17/334,743.
Prior Publication US 2022/0383322 A1, Dec. 1, 2022
Int. Cl. G06N 20/00 (2019.01); G06N 5/04 (2023.01); G06Q 20/40 (2012.01); G06F 18/214 (2023.01); G06F 18/23 (2023.01); G06F 18/24 (2023.01)

CPC G06Q 20/4016 (2013.01) [G06N 5/04 (2013.01); G06N 20/00 (2019.01); G06F 18/214 (2023.01); G06F 18/23 (2023.01); G06F 18/24 (2023.01)]

21 Claims

1. A computerized-method for generating a risk prediction model, said computerized-method comprising:

in a computerized-system comprising a processor and a memory, providing to the processor access to a data storage of transactions having data points labeled as ‘non-fraudulent’ and data points labeled as ‘fraudulent’ and operating by the processor a risk-prediction-preparation module, said risk-prediction-preparation module comprising:

(i) accessing the data storage of transactions to operate a group-by operation on transactions related to the data points, according to a logical entity, into entities and filtering entities having no ‘fraudulent’ data point into a clean-financial dataset;

(ii) clustering, by a clustering model, entities of the clean-financial dataset into one or more preconfigured number of clusters, wherein each cluster in the one or more preconfigured number of clusters has one or more data points labeled as ‘non-fraudulent’;

(iii) selecting data points of: (a) entities from the one or more preconfigured number of clusters to a first dataset and (b) a preconfigured number of entities randomly to a second dataset;

(iv) selecting all entities that have at least one ‘fraudulent’ data point in at least one related data point to add all the entities to the first dataset and the second dataset;

(v) performing feature engineering on data points in the data storage of transactions to extract features, wherein the extracted features are vectorized and scaled;

(vi) using the vectorized and scaled extracted features for training a first machine learning model of fraud detection on the first dataset and training a second machine learning model of fraud detection on the second dataset to collect results; and

(vii) using the results of the training of the first machine learning model and the results of the second machine learning model for combining the first machine learning model and the second machine learning model to an ensemble machine learning model for risk prediction according to a preconfigured approach,

wherein the preconfigured approach comprises:

(i) performing analysis of the results of the training of the first machine learning model and the results of the training of the second machine learning model; and

(ii) comparing results errors performance and calculating a correction factor for a combined voting of the first machine learning model and the second machine learning model to yield a risk prediction model.
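
The data-selection steps (i) to (iv) and the combined voting of step (vii) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the claim: the transaction fields `entity`, `amount` and `label`, the use of a one-dimensional k-means over each entity's mean amount as the clustering model, the `per_cluster` sample size, and a correction factor defined as an error-inverse weight. The claim does not specify the clustering model or the formula for the correction factor, and feature engineering and model training (steps (v) and (vi)) are omitted.

```python
import random
from collections import defaultdict

def group_by_entity(transactions):
    """Step (i): group transactions by entity and separate clean entities."""
    entities = defaultdict(list)
    for tx in transactions:
        entities[tx["entity"]].append(tx)
    clean = {e: txs for e, txs in entities.items()
             if all(tx["label"] == "non-fraudulent" for tx in txs)}
    fraud = {e: txs for e, txs in entities.items() if e not in clean}
    return clean, fraud

def kmeans_1d(points, k, iters=10):
    """Step (ii): toy 1-D k-means (an assumed stand-in for the clustering model)."""
    if not points:
        return {}
    ordered = sorted(points.values())
    n = len(ordered)
    # Spread the initial centers across the sorted values.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    assign = {}
    for _ in range(iters):
        assign = {e: min(range(k), key=lambda c: abs(v - centers[c]))
                  for e, v in points.items()}
        for c in range(k):
            members = [points[e] for e, a in assign.items() if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign

def select_datasets(transactions, k, per_cluster, n_random, seed=0):
    """Steps (i)-(iv): return the entity sets of the first and second datasets."""
    rng = random.Random(seed)
    clean, fraud = group_by_entity(transactions)
    mean_amount = {e: sum(tx["amount"] for tx in txs) / len(txs)
                   for e, txs in clean.items()}
    assign = kmeans_1d(mean_amount, k)
    # Step (iii)(a): sample clean entities from every cluster.
    first = set()
    for c in range(k):
        members = sorted(e for e, a in assign.items() if a == c)
        first.update(rng.sample(members, min(per_cluster, len(members))))
    # Step (iii)(b): sample clean entities uniformly at random.
    second = set(rng.sample(sorted(clean), min(n_random, len(clean))))
    # Step (iv): every entity with a fraudulent data point joins both datasets.
    first |= set(fraud)
    second |= set(fraud)
    return first, second

def correction_weight(err_first, err_second):
    """Assumed correction factor: the lower-error model gets the larger weight."""
    total = err_first + err_second
    return 0.5 if total == 0 else err_second / total

def combined_score(p_first, p_second, w_first):
    """Weighted vote of the two models' fraud probabilities."""
    return w_first * p_first + (1 - w_first) * p_second
```

In this sketch the two models would be trained separately on the transactions of `first` and `second`, their validation errors passed to `correction_weight`, and the resulting weight used by `combined_score` at prediction time. Other weighting schemes would fit the claim language equally well.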