US 12,386,919 B2
Synthetic data generation for machine learning model simulation
Akash Singh, Gurgaon (IN); Debadri Basak, Kolkata (IN); Mohan Krishna Kusuma, Amalapuram (IN); Rajdeep Dua, Hyderabad (IN); Gowri Shankar Raju Kurapati, Guntur (IN); and Shashank Tyagi, Hyderabad (IN)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on Jan. 11, 2022, as Appl. No. 17/573,585.
Prior Publication US 2023/0222178 A1, Jul. 13, 2023
Int. Cl. G06N 20/00 (2019.01); G06F 3/0482 (2013.01); G06F 18/214 (2023.01); G06N 5/025 (2023.01)
CPC G06F 18/2148 (2023.01) [G06F 3/0482 (2013.01); G06N 5/025 (2013.01); G06N 20/00 (2019.01)]
21 Claims
 
1. A method comprising:
receiving, at a synthetic data generation system, a schema configuration file in a synthetic data set request from a client application, wherein the schema configuration file defines a plurality of input features and a target variable to be generated for each item in a synthetic data set, wherein the schema configuration file includes feature characteristics for each of the plurality of input features, wherein the schema configuration file includes a set of one or more distribution parameters for a first input feature of the plurality of input features, wherein the schema configuration file includes a set of one or more correlations between the first input feature and the target variable for each item in the synthetic data set;
generating, by the synthetic data generation system, a plurality of items for the synthetic data set based on the schema configuration file by performing the following for the plurality of items:
generating synthetic data for the plurality of input features based on the feature characteristics and the set of one or more distribution parameters included in the schema configuration file;
generating a set of one or more if-then-else rules based at least on the set of one or more correlations, wherein the set of one or more if-then-else rules expresses how the synthetic data generated for the first input feature should affect generation of synthetic data for the target variable; and
generating synthetic data for the target variable based at least on the synthetic data generated for the first input feature and the set of one or more correlations between the first input feature and the target variable included in the schema configuration file, wherein generating the synthetic data for the target variable includes applying at least one of the set of one or more if-then-else rules; and
training a machine learning (ML) model using the generated synthetic data.
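
The generation steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The schema layout, every key name (`num_items`, `features`, `distribution`, `correlations`, `base_rate`, and so on), the threshold form of the correlation, and the function names are hypothetical, chosen only to show how a schema configuration could drive feature generation, if-then-else rule derivation, and target generation.

```python
import random

# Hypothetical schema configuration. Every key name is an illustrative
# assumption; the claim does not specify a file format or field names.
SCHEMA = {
    "num_items": 1000,
    "features": [
        # Feature characteristics plus distribution parameters.
        {"name": "age", "type": "numeric",
         "distribution": {"mean": 40.0, "stddev": 10.0}},
        {"name": "region", "type": "categorical",
         "values": ["north", "south", "east"]},
    ],
    "target": {"name": "converted", "base_rate": 0.2},
    # Assumed correlation form: when the feature exceeds the threshold,
    # the target equals 1 with the given probability.
    "correlations": [
        {"feature": "age", "threshold": 50.0, "probability": 0.9},
    ],
}


def generate_features(schema, rng):
    """Draw one value per input feature from its configured characteristics."""
    item = {}
    for feat in schema["features"]:
        if feat["type"] == "numeric":
            dist = feat["distribution"]
            item[feat["name"]] = rng.gauss(dist["mean"], dist["stddev"])
        else:
            item[feat["name"]] = rng.choice(feat["values"])
    return item


def build_rules(schema):
    """Derive one if-then-else rule (a callable) per configured correlation."""
    base_rate = schema["target"]["base_rate"]
    rules = []
    for corr in schema["correlations"]:
        def rule(item, rng, corr=corr):
            # IF the feature exceeds the threshold THEN use the correlated
            # probability ELSE fall back to the target's base rate.
            if item[corr["feature"]] > corr["threshold"]:
                p = corr["probability"]
            else:
                p = base_rate
            return 1 if rng.random() < p else 0
        rules.append(rule)
    return rules


def generate_dataset(schema, seed=0):
    """Generate all items: features first, then the rule-driven target."""
    rng = random.Random(seed)
    rules = build_rules(schema)
    target_name = schema["target"]["name"]
    dataset = []
    for _ in range(schema["num_items"]):
        item = generate_features(schema, rng)
        # Apply at least one rule; this sketch uses only the first.
        item[target_name] = rules[0](item, rng)
        dataset.append(item)
    return dataset


if __name__ == "__main__":
    data = generate_dataset(SCHEMA, seed=0)
    print(data[0])
```

The final training step is left out of the sketch. The list of dictionaries returned by `generate_dataset` could be passed to any ML training routine, with `converted` as the label.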