US 12,073,246 B2
Data curation with synthetic data generation
Vengateswaran Chandrasekaran, Trichy (IN); Manan Dhyani, Mumbai (IN); Amit Joshi, Bengaluru (IN); Sriram Narasimhan, Pleasanton, CA (US); and Vinay Santurkar, Santa Clara, CA (US)
Assigned to SAP SE, Walldorf (DE)
Filed by SAP SE, Walldorf (DE)
Filed on Jun. 25, 2021, as Appl. No. 17/358,979.
Prior Publication US 2022/0413905 A1, Dec. 29, 2022
Int. Cl. G06F 3/00 (2006.01); G06F 9/48 (2006.01); G06F 16/23 (2019.01)
CPC G06F 9/4881 (2013.01) [G06F 16/2365 (2019.01)]
18 Claims
 
1. A system, comprising:
at least one processor; and
at least one memory including program code which when executed by the at least one processor provides operations comprising:
identifying an identifier field included in a first datatype of a seed data sample associated with a source system, the identifier field storing a first value that enables a differentiation between different instances of the first datatype, wherein the seed data sample comprises data that occurs at the source system;
identifying a relationship field included in the first datatype of the seed data sample, the relationship field storing a second value that defines a relationship between the first datatype of the seed data sample and a second datatype;
generating, based at least on the seed data sample, a first synthetic data sample, the generating includes populating the identifier field of the first synthetic data sample with a first synthetically generated value and the relationship field of the first synthetic data sample with the second value, wherein the first synthetic data sample is generated to supplement a volume and a diversity of the data at the source system; and
sending, to a target system, the first synthetic data sample to enable a performance of a task at the target system;
in response to determining that the first datatype is a parent datatype to a second datatype, propagating, to a second synthetic data sample of the second datatype, a change corresponding to the first synthetically generated value populating the identifier field of the first synthetic data sample.
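
The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The order and line-item schema, the field names (`order_id`, `customer_id`, `item_id`), the `SYN-` prefix, and the helper names `generate_synthetic` and `propagate_to_children` are all assumptions introduced for illustration. The sketch copies a seed record, replaces only the identifier field with a generated value, keeps the relationship field's seed value, and then carries the new identifier into copies of the child records.

```python
import copy
import uuid


def generate_synthetic(seed, id_field, rel_field):
    """Copy a seed record, replacing only the identifier field with a fresh value."""
    sample = copy.deepcopy(seed)
    # Hypothetical identifier scheme: a random UUID fragment with a "SYN-" prefix.
    sample[id_field] = f"SYN-{uuid.uuid4().hex[:8]}"
    # The relationship field keeps the seed's value so existing links still resolve.
    sample[rel_field] = seed[rel_field]
    return sample


def propagate_to_children(children, parent_ref_field, old_id, new_id):
    """Return copies of the child records that point at the parent's new identifier."""
    synthetic_children = []
    for child in children:
        if child[parent_ref_field] == old_id:
            clone = copy.deepcopy(child)
            clone[parent_ref_field] = new_id
            synthetic_children.append(clone)
    return synthetic_children


# Hypothetical seed data: an order (parent datatype) and its line items (child datatype).
seed_order = {"order_id": "ORD-1001", "customer_id": "CUST-7", "amount": 250.0}
seed_items = [
    {"item_id": "ITM-1", "order_id": "ORD-1001", "sku": "A-12"},
    {"item_id": "ITM-2", "order_id": "ORD-1001", "sku": "B-34"},
]

synthetic_order = generate_synthetic(seed_order, "order_id", "customer_id")
synthetic_items = propagate_to_children(
    seed_items, "order_id", seed_order["order_id"], synthetic_order["order_id"]
)
```

In this sketch the seed records are never modified, so the synthetic samples add to the seed data rather than replacing it. A fuller version would presumably also assign new identifiers to the child records themselves, which is omitted here for brevity.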