US 10,891,311 B2
Method for generating synthetic data sets at scale with non-redundant partitioning
Jay Vyas, Concord, MA (US); Ronald Nowling, West Allis, WI (US); and Huamin Chen, Westborough, MA (US)
Assigned to RED HAT, INC., Raleigh, NC (US)
Filed by Red Hat, Inc., Raleigh, NC (US)
Filed on Oct. 14, 2016, as Appl. No. 15/294,142.
Prior Publication US 2018/0107729 A1, Apr. 19, 2018
Int. Cl. G06F 16/28 (2019.01); G06N 20/00 (2019.01); G16H 10/60 (2018.01); G16H 50/70 (2018.01); G06N 7/00 (2006.01)
CPC G06F 16/285 (2019.01) [G06N 7/005 (2013.01); G06N 20/00 (2019.01); G16H 10/60 (2018.01); G16H 50/70 (2018.01)]
19 Claims
 
1. A method comprising:
receiving, by a clustering module, a plurality of data sets, wherein each data set of the plurality of data sets includes a plurality of attributes;
partitioning, by the clustering module, the plurality of data sets into a plurality of clustered data sets including at least a first clustered data set and a second clustered data set, wherein each data set of the plurality of data sets is partitioned into one of the plurality of clustered data sets;
assigning, by a training module, a respective stochastic model to each respective clustered data set of the plurality of clustered data sets including:
    assigning a first stochastic model to the first clustered data set, and
    assigning a second stochastic model to the second clustered data set;
selecting, by a first machine including a first memory and one or more processors in communication with the first memory, the first clustered data set and the first stochastic model;
selecting, by a second machine that is different from the first machine, the second machine including a second memory and one or more processors in communication with the second memory, the second clustered data set and the second stochastic model;
generating, by the first machine with the first stochastic model, a first synthetic data set, wherein the first synthetic data set has generated data for each one of the plurality of attributes;
generating, by the second machine with the second stochastic model, a second synthetic data set, wherein the second synthetic data set has generated data for each one of the plurality of attributes; and
testing at least one of an application and a database using each of the first synthetic data set and the second synthetic data set.
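
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The attribute names, the fixed centroids, the nearest-centroid rule standing in for the clustering module, the independent per-attribute Gaussian standing in for the stochastic model, and the `check_application` function standing in for the application under test are all assumptions chosen for illustration. The two "machines" are simulated as two sequential calls with separate random generators; in a real deployment each call would run on a different host.

```python
import random
import statistics

ATTRIBUTES = ("age", "weight")  # hypothetical attribute names


def partition(records, centroids):
    """Clustering step (assumed rule): put each record in exactly one
    cluster, the one whose centroid is nearest."""
    clusters = [[] for _ in centroids]
    for rec in records:
        dists = [sum((rec[a] - c[a]) ** 2 for a in ATTRIBUTES) for c in centroids]
        clusters[dists.index(min(dists))].append(rec)
    return clusters


def fit_model(cluster):
    """Training step (assumed model): one (mean, std dev) pair per attribute."""
    return {
        a: (statistics.mean([r[a] for r in cluster]),
            statistics.pstdev([r[a] for r in cluster]))
        for a in ATTRIBUTES
    }


def generate(model, n, seed):
    """Generation step, run by one simulated machine with its own model.
    Every synthetic record carries a value for every attribute."""
    rng = random.Random(seed)
    return [{a: rng.gauss(mu, sigma) for a, (mu, sigma) in model.items()}
            for _ in range(n)]


def check_application(rows):
    """Hypothetical application under test: accepts rows with all attributes."""
    return all(set(row) == set(ATTRIBUTES) for row in rows)


# Hypothetical input data sets, each with the same attributes.
records = [
    {"age": 21, "weight": 60}, {"age": 25, "weight": 64}, {"age": 23, "weight": 58},
    {"age": 67, "weight": 82}, {"age": 71, "weight": 90}, {"age": 65, "weight": 85},
]
centroids = [{"age": 20, "weight": 60}, {"age": 70, "weight": 85}]  # assumed

clusters = partition(records, centroids)           # non-redundant partitioning
models = [fit_model(c) for c in clusters]          # one model per cluster
synthetic = [generate(m, 1000, seed=i)             # one simulated machine per pair
             for i, m in enumerate(models)]
results = [check_application(s) for s in synthetic]  # test with each synthetic set
```

Because every record lands in exactly one cluster, each simulated machine works from a disjoint slice of the input and its own model, so the generation calls share no state and could run in parallel.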