US 12,069,079 B1
Generating synthetic datapoints from observed datapoints for training machine learning models
Jocelyn Beauchesne, Saint-Lormal (FR); John Lim Oh, Mukilteo, WA (US); Vasudha Shivamoggi, Cambridge, MA (US); and Roy Donald Hodgman, Cambridge, MA (US)
Assigned to Rapid7, Inc., Boston, MA (US)
Filed by Rapid7, Inc., Boston, MA (US)
Filed on Oct. 17, 2022, as Appl. No. 17/967,243.
Application 17/967,243 is a continuation of application No. 17/024,506, filed on Sep. 17, 2020, granted, now 11,509,674.
Int. Cl. H04L 9/40 (2022.01); G06N 5/04 (2023.01); G06N 20/00 (2019.01)
CPC H04L 63/1425 (2013.01) [G06N 5/04 (2013.01); G06N 20/00 (2019.01)]
20 Claims
 
1. A system comprising:
one or more computing devices that implement a synthetic data generation system, configured to:
obtain a plurality of observed datapoints in a feature space encoding metadata of hosts;
select an observed datapoint from the plurality of observed datapoints;
select a direction of the synthetic datapoint relative to the observed datapoint in the feature space;
generate a plurality of synthetic datapoints in the direction with increasing distances;
stop the generation of the synthetic datapoints in response to a determination that a probability of observing a last one of the synthetic datapoints is less than a specified threshold; and
add the synthetic datapoints to a dataset, wherein the dataset is used to train or test one or more machine learning models used to analyze the metadata.
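The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not say how the direction is chosen, how the distances grow, or how the probability of observing a point is estimated, so everything below is an assumption made for illustration: the function names `kde_density` and `generate_synthetic`, the isotropic Gaussian kernel density estimate standing in for the probability, the random unit direction, the fixed `step`, the `max_points` safety cap, and the random stand-in data. The sketch also keeps the final low-probability point in the output, which is one possible reading of the claim.

```python
import numpy as np


def kde_density(x, observed, bandwidth=1.0):
    """Isotropic Gaussian kernel density estimate at x.

    Assumption: the claim only requires "a probability of observing"
    a point; a KDE is one possible stand-in for it.
    """
    d = observed.shape[1]
    sq_dist = np.sum((observed - x) ** 2, axis=1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (d / 2.0)
    return float(np.mean(np.exp(-sq_dist / (2.0 * bandwidth ** 2))) / norm)


def generate_synthetic(observed, threshold, step=0.5, max_points=1000, rng=None):
    """Walk away from one observed datapoint until the density drops below threshold."""
    rng = np.random.default_rng() if rng is None else rng

    # Select an observed datapoint (hypothetical choice: uniformly at random).
    start = observed[rng.integers(len(observed))]

    # Select a direction in the feature space (hypothetical choice: random unit vector).
    direction = rng.normal(size=observed.shape[1])
    direction /= np.linalg.norm(direction)

    synthetic = []
    for k in range(1, max_points + 1):
        # Each new point lies farther from the start along the same direction.
        point = start + k * step * direction
        synthetic.append(point)
        # Stop once the last generated point is unlikely to be observed.
        if kde_density(point, observed) < threshold:
            break
    return np.array(synthetic)


# Usage with random stand-in data (real inputs would encode host metadata).
rng = np.random.default_rng(0)
observed = rng.normal(size=(50, 2))
threshold = 1e-3

synthetic = generate_synthetic(observed, threshold, rng=rng)

# Add the synthetic datapoints to a dataset for training or testing.
dataset = np.vstack([observed, synthetic])
```

Because the loop breaks at the first point whose estimated density falls below `threshold`, every synthetic point except the last lies in a region where observations are still plausible under the assumed density model.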