US 12,242,568 B2
Guided augmentation of data sets for machine learning models
Ariel Gedaliah Kobren, Cambridge, MA (US); Swetasudha Panda, Burlington, MA (US); Michael Louis Wick, Lexington, MA (US); Qinlan Shen, Burlington, MA (US); and Jason Anthony Peck, Andover, MA (US)
Assigned to Oracle International Corporation, Redwood Shores, CA (US)
Filed by Oracle International Corporation, Redwood Shores, CA (US)
Filed on Sep. 6, 2022, as Appl. No. 17/903,798.
Claims priority of provisional application 63/352,110, filed on Jun. 14, 2022.
Prior Publication US 2023/0401286 A1, Dec. 14, 2023
Int. Cl. G06F 18/214 (2023.01); G06F 40/56 (2020.01)
CPC G06F 18/2148 (2023.01) [G06F 40/56 (2020.01)]
21 Claims
 
1. One or more non-transitory computer-readable media storing program instructions that, when executed by one or more hardware processors, cause performance of operations comprising:
constructing a set of training inputs for training a synthetic data generation model, wherein the constructing the set of training inputs comprises:
extracting, from a first training data set, a plurality of pairs of sentences that meet a similarity criterion, individual pairs of the plurality of pairs including a first sentence and a second sentence;
for the individual pairs of the plurality of pairs:
extracting a first subset of words from the first sentence, the first subset excluding one or more words included in the first sentence; and
generating a first training instance from the set of training inputs, the first training instance comprising: (a) a model input including the second sentence and the subset of words from the first sentence, and (b) a model output including the first sentence; and
training the synthetic data generation model using the set of training inputs.
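
The construction of training inputs recited in claim 1 can be illustrated with a minimal Python sketch. The claim does not say how the similarity criterion is evaluated or how the subset of words is chosen, so the word-overlap ratio from `difflib.SequenceMatcher`, the random sampling of words, and the names `similar_pairs`, `word_subset`, and `build_training_inputs` are all assumptions made for illustration only. The final training step is omitted because it depends on the model used.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(sentences, threshold=0.6):
    """Yield (first, second) pairs whose word-level overlap ratio meets the threshold.

    Assumption: the similarity criterion is a simple sequence-overlap ratio.
    """
    for first, second in combinations(sentences, 2):
        ratio = SequenceMatcher(None, first.lower().split(), second.lower().split()).ratio()
        if ratio >= threshold:
            yield first, second


def word_subset(sentence, keep_fraction, rng):
    """Return a subset of the sentence's words that leaves out at least one word.

    Assumption: words are sampled at random; original word order is preserved.
    """
    words = sentence.split()
    if not words:
        return []
    # Keep at least one word where possible, but never all of them.
    keep = min(len(words) - 1, max(1, round(len(words) * keep_fraction)))
    kept_positions = sorted(rng.sample(range(len(words)), keep))
    return [words[i] for i in kept_positions]


def build_training_inputs(sentences, threshold=0.6, keep_fraction=0.5, seed=0):
    """Build one training instance per similar pair.

    Each instance has (a) a model input holding the second sentence and the
    word subset from the first sentence, and (b) a model output holding the
    first sentence.
    """
    rng = random.Random(seed)
    instances = []
    for first, second in similar_pairs(sentences, threshold):
        instances.append({
            "input": {"context": second, "keywords": word_subset(first, keep_fraction, rng)},
            "output": first,
        })
    return instances


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat sat on the rug",
        "stock prices fell sharply today",
    ]
    for instance in build_training_inputs(corpus):
        print(instance)
```

In this sketch, a generation model trained on such instances would learn to reconstruct the first sentence from a similar sentence plus a partial set of its words. Any sequence-to-sequence model could consume the resulting input and output pairs.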