CPC G06F 18/2148 (2023.01) [G06F 40/56 (2020.01)] | 21 Claims |
1. One or more non-transitory computer-readable media storing program instructions that, when executed by one or more hardware processors, cause performance of operations comprising:
constructing a set of training inputs for training a synthetic data generation model, wherein the constructing the set of training inputs comprises:
extracting, from a first training data set, a plurality of pairs of sentences that meet a similarity criterion, individual pairs of the plurality of pairs including a first sentence and a second sentence;
for the individual pairs of the plurality of pairs:
extracting a first subset of words from the first sentence, the first subset excluding one or more words included in the first sentence; and
generating a first training instance from the set of training inputs, the first training instance comprising: (a) a model input including the second sentence and the subset of words from the first sentence, and (b) a model output including the first sentence; and
training the synthetic data generation model using the set of training inputs.
|
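The construction of training inputs recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claim does not specify the similarity criterion, the rule for choosing the word subset, or the input format, so the Jaccard token overlap, the `threshold` value, the word-length filter in `word_subset`, the `[SEP]` separator, and all function and class names below are assumptions made for illustration. The final training step is omitted because it depends on the model chosen.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TrainingInstance:
    model_input: str   # second sentence plus a word subset of the first sentence
    model_output: str  # the first sentence, which the model learns to produce


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for the similarity criterion."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def similar_pairs(sentences, threshold=0.3):
    """Yield (first, second) sentence pairs whose similarity meets the threshold."""
    for first, second in combinations(sentences, 2):
        if jaccard(first, second) >= threshold:
            yield first, second


def word_subset(sentence, min_len=4):
    """Keep only longer words (assumed heuristic), always excluding at least one word."""
    words = sentence.split()
    kept = [w for w in words if len(w) >= min_len]
    if len(kept) == len(words):  # nothing was excluded, so drop the last word
        kept = kept[:-1]
    return kept


def build_training_inputs(sentences, threshold=0.3):
    """Build one training instance per similar pair, as outlined in the claim."""
    instances = []
    for first, second in similar_pairs(sentences, threshold):
        keywords = word_subset(first)
        # Hypothetical input format: second sentence, separator, subset of first.
        model_input = f"{second} [SEP] {' '.join(keywords)}"
        instances.append(TrainingInstance(model_input, first))
    return instances


# Hypothetical first training data set.
corpus = [
    "book a flight to Boston",
    "book a flight to Denver",
    "what is the weather today",
]
training_inputs = build_training_inputs(corpus)
for instance in training_inputs:
    print(instance)
```

In this sketch only the two flight sentences meet the assumed similarity threshold, so a single training instance is produced. Its model input pairs the Denver sentence with the words kept from the Boston sentence, and its model output is the Boston sentence.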