| CPC G06N 3/084 (2013.01) [G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)] | 16 Claims |

|
1. A method comprising:
receiving a seed example, the seed example stored in a seed training data set;
encoding the seed example using a pre-trained encoder trained using a masked language model (MLM) training objective, the encoder outputting an encoded seed example as a vector representation;
inputting the encoded seed example into a recurrent neural network (RNN), the RNN configured to generate a candidate example, the candidate example comprising a new example;
determining that the candidate example is similar to the encoded seed example by: generating a vector representation of the candidate example using the pre-trained encoder used to encode the seed example; computing a similarity score between the vector representation of the candidate example and the encoded seed example using at least one of cosine similarity or Euclidean distance in a single vector space; and determining that the similarity score exceeds a configured threshold; and
augmenting the seed training data set with the candidate example, wherein the augmented seed training data set is used to train a machine learning model to improve prediction accuracy.
|
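
The similarity check and augmentation steps recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation. The names `toy_encode`, `euclidean_similarity`, and `augment`, the default threshold of 0.8, and the mapping of Euclidean distance to a score via 1 / (1 + d) are all assumptions made for illustration. The toy letter-count encoder stands in for the MLM pre-trained encoder, and the candidate strings are supplied directly rather than generated by an RNN.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_similarity(a: Vector, b: Vector) -> float:
    """Assumed mapping from Euclidean distance d to a score in (0, 1]: 1 / (1 + d)."""
    return 1.0 / (1.0 + math.dist(a, b))


def toy_encode(text: str) -> List[float]:
    """Hypothetical stand-in encoder: letter counts over a-z (not an MLM encoder)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def augment(
    seed_set: List[str],
    seed: str,
    candidates: List[str],
    encode: Callable[[str], Vector],
    threshold: float = 0.8,  # assumed value; the claim only says "configured"
    score: Callable[[Vector, Vector], float] = cosine_similarity,
) -> List[str]:
    """Return a copy of seed_set extended with candidates similar to the seed."""
    seed_vec = encode(seed)  # same encoder for seed and candidates
    augmented = list(seed_set)
    for candidate in candidates:
        # Keep the candidate only if its score exceeds the configured threshold.
        if score(encode(candidate), seed_vec) > threshold:
            augmented.append(candidate)
    return augmented


seeds = ["book a flight to paris"]
candidates = ["book a flight to rome", "zzz"]
augmented = augment(seeds, seeds[0], candidates, toy_encode)
```

Encoding the seed and every candidate with the same `encode` function keeps both vectors in a single vector space, which is what makes the score comparable against one threshold. Passing `score=euclidean_similarity` switches the metric without changing the filtering logic.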