US 12,354,011 B2
Data augmentation using machine translation capabilities of language models
Subham Biswas, Maharashtra (IN); and Saurabh Tahiliani, Uttar Pradesh (IN)
Assigned to Verizon Patent and Licensing Inc., Basking Ridge, NJ (US)
Filed by VERIZON PATENT AND LICENSING INC., Basking Ridge, NJ (US)
Filed on Aug. 11, 2021, as Appl. No. 17/399,431.
Prior Publication US 2023/0050134 A1, Feb. 16, 2023
Int. Cl. G06N 20/00 (2019.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/084 (2023.01)

CPC G06N 3/084 (2013.01) [G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)]

16 Claims

1. A method comprising:

receiving a seed example, the seed example stored in a seed training data set;

encoding the seed example using a pre-trained encoder trained using a masked language model (MLM) training objective, the encoder outputting an encoded seed example as a vector representation;

inputting the encoded seed example into a recurrent neural network (RNN), the RNN configured to generate a candidate example, the candidate example comprising a new example;

determining that the candidate example is similar to the encoded seed example by: generating a vector representation of the candidate example using the pre-trained encoder used to encode the seed example; computing a similarity score between the vector representation of the candidate example and the encoded seed example using at least one of cosine similarity or Euclidean distance in a single vector space; and determining that the similarity score exceeds a configured threshold; and

augmenting the seed training data set with the candidate example, wherein the augmented seed training data set is used to train a machine learning model to improve prediction accuracy.
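
The similarity check and augmentation steps recited above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation. The function names (toy_encode, cosine_similarity, augment), the example sentences, and the 0.8 threshold are all hypothetical. In particular, toy_encode is a simple bag-of-words stand-in for the pre-trained MLM encoder, and the candidate list is supplied by hand in place of output from the RNN.

```python
import math
from collections import Counter


def toy_encode(text, vocab):
    """Hypothetical stand-in for the pre-trained MLM encoder.

    Returns a bag-of-words count vector over a fixed vocabulary, so the
    seed and every candidate land in the same vector space.
    """
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def augment(seed_set, seed, candidates, encode, threshold):
    """Return seed_set extended with candidates whose score exceeds threshold.

    Each candidate is encoded with the same encoder as the seed, scored
    against the encoded seed, and kept only if the score is above threshold.
    """
    seed_vec = encode(seed)
    accepted = []
    for candidate in candidates:
        score = cosine_similarity(encode(candidate), seed_vec)
        if score > threshold:
            accepted.append(candidate)
    return list(seed_set) + accepted


# Hypothetical data: in the claimed method the candidates come from the RNN.
seed = "please reset my account password"
candidates = [
    "reset my account password please",  # same words, different order
    "what is the weather today",         # unrelated
]
vocab = sorted({w for text in [seed] + candidates for w in text.lower().split()})
augmented = augment([seed], seed, candidates, lambda t: toy_encode(t, vocab), 0.8)
print(augmented)
```

With these toy inputs, the reordered sentence is retained and the unrelated sentence is rejected, because it shares no words with the seed. A Euclidean-distance variant would need the comparison inverted or the distance converted to a similarity, since a smaller distance indicates a closer match.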