US 12,321,702 B2
Automatically augmenting and labeling conversational data for training machine learning models
Deepa Mohan, Los Altos, CA (US); Komal Arvind Dhuri, San Jose, CA (US); Simral Chaudhary, Sunnyvale, CA (US); and Jorge Adrian Sanchez Castro, Long Island City, NY (US)
Assigned to WALMART APOLLO, LLC, Bentonville, AR (US)
Filed by Walmart Apollo, LLC, Bentonville, AR (US)
Filed on Jan. 31, 2022, as Appl. No. 17/589,860.
Prior Publication US 2023/0244871 A1, Aug. 3, 2023
Int. Cl. G10L 15/06 (2013.01); G06F 18/2431 (2023.01); G06F 40/289 (2020.01); G06F 40/284 (2020.01); G06N 20/00 (2019.01)

CPC G06F 40/289 (2020.01) [G06F 18/2431 (2023.01); G06F 40/284 (2020.01); G06N 20/00 (2019.01); G10L 15/063 (2013.01)]

20 Claims

1. A system comprising:

one or more processors; and

one or more non-transitory computer-readable media storing computing instructions that, when executed on the one or more processors, cause the one or more processors to perform:

generating training data for an intent classification machine learning model by:

determining, via a text-to-text machine learning model, one or more respective paraphrases for each sample phrase of training phrases, wherein:

a respective quantity of the one or more respective paraphrases varies for the each sample phrase of the training phrases;

generating, via a label generating machine learning model, labeled data based on unlabeled live logs by:

determining live-log samples from the unlabeled live logs, comprising:

stratifying the unlabeled live logs into multiple data bins based on a respective timestamp of each of the unlabeled live logs; and

randomly selecting respective unlabeled live logs from each of the multiple data bins to add to the live-log samples;

wherein a respective quantity of the respective unlabeled live logs for the each of the multiple data bins is: (a) a predetermined number; or (b) proportional to a respective size of the each of the multiple data bins; and

generating, via the label generating machine learning model, the labeled data based on the live-log samples and one or more labeling functions; and

adding the one or more respective paraphrases for the each sample phrase of the training phrases and the labeled data to the training data; and

transmitting the training data, as generated, to the intent classification machine learning model for training.
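
The live-log sampling step recited in claim 1 (stratifying unlabeled live logs into data bins by timestamp, then randomly selecting from each bin either a predetermined number or a number proportional to the bin size) can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not the patented implementation: the function name `stratified_sample`, the `(timestamp, text)` tuple layout, the choice of calendar date as the bin key, and the rounding rule for the proportional case are all hypothetical and do not come from the claim.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta


def stratified_sample(logs, per_bin=None, fraction=None, seed=0):
    """Sample unlabeled logs from timestamp-based bins.

    logs     -- list of (timestamp, text) tuples (assumed layout)
    per_bin  -- option (a): a predetermined number drawn from each bin
    fraction -- option (b): draw a count proportional to each bin's size
    Exactly one of per_bin or fraction must be given.
    """
    if (per_bin is None) == (fraction is None):
        raise ValueError("specify exactly one of per_bin or fraction")

    rng = random.Random(seed)  # seeded so the sketch is reproducible

    # Stratify: one bin per calendar date (an assumed binning rule;
    # the claim only requires that bins be based on the timestamp).
    bins = defaultdict(list)
    for timestamp, text in logs:
        bins[timestamp.date()].append((timestamp, text))

    samples = []
    for day in sorted(bins):
        members = bins[day]
        if per_bin is not None:
            # (a) predetermined number, capped at the bin size
            k = min(per_bin, len(members))
        else:
            # (b) proportional to bin size; at least one log per bin
            # (the floor of one is an assumption of this sketch)
            k = min(len(members), max(1, round(fraction * len(members))))
        # Random selection without replacement within the bin
        samples.extend(rng.sample(members, k))
    return samples


# Hypothetical demo data: 10 logs on one day, 4 on the next.
_base = datetime(2022, 1, 1)
demo_logs = (
    [(_base + timedelta(hours=i), f"utterance a{i}") for i in range(10)]
    + [(_base + timedelta(days=1, hours=i), f"utterance b{i}") for i in range(4)]
)

fixed = stratified_sample(demo_logs, per_bin=3)
proportional = stratified_sample(demo_logs, fraction=0.5)
print(len(fixed), len(proportional))
```

With a fixed count, every bin contributes equally regardless of traffic volume; with a proportional count, busier periods contribute more samples. Which option suits a given deployment is a design choice the claim leaves open.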