US 11,657,307 B1
Data lake-based text generation and data augmentation for machine learning training
Sravan Babu Bodapati, Bellevue, WA (US); Rishita Rajal Anubhai, Seattle, WA (US); Georgiana Dinu, New York, NY (US); and Yaser Al-Onaizan, Cortlandt Manor, NY (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Nov. 27, 2019, as Appl. No. 16/697,747.
Int. Cl. G06N 5/043 (2023.01); G06N 20/00 (2019.01); G06F 40/20 (2020.01); G06V 30/40 (2022.01); G06F 18/22 (2023.01); G06F 18/214 (2023.01)
CPC G06N 5/043 (2013.01) [G06F 18/22 (2023.01); G06F 40/20 (2020.01); G06N 20/00 (2019.01); G06V 30/40 (2022.01); G06F 18/214 (2023.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving, at a service of a multi-tenant provider network from a computing device of a user located outside the multi-tenant provider network, a first plurality of documents and a first plurality of labels;
storing the first plurality of documents and the first plurality of labels at one or more storage locations within the multi-tenant provider network;
receiving, from the computing device, a request originated by the computing device of the user to create a document classifier, the request identifying the one or more storage locations;
generating a second plurality of documents and a second plurality of labels based at least in part on the first plurality of documents, the first plurality of labels, and a repository of documents, wherein at least one document of the second plurality of documents does not exist within both the first plurality of documents and the repository of documents;
training a machine learning (ML) model using a training dataset comprising at least the first plurality of documents, the first plurality of labels, the second plurality of documents, and the second plurality of labels;
hosting the ML model within the multi-tenant provider network in association with an endpoint;
receiving an inference request at the endpoint;
generating, by the ML model, an inference based on the inference request; and
transmitting the inference to a client application or to a storage location.
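
The generation and training limitations of claim 1 can be pictured with a minimal Python sketch. The patent excerpt above does not disclose a particular text-generation technique, so the token-substitution augmenter below is a hypothetical stand-in, as are all of its names (augment_documents, the toy documents, and the toy repository); scikit-learn here stands in for the provider network's training infrastructure.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def augment_documents(documents, labels, repository, swaps=2, seed=0):
    """Derive a second plurality of documents/labels by swapping tokens in
    each labeled document for tokens drawn from the repository. Hypothetical:
    the claim covers generation methods generally, not this one in particular."""
    rng = random.Random(seed)
    repo_vocab = [tok for doc in repository for tok in doc.split()]
    new_docs, new_labels = [], []
    for doc, label in zip(documents, labels):
        tokens = doc.split()
        for _ in range(swaps):
            tokens[rng.randrange(len(tokens))] = rng.choice(repo_vocab)
        new_docs.append(" ".join(tokens))
        new_labels.append(label)  # the source document's label carries over
    return new_docs, new_labels


# First plurality of documents and labels, as uploaded by the user.
docs_1 = ["the invoice total is past due", "please schedule the meeting"]
labels_1 = ["finance", "calendar"]
# Repository of documents (the "data lake" of the title).
repository = ["remittance balance ledger payment", "agenda appointment reminder"]

# Second plurality, generated from the first plurality plus the repository.
docs_2, labels_2 = augment_documents(docs_1, labels_1, repository)

# Train the document classifier on the combined training dataset.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(docs_1 + docs_2, labels_1 + labels_2)
print(model.predict(["pay the outstanding balance"]))  # e.g. ['finance']
```

A substituted variant will generally match neither the uploaded documents nor the repository verbatim, which mirrors the claim's requirement that at least one generated document exist in neither source collection.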
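The hosting and inference limitations (endpoint, inference request, inference, transmission) might look like the following stand-alone sketch. In the claimed system this is a managed endpoint inside the multi-tenant provider network; Python's standard http.server and the throwaway JSON schema here ({"document": ...} in, {"label": ...} out) are assumptions, as is the tiny inline model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in for the trained document classifier hosted at the endpoint.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(["the invoice is past due", "please schedule the meeting"],
          ["finance", "calendar"])


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Receive an inference request at the endpoint.
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        # Generate an inference with the hosted ML model.
        label = model.predict([request["document"]])[0]
        # Transmit the inference back to the client application.
        body = json.dumps({"label": label}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), InferenceHandler).serve_forever()
```

A client request such as curl -s localhost:8080 -d '{"document": "pay the balance"}' would receive a JSON inference in response; the claim's final limitation equally allows writing the inference to a storage location rather than returning it to a client application.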