US 12,260,326 B2
Data labeling for synthetic data generation
Austin Walters, Savoy, IL (US); Jeremy Goodsitt, Champaign, IL (US); and Anh Truong, Champaign, IL (US)
Assigned to Capital One Services, LLC, McLean, VA (US)
Filed by Capital One Services, LLC, McLean, VA (US)
Filed on Mar. 3, 2021, as Appl. No. 17/191,254.
Prior Publication US 2022/0284280 A1, Sep. 8, 2022
Int. Cl. G06N 3/08 (2023.01); G06F 16/23 (2019.01); G06N 3/044 (2023.01)

CPC G06N 3/08 (2013.01) [G06F 16/2379 (2019.01); G06N 3/044 (2023.01)]

20 Claims

1. A method comprising:

receiving, for each text document in a first set of text documents and based on first user input, an indication of a first set of labels, wherein a label indicates a location of confidential information in a text document and a type of the confidential information in the text document and wherein the first set of text documents is of a plurality of text documents;

modifying each document in the first set of text documents, wherein modifying a given first text document comprises redacting, based on one or more labels corresponding to the given first text document, confidential information from the given first text document;

based on the modified first set of text documents and the first set of labels, training a machine learning model to predict one or more labels based on an input text document, wherein predicting a label comprises predicting, for an input text document, a location of confidential information in the input text document and a type of the confidential information in the input text document;

predicting, using the machine learning model, a predicted set of labels for a second set of text documents, wherein the second set of text documents is of the plurality of text documents and wherein predicting a set of labels for a given second text document comprises predicting, for each label, a location of confidential information and a type of the confidential information in the given second text document;

receiving, for the second set of text documents and based on second user input, an indication of a second set of labels;

determining, based on the predicted set of labels and the second set of labels, that the machine learning model satisfies an accuracy threshold, wherein the accuracy threshold is a percentage indicating a degree of correlation between the predicted set of labels and the second set of labels;

predicting, using the machine learning model, a third set of labels for a third set of text documents, wherein the third set of text documents is of the plurality of text documents;

modifying each text document in the third set of text documents based on the third set of labels, wherein modifying a given third text document comprises redacting, based on one or more labels corresponding to the given third text document, confidential information from the given third text document; and

based on the third set of labels and modified third set of text documents and using a synthetic data generator that is configured to generate synthetic data based on indications of labels, generating one or more synthetic text documents, wherein generating a synthetic text document based on a given third text document comprises:

for each label associated with the given third text document, inserting, into the given third text document at a location associated with the label, synthetic data, wherein the synthetic data is based on the type of confidential information associated with the label and replaces the redacted confidential information associated with the label in the given third text document.
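The redaction, accuracy-check, and synthetic-insertion steps recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation. The `Label` structure, the bracketed placeholder format, the exact-match agreement measure, and the per-type generator functions are all hypothetical choices made for illustration, and the claim does not prescribe any of them. The machine learning model itself is omitted, so labels are supplied by hand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Label:
    """Hypothetical label: a location (character span) and a type."""
    start: int  # index of the first character of the confidential span
    end: int    # index one past the last character of the span
    kind: str   # type of confidential information, e.g. "NAME"


def redact(text: str, labels: List[Label]) -> Tuple[str, List[Label]]:
    """Replace each labeled span with a placeholder such as "[NAME]".

    Returns the redacted text and labels relocated to the placeholders.
    Assumes the labeled spans do not overlap.
    """
    parts: List[str] = []
    relocated: List[Label] = []
    cursor = 0   # position in the original text
    length = 0   # length of the redacted text built so far
    for lab in sorted(labels, key=lambda l: l.start):
        prefix = text[cursor:lab.start]
        placeholder = f"[{lab.kind}]"
        parts.append(prefix)
        length += len(prefix)
        relocated.append(Label(length, length + len(placeholder), lab.kind))
        parts.append(placeholder)
        length += len(placeholder)
        cursor = lab.end
    parts.append(text[cursor:])
    return "".join(parts), relocated


def label_agreement(predicted: Set[Label], reference: Set[Label]) -> float:
    """Percentage of user-provided labels that the prediction matched exactly.

    One possible reading of the claimed accuracy measure, assumed here.
    """
    if not reference:
        return 100.0
    return 100.0 * len(predicted & reference) / len(reference)


def synthesize(
    redacted: str,
    labels: List[Label],
    generators: Dict[str, Callable[[], str]],
) -> str:
    """Insert synthetic data at each label's location in the redacted text,
    choosing the generator by the label's type."""
    parts: List[str] = []
    cursor = 0
    for lab in sorted(labels, key=lambda l: l.start):
        parts.append(redacted[cursor:lab.start])
        parts.append(generators[lab.kind]())
        cursor = lab.end
    parts.append(redacted[cursor:])
    return "".join(parts)


# Example with made-up data.
text = "Call Jane Doe at 555-0100."
user_labels = [Label(5, 13, "NAME"), Label(17, 25, "PHONE")]

redacted_text, redacted_labels = redact(text, user_labels)
synthetic_text = synthesize(
    redacted_text,
    redacted_labels,
    {"NAME": lambda: "Alex Roe", "PHONE": lambda: "555-0199"},
)
agreement = label_agreement({Label(5, 13, "NAME")}, set(user_labels))
ACCURACY_THRESHOLD = 90.0  # hypothetical threshold, in percent
model_is_accurate = agreement >= ACCURACY_THRESHOLD
```

In this sketch, each label is relocated to its placeholder during redaction so that the later insertion step can find the redacted location even though the placeholder and the original value differ in length.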