US 12,314,441 B1
Privacy preservation within datasets
Syed Kashif Hussain Shah, Santa Clara, CA (US); Kalpit Dixit, Mountain View, CA (US); Yuchen Tian, Mountain View, CA (US); Jie Ma, Seattle, WA (US); and Yaser Al-Onaizan, Cortlandt Manor, NY (US)
Assigned to Amazon Technologies, Inc., Reno, NV (US)
Filed by Amazon Technologies, Inc., Reno, NV (US)
Filed on Sep. 13, 2021, as Appl. No. 17/473,471.
Int. Cl. G06F 40/40 (2020.01); G06F 16/93 (2019.01); G06F 21/62 (2013.01); G06N 20/00 (2019.01)

CPC G06F 21/6254 (2013.01) [G06F 16/93 (2019.01); G06N 20/00 (2019.01)]

18 Claims

1. A computer-implemented method, comprising:

determining a document from a plurality of documents includes personally identifiable information (PII);

determining, within the document, two or more spans corresponding to the PII, the two or more spans including metadata corresponding to an individual span type;

replacing a first span of the two or more spans with one or more first replacement values, the one or more first replacement values corresponding to the individual span type associated with the first span of the two or more spans;

replacing a second span of the two or more spans with one or more second replacement values, different from the one or more first replacement values, the one or more second replacement values corresponding to the individual span type associated with the second span of the two or more spans;

generating a dataset, the dataset including at least the document and at least one other document from the plurality of documents;

determining the dataset includes one or more biases corresponding to one or more elements that are over-represented or under-represented within the dataset, based at least in part on one or more dataset properties; and

modifying the dataset, based at least in part on the one or more dataset properties.
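
For illustration only, the steps recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described in the patent: the `Span` class, the `REPLACEMENTS` pools, the per-document labels used as the "dataset property", the `max_share` threshold, and the downsampling strategy are all hypothetical choices made for the example. PII detection itself is assumed to have already produced the spans.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """A detected PII span with metadata naming its span type (hypothetical)."""
    start: int
    end: int
    span_type: str  # e.g. "NAME" or "EMAIL"


# Hypothetical replacement pools, keyed by span type.
REPLACEMENTS = {
    "NAME": ["Alex Doe", "Sam Roe", "Pat Poe"],
    "EMAIL": ["user1@example.com", "user2@example.com", "user3@example.com"],
}


def replace_spans(text, spans, rng):
    """Replace each span with a value of the same span type.

    Values already used in this document are excluded, so two spans never
    receive the same replacement. Assumes each pool holds at least as many
    values as there are spans of that type.
    """
    used = set()
    out = text
    # Work right to left so earlier offsets stay valid after each substitution.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        choices = [v for v in REPLACEMENTS[span.span_type] if v not in used]
        value = rng.choice(choices)
        used.add(value)
        out = out[:span.start] + value + out[span.end:]
    return out


def find_over_represented(labels, max_share):
    """Return labels whose share of the dataset exceeds max_share."""
    counts = Counter(labels)
    total = len(labels)
    return {label for label, n in counts.items() if n / total > max_share}


def balance(docs, labels):
    """Downsample so every label keeps as many documents as the rarest label.

    This is one possible way to modify a dataset based on its properties;
    the first documents encountered for each label are the ones kept.
    """
    if not labels:
        return []
    cap = min(Counter(labels).values())
    kept, seen = [], Counter()
    for doc, label in zip(docs, labels):
        if seen[label] < cap:
            kept.append((doc, label))
            seen[label] += 1
    return kept
```

In use, `replace_spans` would be applied to each document found to contain PII, the anonymized documents would be gathered with other documents into a dataset, and `find_over_represented` and `balance` would then be applied to a chosen property of that dataset, such as a language or topic label per document.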