CPC G06F 21/6254 (2013.01) [G06N 20/00 (2019.01); G16H 10/60 (2018.01)] | 19 Claims |
1. A de-identification method comprising:
receiving a plurality of data sets, wherein the plurality of data sets comprises:
a first data set, wherein the first data set comprises a labeled data set for one or more entity types; and
a second data set, wherein the second data set comprises an unlabeled data set for the one or more entity types;
determining one machine-learning model from a plurality of machine-learning models for each of the one or more entity types;
fine-tuning the determined machine-learning model for each of the one or more entity types, wherein fine-tuning the determined machine-learning model comprises:
creating a plurality of training data sets, wherein the plurality of training data sets comprises:
a first training data set, wherein the first training data set comprises the first data set; and
a second training data set, wherein the second training data set comprises the second data set;
training the determined machine-learning model using the first training data set;
validating the trained machine-learning model, wherein validating the trained machine-learning model further comprises:
generating a recall score for each entity type of the one or more entity types;
comparing the recall score to a threshold for the recall score for each entity type of the one or more entity types; and
updating the trained machine-learning model using the second training data set as a function of the validation; and
obfuscating the second data set using the fine-tuned machine-learning model.
|
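The validation step recited in claim 1 (generating a recall score per entity type and comparing it to a threshold) can be illustrated with a minimal sketch. The claim does not say how recall is computed, what the threshold values are, or how the update depends on the comparison, so everything below is an assumption made for illustration: spans are matched exactly, the function names `recall_per_entity_type` and `entity_types_below_threshold` are hypothetical, the threshold of 0.95 is arbitrary, and an entity type is flagged for updating when its recall falls below its threshold.

```python
from collections import defaultdict


def recall_per_entity_type(gold_spans, predicted_spans):
    """Return {entity_type: recall} using exact span matching (an assumption).

    Each span is a tuple (doc_id, start, end, entity_type).
    Recall = labeled spans the model found / all labeled spans of that type.
    """
    predicted = set(predicted_spans)
    found = defaultdict(int)
    total = defaultdict(int)
    for span in set(gold_spans):
        entity_type = span[3]
        total[entity_type] += 1
        if span in predicted:
            found[entity_type] += 1
    return {t: found[t] / total[t] for t in total}


def entity_types_below_threshold(recall, thresholds):
    """Return the entity types whose recall is below their threshold."""
    return sorted(t for t, r in recall.items() if r < thresholds[t])


# Hypothetical labeled validation spans and model predictions.
gold = [
    ("doc1", 0, 8, "NAME"),
    ("doc1", 20, 30, "DATE"),
    ("doc2", 5, 13, "NAME"),
    ("doc2", 40, 52, "PHONE"),
]
pred = [
    ("doc1", 0, 8, "NAME"),
    ("doc1", 20, 30, "DATE"),
    ("doc2", 40, 52, "PHONE"),
]

# Hypothetical per-entity-type recall thresholds.
thresholds = {"NAME": 0.95, "DATE": 0.95, "PHONE": 0.95}

recall = recall_per_entity_type(gold, pred)
needs_update = entity_types_below_threshold(recall, thresholds)
print(recall)
print(needs_update)
```

In this sketch, the entity types returned by `entity_types_below_threshold` would be the ones whose model is updated with the second (unlabeled) training data set. Recall is the natural metric to gate on in de-identification because a missed entity leaves identifying text in the output, whereas a false positive only over-redacts.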