| CPC G06F 18/2148 (2023.01) [G06F 21/6254 (2013.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01); G06N 3/02 (2013.01)] | 18 Claims |

|
1. A computer-implemented method, comprising:
obtaining an information document corresponding to an entity, wherein the information document comprises redacted information spans which redact sensitive or personal information in the information document;
identifying an entity type for each of the redacted information spans, the entity type related to an entire category of the information span, the entity type being identified by generating, using a neural network, at least three vectors for a given one of the redacted information spans: one vector representing a left context window, one vector representing an information span, and one vector representing a right context window, the three vectors represented as a token and classified based upon context of the token;
replacing the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information in context with other replacement entities within the information document and wherein the replacing comprises maintaining relationships of the redacted information spans through the replacement entities by utilizing at least one language model to predict entity masks which maintain a context of the information document and relationships between information spans and introducing constraints on output of the language model to conform to the frequency distribution; and
generating, from at least the information document having the replaced redacted information spans, a training dataset used to train a machine-learning model.
|
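
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and everything in it is an assumption made for illustration: the `[REDACTED]` marker, the window width, the keyword cues, the function names, and the frequency table are all hypothetical. A keyword lookup stands in for the claimed neural-network classifier, and weighted random sampling stands in for a language model whose output is constrained to a frequency distribution. The sketch does not model the claimed preservation of relationships between spans.

```python
import random

MASK = "[REDACTED]"  # hypothetical redaction marker; the claim does not specify one


def context_windows(tokens, index, width=3):
    """Split the text around tokens[index] into (left, span, right) windows."""
    left = tokens[max(0, index - width):index]
    right = tokens[index + 1:index + 1 + width]
    return left, tokens[index], right


def classify_entity_type(left):
    """Toy keyword stand-in for the claimed neural-network classifier."""
    cues = {"mr.": "NAME", "dr.": "NAME", "in": "CITY", "from": "CITY"}
    for tok in reversed(left):  # the nearest cue to the span wins
        if tok.lower() in cues:
            return cues[tok.lower()]
    return "OTHER"


def sample_replacement(entity_type, frequencies, rng):
    """Draw a surrogate in proportion to an assumed frequency table."""
    table = frequencies[entity_type]
    values = list(table)
    weights = [table[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def fill_redactions(text, frequencies, seed=0):
    """Replace each MASK token with a surrogate of the inferred entity type."""
    rng = random.Random(seed)
    tokens = text.split()
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        left, _span, _right = context_windows(tokens, i)
        entity_type = classify_entity_type(left)
        if entity_type in frequencies:  # unknown types are left redacted
            out[i] = sample_replacement(entity_type, frequencies, rng)
    return " ".join(out)


# Hypothetical frequency table (illustrative values only).
FREQUENCIES = {
    "NAME": {"Smith": 3, "Jones": 1},
    "CITY": {"Springfield": 2, "Riverton": 1},
}

document = "Dr. [REDACTED] moved from [REDACTED] last year"
filled = fill_redactions(document, FREQUENCIES)
training_dataset = [filled]  # the de-identified text becomes a training example
print(training_dataset)
```

Sampling with weights proportional to observed counts is one simple way to keep surrogate values close to a target frequency distribution. A system closer to the claim would instead restrict or re-weight the language model's candidate outputs.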