CPC H04L 9/3213 (2013.01) [G06F 16/355 (2019.01); G06F 16/907 (2019.01); G06F 16/93 (2019.01); G06F 40/284 (2020.01)] | 20 Claims |
1. A computer-implemented method for protecting sensitive information in documents, the method comprising:
providing, in a computer database, an inverted text index for a set of documents;
evaluating via a processor one or more statistical measures of index tokens, respectively, of the inverted text index, the one or more statistical measures of the respective index token comprising at least one member selected from a group consisting of: one or more of a number of documents of the set of documents containing the index token, a frequency of occurrence of the index token in the set of documents, and a frequency of occurrence of a token type of the index token in the set of documents;
selecting, based on the evaluation of the one or more statistical measures and via the processor, a set of candidate tokens that may contain sensitive information, the selecting the set of candidate tokens comprising comparing the one or more statistical measures with a respective predefined threshold;
extracting, via the processor, metadata from the inverted text index descriptive of the candidate tokens, respectively, wherein the extracted metadata comprises at least a token type of the index tokens, respectively, and a document identifier of a respective document containing a respective index token;
receiving, via the processor, a request for at least one document;
tokenizing, via the processor, the requested at least one document, resulting in document tokens;
comparing, via the processor, the document tokens with the set of candidate tokens;
selecting, via the processor, a set of document tokens to be masked based on the comparison;
selecting, via the processor, at least part of the set of document tokens that comprises sensitive information according to the extracted metadata;
masking, via the processor, the at least part of the set of document tokens in the at least one document, resulting in one or more masked documents; and
providing, via the processor, the one or more masked documents.
|
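The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes an in-memory dictionary as the inverted text index, a document-count threshold as the single statistical measure, and a simple shape-based token typing. All names (`build_index`, `select_candidates`, `mask_document`, `max_doc_count`, `sensitive_types`) and the `[MASKED]` placeholder are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\S+")
PUNCT = ".,;:!?()"


def token_type(token):
    """Hypothetical token typing based on character shape (an assumption)."""
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[A-Za-z]+", token):
        return "email"
    if re.fullmatch(r"\d+", token):
        return "number"
    return "word"


def tokenize(text):
    """Split on whitespace, trim surrounding punctuation, lowercase."""
    tokens = (raw.strip(PUNCT).lower() for raw in TOKEN_RE.findall(text))
    return [t for t in tokens if t]


def build_index(docs):
    """Inverted index: token -> token type and per-document occurrence counts."""
    index = {}
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            entry = index.setdefault(
                tok, {"type": token_type(tok), "postings": defaultdict(int)}
            )
            entry["postings"][doc_id] += 1
    return index


def select_candidates(index, max_doc_count=1):
    """Compare the document count of each index token with a threshold.

    Tokens found in at most `max_doc_count` documents become candidates, and
    their metadata (token type, document identifiers) is extracted.
    """
    return {
        tok: {"type": e["type"], "doc_ids": set(e["postings"])}
        for tok, e in index.items()
        if len(e["postings"]) <= max_doc_count
    }


def mask_document(doc_id, text, candidates, sensitive_types=("email", "number")):
    """Mask tokens that match a candidate and whose metadata marks them sensitive."""

    def replace(match):
        raw = match.group(0)
        core = raw.strip(PUNCT)
        meta = candidates.get(core.lower())  # comparison with the candidate set
        if meta and meta["type"] in sensitive_types and doc_id in meta["doc_ids"]:
            return raw.replace(core, "[MASKED]")
        return raw

    return TOKEN_RE.sub(replace, text)


docs = {
    "d1": "Contact alice@example.com about invoice 12345.",
    "d2": "Contact the team about the invoice.",
}
index = build_index(docs)
candidates = select_candidates(index, max_doc_count=1)
masked_d1 = mask_document("d1", docs["d1"], candidates)
masked_d2 = mask_document("d2", docs["d2"], candidates)
print(masked_d1)  # Contact [MASKED] about invoice [MASKED].
print(masked_d2)  # Contact the team about the invoice.
```

In this sketch, rare tokens such as "team" are selected as candidates by the threshold comparison but are not masked, because the extracted token type does not mark them as sensitive. This mirrors the two selection stages in the claim: matching against the candidate set first, then narrowing by metadata.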