| CPC G06N 3/08 (2013.01) [G06F 16/93 (2019.01); G06N 3/04 (2013.01)] | 20 Claims |

|
1. A computer-implemented method, the method comprising:
applying one or more rules to identify one or more structural elements of a document, wherein the one or more structural elements are indicative of at least one of an organization of text within the document and a visual presentation of text within the document;
determining, based at least in part on the one or more structural elements, one or more pairs of words within the document having a hypernym relationship;
extracting de-identified content within the document based on one or more de-identification techniques applied to the document;
applying a set of causal rules to the de-identified content and the one or more pairs of words by computing mutual dependence and/or mutual independence between the de-identified content and the one or more pairs of words;
annotating, based at least in part on the computed mutual dependence and/or mutual independence, at least a portion of the de-identified content as belonging to a class of protected content; and
training a machine learning model on a set of training data to automatically identify at least a portion of the de-identification techniques in at least one other document, wherein the set of training data comprises the annotated portion of the de-identified content;
wherein the method is carried out by at least one computing device.
|
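The step of computing mutual dependence between the de-identified content and the hypernym word pairs can be illustrated with a small sketch. The claim does not specify a particular measure, so this example assumes empirical mutual information over per-sentence binary indicators. The names `mutual_information` and `annotate_protected`, the `[NAME]` placeholder, and the `threshold` value are all hypothetical choices made for illustration, not taken from the claim.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, between two equal-length
    sequences of discrete values (one assumed reading of 'mutual dependence')."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def annotate_protected(sentences, placeholder, pair_words, threshold=0.1):
    """Return (mi, is_protected) for a de-identification placeholder.

    Hypothetical rule: the placeholder is annotated as protected content
    when its presence in a sentence is mutually dependent (MI >= threshold)
    on the presence of any word from the hypernym pairs.
    """
    has_placeholder = [int(placeholder in s) for s in sentences]
    has_pair_word = [
        int(any(w in s.lower().split() for w in pair_words)) for s in sentences
    ]
    mi = mutual_information(has_placeholder, has_pair_word)
    return mi, mi >= threshold


# Illustrative input only; not data from the patent.
sentences = [
    "The patient [NAME] was admitted",
    "The patient [NAME] was discharged",
    "Weather was mild",
    "Lunch was served",
]
mi, is_protected = annotate_protected(sentences, "[NAME]", {"patient", "person"})
```

In this toy input the placeholder and the hypernym-pair word co-occur perfectly, so the two indicators are fully dependent. A mutual information near zero would instead correspond to the "mutual independence" branch of the claim. The annotated spans would then form part of the training data in the final step.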