US 12,412,087 B2
Classifying data from de-identified content
Aswin Kannan, Chennai (IN); Balaji Ganesan, Bengaluru (IN); and Shanmukha Chaitanya Guttula, Vijayawada (IN)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Apr. 23, 2021, as Appl. No. 17/238,567.
Prior Publication US 2022/0343151 A1, Oct. 27, 2022
Int. Cl. G06N 3/08 (2023.01); G06F 16/93 (2019.01); G06N 3/04 (2023.01)

CPC G06N 3/08 (2013.01) [G06F 16/93 (2019.01); G06N 3/04 (2013.01)]

20 Claims

1. A computer-implemented method, the method comprising:

applying one or more rules to identify one or more structural elements of a document, wherein the one or more structural elements are indicative of at least one of an organization of text within the document and a visual presentation of text within the document;

determining, based at least in part on the one or more structural elements, one or more pairs of words within the document having a hypernym relationship;

extracting de-identified content within the document based on one or more de-identification techniques applied to the document;

applying a set of causal rules to the de-identified content and the one or more pairs of words by computing mutual dependence and/or mutual independence between the de-identified content and the one or more pairs of words;

annotating, based at least in part on the computed mutual dependence and/or mutual independence, at least a portion of the de-identified content as belonging to a class of protected content; and

training a machine learning model on a set of training data to automatically identify at least a portion of the de-identification techniques in at least one other document, wherein the set of training data comprises the annotated portion of the de-identified content;

wherein the method is carried out by at least one computing device.
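
The dependence and annotation steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not say how mutual dependence is computed, so the use of mutual information over sentence-level co-occurrence is an assumption. The bracketed placeholder format, the 0.1 threshold, the helper names mutual_information and annotate, and the use of the hypernym as the protected-content class are likewise hypothetical choices made only for illustration. The hypernym pairs are taken as given input rather than derived from structural elements.

```python
import math
import re
from collections import Counter

# Assumption: de-identified content appears as bracketed placeholders, e.g. [MASK_A].
MASK = re.compile(r"\[[A-Z0-9_]+\]")


def mutual_information(pairs):
    """Mutual information, in bits, between two discrete variables
    observed jointly as a list of (x, y) tuples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def annotate(sentences, hypernym_pairs, threshold=0.1):
    """For each placeholder, pick the (hypernym, hyponym) pair whose hyponym
    is most strongly dependent on the placeholder's presence, and use the
    hypernym as the class label (hypothetical labeling rule)."""
    labels = {}
    masks = sorted({m for s in sentences for m in MASK.findall(s)})
    for mask in masks:
        best_label, best_mi = None, threshold
        for hypernym, hyponym in hypernym_pairs:
            # One observation per sentence: (placeholder present, hyponym present).
            obs = [(mask in s, hyponym in s.lower()) for s in sentences]
            mi = mutual_information(obs)
            if mi >= best_mi:
                best_label, best_mi = hypernym, mi
        if best_label is not None:
            labels[mask] = best_label
    return labels


# Toy input, invented for this sketch.
sentences = [
    "Patient name: [MASK_A]",
    "Patient name: [MASK_A] was admitted",
    "Diagnosis: [MASK_B]",
    "The ward was quiet",
]
hypernym_pairs = [
    ("personal information", "name"),
    ("health information", "diagnosis"),
]
print(annotate(sentences, hypernym_pairs))
```

Mutual information is symmetric and does not distinguish positive from negative association, so the sketch keeps only the strongest dependence per placeholder. The resulting annotations could then serve as training labels for the model recited in the final step, which is not shown here.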