US 12,367,229 B2
System and method for integrating machine learning in data leakage detection solution through keyword policy prediction
Ahmad F. Sirhani, Dammam (SA); Abdullah K. Madani, Dhahran (SA); and Abdulrahman M. Alomar, Al Hasa (SA)
Assigned to SAUDI ARABIAN OIL COMPANY, Dhahran (SA)
Filed by SAUDI ARABIAN OIL COMPANY, Dhahran (SA)
Filed on May 25, 2022, as Appl. No. 17/804,055.
Prior Publication US 2023/0385407 A1, Nov. 30, 2023
Int. Cl. G06F 16/35 (2025.01); G06F 16/93 (2019.01); G06F 21/55 (2013.01)
CPC G06F 16/35 (2019.01) [G06F 16/93 (2019.01); G06F 21/554 (2013.01)]
17 Claims
 
1. A method, comprising:
receiving, by a data leakage prevention (DLP) system comprising a computer processor, a machine learning (ML) system, a memory, and a data fetcher, a corpus of labelled documents using the data fetcher in the DLP system from a SQL-based repository and a plurality of filters comprising an organization filter,
wherein the corpus of labelled documents comprises at least one email, at least one spreadsheet, and at least one binary file,
wherein the data fetcher comprises an interface connected to the SQL-based repository that obtains the corpus of labelled documents based on one or more SQL commands comprising one or more queries based on the plurality of filters, and
wherein the ML system comprises a parser, a vectorizer, and a machine-learned model;
identifying, automatically by the DLP system, a plurality of parsed words in the corpus of labelled documents using the parser in the ML system;
vectorizing, by the DLP system, the corpus of labelled documents using the vectorizer in the ML system, comprising:
generating a matrix, wherein each of the rows of the matrix corresponds to a labelled document and each of the columns of the matrix corresponds to a parsed word from the plurality of parsed words,
determining, for each labelled document and using a Natural Language Processing (NLP) technique, a numerical value for each parsed word in the plurality of parsed words forming a vectorized document for that labelled document, and
populating the matrix with the numerical value of each parsed word in the plurality of parsed words for each labelled document;
training, by the ML system in the DLP system, the machine-learned model comprising:
predicting a document class for at least one vectorized document of the matrix,
comparing the predicted document class to a corresponding label of the at least one vectorized document, and
updating or determining one or more parameters of the machine-learned model based on the comparison,
wherein the trained machine-learned model accepts, as input, a vectorized document and outputs at least one predicted document class from a plurality of document classes comprising a sensitive document class and a non-sensitive document class;
extracting a plurality of word importances from the trained machine-learned model based on the one or more parameters, wherein the plurality of word importances comprises a word importance for each parsed word of the plurality of parsed words;
determining, automatically by the DLP system, a keyword-based policy based on the plurality of word importances and a portion of the plurality of parsed words that satisfy a criterion;
obtaining, by the DLP system, a new document; and
automatically classifying, by the DLP system, the new document as a sensitive document based on the keyword-based policy.
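
The steps recited in claim 1 (parse, vectorize, train, extract word importances, derive a keyword-based policy, classify a new document) can be illustrated with a minimal sketch. It is not the patented implementation. The claim names only "a Natural Language Processing (NLP) technique", "a machine-learned model", and "a criterion"; the sketch assumes TF-IDF weighting, a logistic-regression classifier trained by gradient descent, and a "top-k words with positive weight" criterion. The toy corpus and all function names (`parse`, `vectorize`, `train`, `keyword_policy`, `is_sensitive`) are hypothetical, and the SQL-based data fetcher is omitted.

```python
import math
import re
from collections import Counter


def parse(text):
    """Parser (assumed): split a document into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(docs):
    """Build a matrix with one row per document and one column per parsed word.

    Cell values are TF-IDF scores (an assumed choice of NLP technique).
    """
    tokenized = [parse(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + doc_freq[w])) + 1.0 for w in vocab}
    matrix = []
    for toks in tokenized:
        tf = Counter(toks)
        matrix.append([tf[w] / len(toks) * idf[w] if toks else 0.0 for w in vocab])
    return vocab, matrix


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(matrix, labels, epochs=500, lr=0.5):
    """Predict a class, compare it to the label, update the parameters."""
    weights = [0.0] * len(matrix[0])
    bias = 0.0
    for _ in range(epochs):
        for row, label in zip(matrix, labels):
            predicted = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            error = predicted - label  # comparison of prediction and label
            weights = [w - lr * error * x for w, x in zip(weights, row)]
            bias -= lr * error
    return weights, bias


def keyword_policy(vocab, weights, top_k=3):
    """Use model weights as word importances and keep the words that satisfy
    the criterion (assumed here: top_k largest positive weights)."""
    ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
    return {word for word, importance in ranked[:top_k] if importance > 0}


def is_sensitive(policy, text):
    """Classify a new document as sensitive if it contains any policy keyword."""
    return any(token in policy for token in parse(text))


# Hypothetical labelled corpus: 1 = sensitive, 0 = non-sensitive.
corpus = [
    ("confidential drilling budget forecast", 1),
    ("confidential reservoir survey results", 1),
    ("confidential merger budget memo", 1),
    ("cafeteria lunch menu update", 0),
    ("parking lot maintenance notice", 0),
    ("holiday party lunch notice", 0),
]
docs = [text for text, _ in corpus]
labels = [label for _, label in corpus]

vocab, matrix = vectorize(docs)
weights, bias = train(matrix, labels)
policy = keyword_policy(vocab, weights)

print(sorted(policy))
print(is_sensitive(policy, "confidential pipeline report"))
print(is_sensitive(policy, "lunch menu for friday"))
```

In this sketch the trained model is used only to rank words; the new document is then classified by keyword matching against the derived policy, which mirrors the claim's final step of classifying "based on the keyword-based policy" rather than by running the model on each new document.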