US 11,947,682 B2
ML-based encrypted file classification for identifying encrypted data movement
Yi Zhang, Santa Clara, CA (US); Siying Yang, Saratoga, CA (US); Yihua Liao, Fremont, CA (US); Dagmawi Mulugeta, London (GB); Raymond Joseph Canzanese, Jr., Philadelphia, PA (US); and Ari Azarafrooz, Rancho Santa Margarita, CA (US)
Assigned to Netskope, Inc., Santa Clara, CA (US)
Filed by Netskope, Inc., Santa Clara, CA (US)
Filed on Jul. 7, 2022, as Appl. No. 17/860,037.
Prior Publication US 2024/0012912 A1, Jan. 11, 2024
Int. Cl. G06F 21/60 (2013.01); G06F 9/54 (2006.01); H04L 41/16 (2022.01)
CPC G06F 21/602 (2013.01) [G06F 9/547 (2013.01); H04L 41/16 (2013.01)]
20 Claims
 
1. A computer-implemented method of detecting exfiltration designed to defeat data loss protection (DLP) by encryption before evaluation, including:
intercepting, by a network security system server interposed on a network between a cloud-based application and a user endpoint, movement of a plurality of files by a user over the network to the cloud-based application, wherein the network security system server monitors traffic on the network associated with the user endpoint of the user;
detecting, by the network security system server, file encryption for each file of the plurality of files using a trained machine learning (ML) classifier, wherein the detecting comprises:
for each file of the plurality of files:
determining a file type of the respective file,
calculating two or more metrics for the respective file, the two or more metrics selected from:
a chi-square metric based on a chi-square randomness test that measures a degree to which a distribution of sampled bytes varies from an expected distribution of bytes from the respective file;
an arithmetic mean metric based on an arithmetic mean test that compares an arithmetic mean of the sampled bytes to an expected mean of the bytes from the respective file;
a serial correlation coefficient metric based on a serial correlation coefficient test that calculates a serial correlation coefficient between pairs of successive sampled bytes from the respective file;
a Monte Carlo-Pi metric based on a Monte Carlo-Pi test that maps concatenated bytes as coordinates of a square and calculates a degree to which a proportion of the mapped concatenated bytes that fall within a circle circumscribed by the square varies from an expected proportion that corresponds to mapping from the respective file; and
an entropy metric based on a Shannon entropy test of randomness of the respective file,
providing the two or more metrics and the file type as input to the trained ML classifier trained to classify the respective file as encrypted or unencrypted based on the two or more metrics and the file type, and
receiving a classification of the respective file from the trained ML classifier based on the input;
counting, by the network security system server, a number of the plurality of files classified as encrypted and moved by the user during a predetermined period of time;
determining, by the network security system server, a predetermined maximum number of encrypted files the user is allowed to move during the predetermined period of time;
comparing, by the network security system server, the predetermined maximum number for the user to the number counted;
detecting, by the network security system server, based on the comparing, that the user has moved more encrypted files than the predetermined maximum number; and
generating, by the network security system server, an alert that the user has moved more than the predetermined maximum number of encrypted files allowed to be moved.
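
Illustrative sketch (not part of the claim): the five randomness metrics and the per-user counting step recited in claim 1 could be computed roughly as follows. Everything named here is an assumption made for illustration: the function names `randomness_metrics`, `count_encrypted`, and `exceeds_limit`, the `(user, timestamp, encrypted)` event tuple, the uniform expected byte distribution, the wrap-around lag-1 correlation, and the 24-bit quarter-circle variant of the Monte Carlo-Pi test. The trained ML classifier and the file-type determination are out of scope and are not shown.

```python
import math
from collections import Counter


def randomness_metrics(data: bytes) -> dict:
    """Compute five randomness metrics over a byte sample (needs >= 6 bytes)."""
    n = len(data)
    if n < 6:
        raise ValueError("need at least 6 bytes for the Monte Carlo-Pi test")
    counts = Counter(data)

    # Chi-square: deviation of observed byte counts from a uniform expectation
    # (assumption: each of the 256 byte values is expected n/256 times).
    expected = n / 256
    chi_square = sum((counts.get(b, 0) - expected) ** 2 / expected
                     for b in range(256))

    # Arithmetic mean of the bytes; about 127.5 for uniformly random data.
    mean = sum(data) / n

    # Shannon entropy in bits per byte; at most 8.0.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Serial correlation between each byte and its successor (wrapping around).
    sx = sum(data)
    sxx = sum(x * x for x in data)
    sxy = sum(data[i] * data[(i + 1) % n] for i in range(n))
    den = n * sxx - sx * sx
    # Undefined for constant data; this sketch reports 1.0 by convention.
    serial = (n * sxy - sx * sx) / den if den else 1.0

    # Monte Carlo-Pi: each 6-byte group gives a 24-bit (x, y) point; the share
    # of points inside the quarter circle approximates pi / 4.
    radius_sq = (256 ** 3 - 1) ** 2
    inside = total = 0
    for i in range(0, n - 5, 6):
        x = int.from_bytes(data[i:i + 3], "big")
        y = int.from_bytes(data[i + 3:i + 6], "big")
        total += 1
        if x * x + y * y <= radius_sq:
            inside += 1
    pi_error = abs(4 * inside / total - math.pi)

    return {
        "chi_square": chi_square,
        "mean": mean,
        "entropy": entropy,
        "serial_correlation": serial,
        "monte_carlo_pi_error": pi_error,
    }


def count_encrypted(events, user, start, end):
    """Count files classified as encrypted that `user` moved in [start, end)."""
    return sum(1 for u, ts, encrypted in events
               if u == user and encrypted and start <= ts < end)


def exceeds_limit(count, max_allowed):
    """True when the count is strictly greater than the allowed maximum."""
    return count > max_allowed
```

In this sketch, two or more of the returned values, together with the file type, would form the classifier input, and an alert would be raised when `exceeds_limit(count_encrypted(...), max_allowed)` is true for the user's configured maximum.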