CPC G06F 21/563 (2013.01) [G06N 3/08 (2013.01); G06F 2221/033 (2013.01)] | 1 Claim |
1. A system comprising:
one or more processors in communication with a computing node of a computing platform, the one or more processors configured to:
receive a script from the computing node;
tokenize the script into a plurality of tokens;
generate an output classification of the script as benign or malicious to the computing platform, using a classification machine learning model trained to classify scripts; and
send the output classification to the computing node,
wherein the one or more processors are further configured to train the classification machine learning model, wherein in training the classification machine learning model, the one or more processors are configured to:
receive script training data comprising a plurality of training scripts;
tokenize the plurality of training scripts to generate a plurality of tokens;
for each training script, generate a respective integer vector of the training script comprising a plurality of elements, each element corresponding to a number of occurrences of respective one or more tokens or hashed tokens in the training script;
generate a map between the generated integer vectors and a plurality of binary vectors, each binary vector comprising a plurality of binary elements, the map based on projecting the generated integer vectors into a dimension lower than a dimension for the generated integer vectors, while preserving respective similarity between the generated integer vectors when the generated integer vectors are compared using a distance function;
generate one or more groups of connected binary vectors;
generate, using the one or more groups of connected binary vectors, a training set, a validation set, and a testing set of training scripts, wherein for each group of binary vectors, training scripts corresponding to the group are either all in the training set, all in the validation set, or all in the testing set; and
train the classification machine learning model to classify scripts using the generated training, validation, and testing sets, and
wherein in generating the training set, validation set, and the testing set, the one or more processors are configured to split:
approximately 60 percent of the script training data to the training set,
approximately 20 percent of the script training data to the validation set, and
approximately 20 percent of the script training data to the testing set.
|
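The training-data preparation recited in the claim (hashed token counts, a similarity-preserving map to lower-dimensional binary vectors, groups of connected binary vectors, and a group-wise split of approximately 60/20/20) can be illustrated with a minimal Python sketch. The claim does not name a hashing scheme, a projection method, a distance function, or a connectivity rule, so everything specific below is an assumption made for illustration only: the regex tokenizer, the MD5-based bucket hashing, random-hyperplane projection (which approximately preserves cosine similarity), a Hamming-distance threshold for "connected", the greedy split, and all names and constants (`NUM_BUCKETS`, `NUM_BITS`, `max_hamming`, and every function name).

```python
import hashlib
import random
import re
from collections import defaultdict

NUM_BUCKETS = 64  # assumed dimension of the integer count vectors
NUM_BITS = 16     # assumed (lower) dimension of the binary vectors


def tokenize(script):
    """Split a script into word tokens and single punctuation tokens (assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", script)


def count_vector(tokens, num_buckets=NUM_BUCKETS):
    """Integer vector: element i counts the tokens whose hash falls in bucket i."""
    vec = [0] * num_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % num_buckets] += 1
    return vec


def make_hyperplanes(num_bits=NUM_BITS, dim=NUM_BUCKETS, seed=0):
    """One random Gaussian hyperplane per output bit (assumed projection method)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def binary_vector(vec, planes):
    """Map an integer vector to bits: the sign of its projection on each hyperplane."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0 for plane in planes
    )


def connected_groups(bin_vecs, max_hamming=1):
    """Group indices whose binary vectors are linked by chains of small Hamming distance."""
    parent = list(range(len(bin_vecs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(bin_vecs)):
        for j in range(i + 1, len(bin_vecs)):
            dist = sum(a != b for a, b in zip(bin_vecs[i], bin_vecs[j]))
            if dist <= max_hamming:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(bin_vecs)):
        groups[find(i)].append(i)
    return list(groups.values())


def split_groups(groups, fractions=(0.6, 0.2, 0.2)):
    """Assign whole groups to train/validation/test, aiming for the target fractions."""
    total = sum(len(g) for g in groups)
    splits = ([], [], [])
    for group in sorted(groups, key=len, reverse=True):
        # Put the group where the gap to the target size is currently largest.
        k = max(range(3), key=lambda s: fractions[s] * total - len(splits[s]))
        splits[k].extend(group)
    return splits


if __name__ == "__main__":
    scripts = ["echo hello", "echo hello", "Get-Process | Out-File a.txt"]
    planes = make_hyperplanes()
    bins = [binary_vector(count_vector(tokenize(s)), planes) for s in scripts]
    train, val, test = split_groups(connected_groups(bins))
    print(len(train), len(val), len(test))
```

Because whole groups are assigned at once, near-duplicate scripts cannot land on both sides of a split, which is the property the claim requires; the trade-off is that the resulting proportions are only approximately 60/20/20, and the pairwise comparison in `connected_groups` is quadratic and would need bucketing for large corpora.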