US 12,242,605 B2
Script classification on computing platform
Merrielle Therese Spain, Kirkland, WA (US); Timothy Dylan Peacock, San Francisco, CA (US); and John Edward Davis, Kirkland, WA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Aug. 4, 2022, as Appl. No. 17/880,793.
Claims priority of provisional application 63/230,455, filed on Aug. 6, 2021.
Prior Publication US 2023/0053322 A1, Feb. 16, 2023
Int. Cl. G06F 21/56 (2013.01); G06N 3/08 (2023.01)

CPC G06F 21/563 (2013.01) [G06N 3/08 (2013.01); G06F 2221/033 (2013.01)]

1 Claim

1. A system comprising:

one or more processors in communication with a computing node of a computing platform, the one or more processors configured to:

receive a script from the computing node;

tokenize the script into a plurality of tokens;

generate an output classification of the script as benign or malicious to the computing platform, using a classification machine learning model trained to classify scripts; and

send the output classification to the computing node,

wherein the one or more processors are further configured to train the classification machine learning model, wherein in training the classification machine learning model, the one or more processors are configured to:

receive script training data comprising a plurality of training scripts;

tokenize the plurality of training scripts to generate a plurality of tokens;

for each training script, generate a respective integer vector of the training script comprising a plurality of elements, each element corresponding to a number of occurrences of a respective one or more tokens or hashed tokens in the training script;

generate a map between the generated integer vectors and a plurality of binary vectors, each binary vector comprising a plurality of binary elements, the map based on projecting the generated vectors into a dimension lower than a dimension for the generated integer vectors, while preserving respective similarity between the generated integer vectors when the generated integer vectors are compared using a distance function;

generate one or more groups of connected binary vectors;

generate, using the one or more groups of connected binary vectors, a training set, a validation set, and a testing set of training scripts, wherein for each group of binary vectors, training scripts corresponding to the group are either all in the training set, all in the validation set, or all in the testing set; and

train the classification machine learning model to classify scripts using the generated training, validation, and testing sets, and

wherein in generating the training set, validation set, and the testing set, the one or more processors are configured to split:

approximately 60 percent of the script training data to the training set,

approximately 20 percent of the script training data to the validation set, and

approximately 20 percent of the script training data to the testing set.
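
The training-data preparation recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every concrete choice in it is an assumption made for illustration: the regex tokenizer, the SHA-256 bucket hashing, the vector sizes (NUM_BUCKETS, NUM_BITS), the use of sign random projection as the similarity-preserving map (which approximately preserves cosine similarity), the Hamming threshold (MAX_HAMMING) used to decide that two binary vectors are connected, and the greedy group assignment. All function names are hypothetical.

```python
import hashlib
import random
import re
from collections import defaultdict

NUM_BUCKETS = 64  # assumed dimension of the hashed-token count vector
NUM_BITS = 16     # assumed (lower) dimension of the binary vector
MAX_HAMMING = 2   # assumed threshold for treating two binary vectors as connected


def tokenize(script):
    # Hypothetical tokenizer: identifiers, numbers, or single non-space characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", script)


def count_vector(tokens, num_buckets=NUM_BUCKETS):
    # Integer vector: element i counts the tokens whose hash lands in bucket i.
    vec = [0] * num_buckets
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:8], "big") % num_buckets] += 1
    return vec


def make_hyperplanes(num_bits=NUM_BITS, dim=NUM_BUCKETS, seed=0):
    # One random Gaussian hyperplane per output bit (seeded for repeatability).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def binary_vector(vec, planes):
    # Sign random projection: each bit records which side of a hyperplane vec lies on.
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0 for plane in planes
    )


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def connected_groups(bits, max_hamming=MAX_HAMMING):
    # Union-find over script indices; near-identical binary vectors share a group.
    parent = list(range(len(bits)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(bits)):
        for j in range(i + 1, len(bits)):
            if hamming(bits[i], bits[j]) <= max_hamming:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(bits)):
        groups[find(i)].append(i)
    return list(groups.values())


def split_by_group(groups, fractions=(0.6, 0.2, 0.2)):
    # Place whole groups, largest first, into the split furthest below its target size.
    total = sum(len(g) for g in groups)
    splits = ([], [], [])
    for group in sorted(groups, key=len, reverse=True):
        deficits = [fractions[k] * total - len(splits[k]) for k in range(3)]
        splits[deficits.index(max(deficits))].extend(group)
    return splits  # (training, validation, testing) lists of script indices
```

Because whole groups are assigned together, the resulting proportions only approximate 60/20/20, consistent with the "approximately" wording of the claim. Keeping each group within a single set means near-duplicate scripts cannot appear in both the training set and the validation or testing set. The pairwise comparison in connected_groups is quadratic in the number of scripts and is used here only for brevity.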