US 12,131,230 B1
Feature equivalence and document abnormality threshold determination
Daniel Scofield, Portland, OR (US); and Craig Miles, Beaverton, OR (US)
Assigned to Assured Information Security, Inc., Rome, NY (US)
Filed by Assured Information Security, Inc., Rome, NY (US)
Filed on Aug. 4, 2020, as Appl. No. 16/984,648.
Claims priority of provisional application 62/964,885, filed on Jan. 23, 2020.
Int. Cl. G06N 20/00 (2019.01); G06F 16/906 (2019.01); G06F 21/50 (2013.01)

CPC G06N 20/00 (2019.01) [G06F 16/906 (2019.01); G06F 21/50 (2013.01)]

20 Claims

1. A computer-implemented method comprising:

selecting a feature merging threshold (α), from a set of candidate α values, the set comprising multiple α values, and the feature merging threshold α being for determining equivalence between two features, wherein the selecting considers all of the multiple α values together in training respective model whitelists for the multiple α values, and wherein the selecting comprises:

partitioning training data into a plurality of groups;

establishing a respective model W_α for each α value of the set of candidate α values, the establishing producing multiple models W_α, each corresponding to an α value of the multiple α values;

iteratively performing, using a training set:

selecting a next group of training data of the plurality of groups of training data;

adding the selected next group of training data to the training set;

for each α value in the set of candidate α values:

training the W_α for the α value using the training set with the added selected next group of training data, wherein the training comprises (i) monitoring, by hooking machine instructions executing on a system, function calls invoked by an application based on the application opening and rendering documents of the training set, and (ii) merging features, determined from the monitoring, according to the α value, to produce the trained W_α for the α value; and

evaluating a size of W_α, the size comprising a number of features included in the model after the training the W_α for the α value using the training set with the added selected next group of training data;

wherein whether to continue the iteratively performing is based at least in part on the evaluated size of every W_α for the set of candidate α values as a result of the training; and

choosing the feature merging threshold α based on the iteratively performing.
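
The selection loop recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: features are plain strings already extracted from monitored function calls (no hooking is performed here), two features count as equivalent when their `difflib.SequenceMatcher` ratio is at least α, iteration stops once no whitelist grows by more than a hypothetical `tolerance` in a round, and the chosen α is the largest candidate whose whitelist was stable in the final round. The names `similar`, `train_whitelist`, and `select_alpha` are hypothetical.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, alpha: float) -> bool:
    """Assumed equivalence test: similarity ratio of the two strings >= alpha."""
    return SequenceMatcher(None, a, b).ratio() >= alpha


def train_whitelist(features, alpha):
    """Build a whitelist W_alpha, merging each feature into an equivalent entry.

    A feature is added only if no existing whitelist entry is equivalent to it
    under the given alpha; otherwise it is treated as merged into that entry.
    """
    whitelist = []
    for feature in features:
        if not any(similar(feature, kept, alpha) for kept in whitelist):
            whitelist.append(feature)
    return whitelist


def select_alpha(groups, candidate_alphas, tolerance=0):
    """Train one whitelist per candidate alpha over a growing training set.

    Returns the chosen alpha and the recorded whitelist sizes per round.
    The stopping rule and the final choice are assumptions of this sketch.
    """
    training_set = []
    sizes = {alpha: [] for alpha in candidate_alphas}

    def stable(history):
        # A model is "stable" if it grew by at most `tolerance` in the last round.
        return len(history) >= 2 and history[-1] - history[-2] <= tolerance

    for group in groups:
        # Add the next group of training data to the training set.
        training_set.extend(group)
        # Retrain every W_alpha on the enlarged set and record its size.
        for alpha in candidate_alphas:
            sizes[alpha].append(len(train_whitelist(training_set, alpha)))
        # Continue only while at least one model is still growing.
        if all(stable(history) for history in sizes.values()):
            break

    stable_alphas = [a for a, history in sizes.items() if stable(history)]
    chosen = max(stable_alphas) if stable_alphas else min(candidate_alphas)
    return chosen, sizes
```

For example, `select_alpha([["abc", "xyz"], ["abc"], ["xyz"]], [0.5, 1.0])` stops after the second group, because neither whitelist grows once the repeated feature is added. All candidate α values are evaluated together in each round, so the decision to continue depends on the size of every W_α rather than on any single model.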