US 12,131,230 B1
Feature equivalence and document abnormality threshold determination
Daniel Scofield, Portland, OR (US); and Craig Miles, Beaverton, OR (US)
Assigned to Assured Information Security, Inc., Rome, NY (US)
Filed by Assured Information Security, Inc., Rome, NY (US)
Filed on Aug. 4, 2020, as Appl. No. 16/984,648.
Claims priority of provisional application 62/964,885, filed on Jan. 23, 2020.
Int. Cl. G06N 20/00 (2019.01); G06F 16/906 (2019.01); G06F 21/50 (2013.01)
CPC G06N 20/00 (2019.01) [G06F 16/906 (2019.01); G06F 21/50 (2013.01)] 20 Claims
1. A computer-implemented method comprising:
selecting a feature merging threshold (α), from a set of candidate α values, the set comprising multiple α values, and the feature merging threshold α being for determining equivalence between two features, wherein the selecting considers all of the multiple α values together in training respective model whitelists for the multiple α values, and wherein the selecting comprises:
partitioning training data into a plurality of groups;
establishing a respective model Wα for each α value of the set of candidate α values, the establishing producing multiple model Wα, each corresponding to an α value of the multiple α values;
iteratively performing, using a training set:
selecting a next group of training data of the plurality of groups of training data;
adding the selected next group of training data to the training set;
for each α value in the set of candidate α values:
training the Wα for the α value using the training set with the added selected next group of training data, wherein the training comprises (i) monitoring, by hooking machine instructions executing on a system, function calls invoked by an application based on the application opening and rendering documents of the training set, and (ii) merging features, determined from the monitoring, according to the α value, to produce the trained Wα for the α value; and
evaluating a size of Wα, the size comprising a number of features included in the model after the training the Wα for the α value using the training set with the added selected next group of training data;
wherein whether to continue the iteratively performing is based at least in part on the evaluated size of every Wα for the set of candidate α values as a result of the training; and
choosing the feature merging threshold α based on the iteratively performing.
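The selection loop recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several parts are assumptions made only for illustration: the hooked function-call monitoring is replaced by precomputed lists of feature strings per document, two features are treated as equivalent when their difflib string similarity is at least α, iteration stops once no Wα grew after the latest group was added, and the chosen α is the strictest candidate whose model had stopped growing. The names equivalent, train_whitelist, and select_alpha are hypothetical.

```python
from difflib import SequenceMatcher


def equivalent(a, b, alpha):
    """Assumed equivalence test: string similarity of at least alpha."""
    return SequenceMatcher(None, a, b).ratio() >= alpha


def train_whitelist(documents, alpha):
    """Merge observed features into a whitelist W_alpha.

    Each document is a list of feature strings standing in for the
    features that monitoring would produce. A feature is kept only if
    no already-kept feature is equivalent to it under alpha.
    """
    whitelist = []
    for features in documents:
        for feature in features:
            if not any(equivalent(feature, kept, alpha) for kept in whitelist):
                whitelist.append(feature)
    return whitelist


def select_alpha(groups, candidate_alphas):
    """Grow the training set group by group, retraining every W_alpha."""
    training_set = []
    sizes = {alpha: [] for alpha in candidate_alphas}
    for group in groups:
        # Add the next group of training data to the training set.
        training_set.extend(group)
        for alpha in candidate_alphas:
            model = train_whitelist(training_set, alpha)
            # The size of W_alpha is its number of features.
            sizes[alpha].append(len(model))
        # Assumed stopping rule: stop once no model grew in this round.
        if all(len(h) > 1 and h[-1] == h[-2] for h in sizes.values()):
            break
    # Assumed choice rule: strictest alpha whose model had stopped growing.
    stable = [a for a, h in sizes.items() if len(h) > 1 and h[-1] == h[-2]]
    chosen = max(stable) if stable else min(candidate_alphas)
    return chosen, sizes


# Toy data: three groups, each holding one document (a list of features).
groups = [[["aaaa", "bbbb"]], [["aaaa"]], [["bbbb"]]]
chosen_alpha, size_history = select_alpha(groups, [0.5, 1.0])
```

In this toy run, both models hold two features after the first group and do not grow after the second, so the loop ends before the third group is used. Tracking every Wα in the same pass mirrors the claim's requirement that all candidate α values be considered together; the stopping and choice rules shown here are only one possible reading of "based on the iteratively performing".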