US 11,853,415 B1
Context-based identification of anomalous log data
Douglas George Wainer, Dublin (IE)
Assigned to Rapid7, Inc., Boston, MA (US)
Filed by Rapid7, Inc., Boston, MA (US)
Filed on Dec. 9, 2020, as Appl. No. 17/116,419.
Claims priority of provisional application 62/947,032, filed on Dec. 12, 2019.
Int. Cl. G06F 16/17 (2019.01); G06F 21/55 (2013.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06N 20/00 (2019.01)
CPC G06F 21/552 (2013.01) [G06F 16/1734 (2019.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G06N 20/00 (2019.01)]
17 Claims
 
1. A computer-implemented method, comprising:
performing, by one or more hardware processors with associated memory that implement a context-based anomalous log data identification system:
receiving log data comprising a plurality of logs;
generating a context associated training dataset, comprising
splitting a string in a log of the plurality of logs into a plurality of split strings,
generating a context association between each of the plurality of split strings and a unique key that corresponds to the log, and
generating an input/output (I/O) string data batch comprising I/O string data for each split string in the log by training each split string against every other split string of the plurality of split strings in the log; and
training a context-based anomalous log data identification model using the I/O string data batch comprising a list of unique strings in the context associated training dataset and according to a machine learning technique, wherein
the training tunes the context-based anomalous log data identification model to classify or cluster a vector associated with a new string in a new log that is not part of the plurality of logs as anomalous,
training the context-based anomalous log data identification model to perform cluster analysis is based on whether an executable that is part of the process information is a good executable that is part of a bad path, and
the good executable and the bad path are pre-identified based at least on a classifier prior to performing the cluster analysis.
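
The dataset-generation steps recited in claim 1 (splitting a log string, associating each split string with a unique key for its log, and pairing each split string with every other split string in the same log) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: the function names, the delimiter set used for splitting, and the hash-based key are all hypothetical, since the claim does not specify them. The model training and cluster analysis steps are not shown.

```python
import hashlib
import re
from itertools import permutations


def split_log(log: str) -> list[str]:
    # Assumption: split on whitespace, backslashes, and forward slashes.
    # The claim only says the string is split; it names no delimiters.
    return [part for part in re.split(r"[\s\\/]+", log) if part]


def unique_key(log: str) -> str:
    # Assumption: a truncated SHA-256 digest serves as the per-log unique key.
    return hashlib.sha256(log.encode("utf-8")).hexdigest()[:12]


def build_context_dataset(logs: list[str]):
    context = {}  # unique key -> split strings that share that log's context
    batch = []    # (input, output) string pairs across all logs
    for log in logs:
        parts = split_log(log)
        context[unique_key(log)] = parts
        # Pair each split string with every other split string in the same log.
        batch.extend(permutations(parts, 2))
    # List of unique strings seen anywhere in the training dataset.
    vocab = sorted({s for parts in context.values() for s in parts})
    return context, batch, vocab


if __name__ == "__main__":
    context, batch, vocab = build_context_dataset(["a b c"])
    print(len(batch), vocab)
```

For a log that splits into n strings, this pairing produces n × (n − 1) ordered input/output pairs, so the batch grows quadratically with the number of split strings per log.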