US 11,899,669 B2
Searching of data structures in pre-processing data for a machine learning classifier
Jonathan Cagan, Pittsburgh, PA (US); Phil LeDuc, Pittsburgh, PA (US); and Mark Whiting, Pittsburgh, PA (US)
Assigned to Carnegie Mellon University, Pittsburgh, PA (US)
Filed by Carnegie Mellon University, Pittsburgh, PA (US)
Filed on Mar. 20, 2018, as Appl. No. 15/926,790.
Claims priority of provisional application 62/601,370, filed on Mar. 20, 2017.
Prior Publication US 2018/0276278 A1, Sep. 27, 2018
Int. Cl. G06N 20/00 (2019.01); G06F 16/2455 (2019.01); G06N 5/046 (2023.01); G06F 16/901 (2019.01); G06F 16/36 (2019.01)
CPC G06F 16/24564 (2019.01) [G06F 16/367 (2019.01); G06F 16/9024 (2019.01); G06N 5/046 (2013.01); G06N 20/00 (2019.01)]
29 Claims
 
1. A data processing system configured to pre-process data for a machine learning classifier, the data processing system comprising:
an input port that receives one or more data items;
a shared memory data store that stores the one or more data items, with each of the one or more data items being written to the shared memory data store; and
at least one processor configured to perform operations comprising:
extracting, from a data item of the one or more data items written to the shared memory data store, a plurality of data signatures and structure data representing relationships among the data signatures, wherein a type of the data signatures is based on a domain of the one or more data items;
generating, based on the type of the data signatures, a data structure from the plurality of data signatures, wherein the data structure includes a plurality of nodes connected with edges, each node in the data structure represents a data signature, and wherein each edge specifies a relationship between a first node and a second node, with the specified relationship corresponding to a relationship represented in the structure data for data signatures represented by those first and second nodes;
selecting a particular data signature of the data structure;
for the particular data signature of the data structure that is selected,
identifying each instance of the particular data signature in the data structure;
segmenting, based on the type of the data signatures, the data structure around instances of the particular data signature; and
identifying, based on the segmenting, one or more sequences of data signatures connected to the particular data signature, each of the one or more sequences being different from one or more other identified sequences of data signatures connected to the particular data signature in the data structure, each of the one or more sequences including a connected set of data signatures in the data structure that are connected to the particular data signature, wherein each connection represents a relationship between a first data signature and a second data signature of the set of data signatures;
generating, based on the type of the data signatures and independent of precoding any sequence of data signatures into a rule, a logical ruleset, wherein each logical rule of the logical ruleset is defined by a sequence of data signatures of the one or more sequences of data signatures that are identified, and wherein a logical rule of the logical ruleset comprises a shape rule specifying one or a sequence of more shapes that are either permitted to be adjacent in the sequence of data signatures, restricted from being adjacent in the sequence of data signatures, or replacing one or more prior shapes of the sequence of data signatures;
executing one or more classifiers against the logical ruleset to classify the one or more data items received by the input port; and
generating, based on the executing, one or more additional logical rules for the logical ruleset, the one or more additional logical rules specifying an additional relationship between at least two shapes of the plurality of shapes.
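Illustrative sketch (not part of the claim): the graph-building, instance-identification, segmenting, and sequence-identification steps recited above can be pictured with a small Python example. Everything in it is an assumption made for illustration: the function names, the use of shape labels as data signatures, the undirected adjacency representation, and the `depth` cutoff used to bound each segment are hypothetical and are not taken from the patent. The classifier execution and the generation of additional rules are not covered.

```python
from collections import defaultdict


def build_graph(signatures, relations):
    """Build an undirected adjacency map.

    signatures: dict mapping node id -> signature label (hypothetical layout)
    relations:  iterable of (node_a, node_b) pairs taken from the structure data
    """
    adjacency = defaultdict(set)
    for a, b in relations:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return {node: sorted(adjacency[node]) for node in signatures}


def find_instances(signatures, target):
    """Return every node whose signature equals the selected signature."""
    return [node for node, label in signatures.items() if label == target]


def sequences_around(signatures, graph, start, depth):
    """Collect label sequences along simple paths of 1..depth edges from start."""
    found = set()

    def walk(path):
        if len(path) > 1:
            found.add(tuple(signatures[n] for n in path))
        if len(path) - 1 == depth:
            return  # segment boundary reached (assumed cutoff)
        for neighbour in graph[path[-1]]:
            if neighbour not in path:  # keep paths simple, no revisits
                walk(path + [neighbour])

    walk([start])
    return found


def derive_ruleset(signatures, relations, target, depth=2):
    """Return the distinct sequences found around all instances of target."""
    graph = build_graph(signatures, relations)
    sequences = set()
    for node in find_instances(signatures, target):
        sequences |= sequences_around(signatures, graph, node, depth)
    return sorted(sequences)


def permitted_adjacencies(rules):
    """Derive the label pairs that were observed adjacent in any sequence."""
    return {(a, b) for seq in rules for a, b in zip(seq, seq[1:])}


# Hypothetical data item: five shapes connected in a chain.
signatures = {0: "circle", 1: "square", 2: "triangle", 3: "circle", 4: "square"}
relations = [(0, 1), (1, 2), (2, 3), (3, 4)]

rules = derive_ruleset(signatures, relations, target="circle", depth=2)
adjacent = permitted_adjacencies(rules)
```

In this sketch each distinct sequence stands in for one logical rule, and `permitted_adjacencies` gives one possible reading of a shape rule that lists shapes permitted to be adjacent. Because the sequences are collected in a set, a sequence found around two different instances of the selected signature is recorded only once.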