US 11,887,061 B1
Systems and methods for determination, description, and use of feature sets for machine learning classification systems, including electronic messaging systems employing machine learning classification
Peter Gallagher McNeil, Leesburg, VA (US)
Assigned to ZIX CORPORATION, Dallas, TX (US)
Filed by Zix Corporation, Dallas, TX (US)
Filed on Jan. 27, 2023, as Appl. No. 18/160,496.
Int. Cl. G06Q 10/107 (2023.01); G06F 15/16 (2006.01); G09B 19/00 (2006.01); G09B 5/02 (2006.01); H04L 51/21 (2022.01)
CPC G06Q 10/107 (2013.01) [H04L 51/21 (2022.05)]
15 Claims
 
1. A method of feature set extraction, comprising:
receiving a corpus of training content;
parsing at least a portion of the training content to generate a plurality of features and a plurality of cross-tabulations for the features, wherein the training content comprises a plurality of training emails and the plurality of features comprises email addresses of the plurality of training emails, the email addresses comprising an email from address, an email to address, an email reply to address, an email cc address and an email bcc address, wherein the email addresses are divided into segments comprising a domain segment, a local segment, and a friendly segment;
generating a value for each of the plurality of features by generating a unique training hash for the domain segment, a unique training hash for the local segment, and a unique training hash for the friendly segment of each of the email addresses in the plurality of training emails, each unique training hash associated with a feature value and a feature structure;
generating a classification value for each of the cross-tabulations of the features;
receiving a corpus of live content;
parsing the live content based on the plurality of features to identify one or more of the features in the live content;
generating a value for each of the identified one or more features in the live content;
generating a classification value for each of the cross-tabulations of the identified one or more features in the live content; and
based on the generated classification value for each of the cross-tabulations of the identified one or more features in the live content, classifying the live content.
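The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify a hash function, a data layout, or how a classification value is computed, so everything below is an assumption made for illustration. In particular, the names `segment_addresses`, `feature_hash`, `extract_features`, `train`, and `classify` are hypothetical, SHA-256 truncated to 16 hex characters stands in for the "unique training hash", emails are modeled as plain dictionaries keyed by header name, and the classification value is assumed to be the share of training emails carrying a given label among those containing the feature.

```python
import hashlib
from collections import Counter, defaultdict
from email.utils import getaddresses

# The five address headers named in the claim (dictionary keys are an assumption).
ADDRESS_FIELDS = ("from", "to", "reply-to", "cc", "bcc")


def segment_addresses(raw):
    """Split a header value into friendly, local, and domain segments per address."""
    segments = []
    for friendly, addr in getaddresses([raw]):
        if not addr:
            continue
        local, sep, domain = addr.rpartition("@")
        if not sep:  # no "@" present: treat the whole string as the local segment
            local, domain = addr, ""
        segments.append({"friendly": friendly.lower(),
                         "local": local.lower(),
                         "domain": domain.lower()})
    return segments


def feature_hash(field, segment, value):
    """Assumed hashing scheme: SHA-256 over structure plus value, truncated."""
    key = f"{field}|{segment}|{value}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def extract_features(email):
    """Map each hash to its feature structure (field, segment) and feature value."""
    features = {}
    for field in ADDRESS_FIELDS:
        for parts in segment_addresses(email.get(field, "")):
            for segment, value in parts.items():
                if value:  # skip empty segments, e.g. a missing friendly name
                    features[feature_hash(field, segment, value)] = ((field, segment), value)
    return features


def train(corpus):
    """Cross-tabulate feature hashes against labels over (email, label) pairs."""
    table = defaultdict(Counter)
    for email, label in corpus:
        for h in extract_features(email):
            table[h][label] += 1
    return table


def classify(email, table, labels=("spam", "ham")):
    """Sum per-feature label shares (assumed classification value); None if no feature is known."""
    scores = Counter()
    for h in extract_features(email):
        if h in table:
            total = sum(table[h].values())
            for label in labels:
                scores[label] += table[h][label] / total
    return max(labels, key=lambda l: scores[l]) if scores else None


# Tiny made-up training corpus, for illustration only.
TRAINING = [
    ({"from": "Mallory <m1@evil.example>", "to": "bob@corp.example"}, "spam"),
    ({"from": "Mallory <m9@evil.example>", "to": "bob@corp.example"}, "spam"),
    ({"from": "Alice <alice@corp.example>", "to": "bob@corp.example"}, "ham"),
    ({"from": "Carol <carol@corp.example>", "to": "bob@corp.example"}, "ham"),
]
TABLE = train(TRAINING)
LIVE = {"from": "Mallory <m2@evil.example>", "to": "bob@corp.example"}
print(classify(LIVE, TABLE))
```

Hashing the field and segment together with the value keeps the same string distinct when it appears in different positions, for example a domain seen in a from address versus a to address. In this sketch, live content is parsed with the same extractor used in training, and only hashes already present in the training table contribute to the result.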