| CPC G06F 16/353 (2019.01) [G06N 3/08 (2013.01)] | 18 Claims |

|
1. A method of classifying textual data blocks by at least one processor, the method comprising:
receiving textual data blocks in an original version, each of the textual data blocks comprising a plurality of textual data elements;
performing a preprocessing procedure on each of the textual data blocks in the original version, wherein the preprocessing procedure comprises:
replacing, for each of the textual data blocks, each of the textual data elements characterized by presence of a specific character or sequence of characters with a respective character-based token to generate a first partial tokenized version of the respective textual data block;
replacing, for each first partial tokenized version of the textual data blocks, each of the first partial tokenized versions further characterized by a specific contextual definition with a respective context-based token to generate a second partial tokenized version of the respective first partial tokenized version; and
replacing, for each second partial tokenized version of the textual data blocks, each of the second partial tokenized versions further characterized by pertinence to at least one specific part-of-speech (POS) category with a respective POS token, thereby obtaining, for each of the textual data blocks, the textual data block in a preprocessed tokenized version of the respective textual data block;
forming a training dataset comprising the textual data blocks in the preprocessed tokenized version labeled with an indication of pertinence to at least one class indicative of an email signature block;
training a machine learning-based (ML-based) model to classify textual data blocks by pertinence to the at least one class, based on the training dataset, wherein the ML-based model comprises an artificial neural network;
receiving a new textual data block in an original version, the new textual data block comprising a plurality of textual data elements, and performing the preprocessing procedure to obtain, for the new textual data block, the new textual data block in the preprocessed tokenized version; and
performing machine learning, using the trained ML-based model, on the new textual data block in the preprocessed tokenized version to classify the new textual data block by pertinence to the at least one class.
|
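The three-stage preprocessing procedure recited in claim 1 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the claim: the regular expressions, the token names such as `<EMAIL>` and `<CLOSING>`, the two small lexicons, and the helper names `replace_char_tokens`, `replace_by_lexicon`, and `preprocess`. In particular, the toy `POS_LEXICON` lookup merely stands in for a real part-of-speech tagger, and the training and classification steps of the claim are not shown.

```python
import re

# Stage 1: character-based tokens. These patterns are hypothetical;
# the claim does not enumerate specific characters or sequences.
CHAR_RULES = [
    (re.compile(r"\S+@\S+\.\S+"), "<EMAIL>"),
    (re.compile(r"https?://\S+|www\.\S+"), "<URL>"),
    (re.compile(r"\+?\d[\d\-\s().]{6,}\d"), "<PHONE>"),
]

# Stage 2: context-based tokens (hypothetical lexicon).
CONTEXT_LEXICON = {
    "regards": "<CLOSING>",
    "sincerely": "<CLOSING>",
    "manager": "<TITLE>",
    "engineer": "<TITLE>",
}

# Stage 3: POS tokens (toy lookup standing in for a real POS tagger).
POS_LEXICON = {
    "best": "<ADJ>",
    "senior": "<ADJ>",
    "jane": "<PROPN>",
    "doe": "<PROPN>",
}


def replace_char_tokens(block):
    """Replace elements matching a character pattern with a token."""
    for pattern, token in CHAR_RULES:
        block = pattern.sub(token, block)
    return block


def replace_by_lexicon(block, lexicon):
    """Replace whitespace-separated elements found in the lexicon.

    Elements that are already tokens (<...>) are left untouched.
    Surrounding punctuation is ignored for lookup and dropped on replacement.
    """
    out = []
    for element in block.split():
        if element.startswith("<") and element.endswith(">"):
            out.append(element)
            continue
        key = element.strip(".,;:!?").lower()
        out.append(lexicon.get(key, element))
    return " ".join(out)


def preprocess(block):
    """Apply the three replacement stages in the order recited in the claim."""
    first = replace_char_tokens(block)                    # first partial tokenized version
    second = replace_by_lexicon(first, CONTEXT_LEXICON)   # second partial tokenized version
    return replace_by_lexicon(second, POS_LEXICON)        # preprocessed tokenized version


if __name__ == "__main__":
    sample = "Best regards, Jane Doe Senior Engineer jane@example.com +1 555-123-4567"
    print(preprocess(sample))
```

Under these assumptions, each stage operates on the output of the previous one, so a token produced earlier (for example `<EMAIL>`) is never re-tokenized by a later stage. The resulting token sequences, rather than the raw text, would then be labeled and used as the training dataset.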