US 12,346,364 B2
System and method for classifying textual data blocks
Joy Chen, Menlo Park, CA (US); and Igal Mazor, Tel-Aviv (IL)
Assigned to Genesys Cloud Services, Inc., Menlo Park, CA (US)
Filed by GENESYS CLOUD SERVICES, INC., Menlo Park, CA (US)
Filed on Dec. 22, 2022, as Appl. No. 18/087,069.
Prior Publication US 2024/0211503 A1, Jun. 27, 2024
Int. Cl. G06F 16/35 (2025.01); G06F 16/353 (2025.01); G06N 3/08 (2023.01)
CPC G06F 16/353 (2019.01) [G06N 3/08 (2013.01)]
18 Claims
 
1. A method of classifying textual data blocks by at least one processor, the method comprising:
receiving textual data blocks in an original version, each of the textual data blocks comprising a plurality of textual data elements;
performing a preprocessing procedure on each of the textual data blocks in the original version, wherein the preprocessing procedure comprises:
replacing, for each of the textual data blocks, each of the textual data elements characterized by presence of a specific character or sequence of characters with a respective character-based token to generate a first partial tokenized version of the respective textual data block;
replacing, for each first partial tokenized version of the textual data blocks, each of the first partial tokenized versions further characterized by a specific contextual definition with a respective context-based token to generate a second partial tokenized version of the respective first partial tokenized version; and
replacing, for each second partial tokenized version of the textual data blocks, each of the second partial tokenized versions further characterized by pertinence to at least one specific part-of-speech (POS) category with a respective POS token, thereby obtaining, for each of the textual data blocks, the textual data block in a preprocessed tokenized version of the respective textual data block;
forming a training dataset comprising the textual data blocks in the preprocessed tokenized version labeled with an indication of pertinence to at least one class indicative of an email signature block;
training a machine learning-based (ML-based) model to classify textual data blocks by pertinence to the at least one class, based on the training dataset, wherein the ML-based model comprises an artificial neural network;
receiving a new textual data block in an original version, the new textual data block comprising a plurality of textual data elements, and performing the preprocessing procedure to obtain, for the new textual data block, the new textual data block in the preprocessed tokenized version; and
performing machine learning, using the trained ML-based model, on the new textual data block in the preprocessed tokenized version to classify the new textual data block by pertinence to the at least one class.
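
The three-stage preprocessing procedure recited in claim 1 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the regular expressions, the token names (`<EMAIL>`, `<URL>`, `<PHONE>`, `<TITLE>`, `<CLOSING>`, `<PROPN>`), the small context lexicons, and the capitalization heuristic that stands in for a real part-of-speech tagger. The output of `preprocess` corresponds to the "preprocessed tokenized version" that would then be labeled and passed to the ML-based model; the model itself is not shown.

```python
import re

# Stage 1 (hypothetical rules): elements recognized by specific characters
# or character sequences are replaced by character-based tokens.
CHAR_RULES = [
    (re.compile(r"\S+@\S+\.\S+"), "<EMAIL>"),
    (re.compile(r"https?://\S+|www\.\S+"), "<URL>"),
    (re.compile(r"\+?\d[\d\- ().]{6,}\d"), "<PHONE>"),
]

# Stage 2 (hypothetical lexicons): words with a specific contextual
# meaning are replaced by context-based tokens.
CONTEXT_RULES = {
    "<TITLE>": {"senior", "engineer", "manager", "director"},
    "<CLOSING>": {"regards", "sincerely", "thanks", "cheers"},
}
CONTEXT_LOOKUP = {w: tok for tok, words in CONTEXT_RULES.items() for w in words}


def _map_words(block, to_token):
    """Replace each not-yet-tokenized word for which to_token returns a token."""
    out = []
    for word in block.split():
        if re.fullmatch(r"<[A-Z]+>", word):
            out.append(word)  # already replaced by an earlier stage
            continue
        token = to_token(word.strip(".,;:!?"))
        out.append(token if token else word)
    return " ".join(out)


def replace_char_based(block):
    for pattern, token in CHAR_RULES:
        block = pattern.sub(token, block)
    return block


def replace_context_based(block):
    return _map_words(block, lambda w: CONTEXT_LOOKUP.get(w.lower()))


def replace_pos_based(block):
    # Stage 3: toy stand-in for a POS tagger (assumption): a capitalized
    # word is treated as a proper noun; all other words are left as-is.
    return _map_words(block, lambda w: "<PROPN>" if w[:1].isupper() else None)


def preprocess(block):
    """Apply the three replacement stages in the order recited in the claim."""
    first_partial = replace_char_based(block)
    second_partial = replace_context_based(first_partial)
    return replace_pos_based(second_partial)


signature = "Regards,\nJoy Smith\nSenior Engineer\njoy@example.com\n+1 650-555-0100"
print(preprocess(signature))
```

Because names, addresses, and phone numbers collapse into a small set of tokens, signature blocks from different senders would map to similar token sequences under this sketch, which is presumably what makes the tokenized form a convenient input for training a classifier.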