| CPC G06F 40/205 (2020.01) [G06N 20/00 (2019.01)] | 20 Claims |

|
1. One or more non-transitory computer-readable media having encoded thereon computer-executable instructions for causing one or more processors, when programmed thereby, to perform operations comprising:
receiving text-based content comprising a plurality of characters;
generating a plurality of character category sequences for the text-based content using the plurality of characters, wherein each character category sequence, among the plurality of character category sequences, comprises multiple character category identifiers for multiple characters, respectively, of a different character sequence in the text-based content, and wherein each character category identifier, among the multiple character category identifiers, identifies one of a plurality of predefined character categories;
calculating a frequency distribution of the plurality of character category sequences, wherein the frequency distribution represents relative frequencies of occurrence of unique character category sequences among a total number of the plurality of character category sequences; and
classifying, using a machine learning model, the text-based content into one of multiple classes based on the calculated frequency distribution of the plurality of character category sequences, wherein a total number of the plurality of predefined character categories is smaller than a total number of possible unique characters such that dimension of input to the machine learning model is reduced as compared with input based on a frequency distribution of character sequences for the text-based content.
|
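The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The six character categories, the sequence length of 2, the helper names (`char_category`, `category_sequences`, `frequency_distribution`), and the toy nearest-centroid classifier standing in for the "machine learning model" are all assumptions made for illustration; the claim does not specify any of them.

```python
import math
import string
from collections import Counter
from itertools import product

# Hypothetical category set (not from the claim): Upper, Lower, Digit,
# Space, Punctuation, Other. Six categories instead of every possible character.
CATEGORIES = "ULDSPO"

def char_category(ch):
    """Map one character to a single category identifier."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "D"
    if ch.isspace():
        return "S"
    if ch in string.punctuation:
        return "P"
    return "O"

def category_sequences(text, n=2):
    """One category sequence per length-n character window in the text."""
    cats = [char_category(c) for c in text]
    return ["".join(cats[i:i + n]) for i in range(len(cats) - n + 1)]

def frequency_distribution(text, n=2):
    """Relative frequency of each possible category sequence, in a fixed order."""
    seqs = category_sequences(text, n)
    counts = Counter(seqs)
    total = len(seqs)
    keys = ["".join(p) for p in product(CATEGORIES, repeat=n)]
    return [counts[k] / total if total else 0.0 for k in keys]

class NearestCentroid:
    """Toy stand-in for the machine learning model (an assumption)."""

    def fit(self, vectors, labels):
        sums = {}
        counts = Counter(labels)
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, vec):
        # Pick the class whose mean feature vector is closest (Euclidean).
        return min(self.centroids, key=lambda lbl: math.dist(vec, self.centroids[lbl]))

# Made-up training samples, for illustration only.
samples = [
    ("The quick brown fox jumps over the lazy dog.", "prose"),
    ("Patent claims are written in long sentences.", "prose"),
    ("x=f(a[0],b[1]);y+=2*x;", "code"),
    ("for(i=0;i<10;i++){s+=i;}", "code"),
]
model = NearestCentroid().fit(
    [frequency_distribution(text) for text, _ in samples],
    [label for _, label in samples],
)
predicted = model.predict(frequency_distribution("if(n>3){k=k+1;}"))
```

With six categories and sequences of length 2, the model input has 6² = 36 dimensions. A frequency distribution over raw character pairs would need one dimension per possible pair (128² = 16,384 for 7-bit ASCII alone), which is the dimensionality reduction the final wherein clause describes.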