US 12,462,100 B2
Intelligent classification of text-based content
Tonu Vanatalu, Tallinn (EE)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Apr. 21, 2023, as Appl. No. 18/137,933.
Prior Publication US 2024/0354500 A1, Oct. 24, 2024
Int. Cl. G06F 40/205 (2020.01); G06N 20/00 (2019.01)
CPC G06F 40/205 (2020.01) [G06N 20/00 (2019.01)]
20 Claims
 
1. One or more non-transitory computer-readable media having encoded thereon computer-executable instructions for causing one or more processors, when programmed thereby, to perform operations comprising:
receiving text-based content comprising a plurality of characters;
generating a plurality of character category sequences for the text-based content using the plurality of characters, wherein each character category sequence, among the plurality of character category sequences, comprises multiple character category identifiers for multiple characters, respectively, of a different character sequence in the text-based content, and wherein each character category identifier, among the multiple character category identifiers, identifies one of a plurality of predefined character categories;
calculating a frequency distribution of the plurality of character category sequences, wherein the frequency distribution represents relative frequencies of occurrence of unique character category sequences among a total number of the plurality of character category sequences; and
classifying, using a machine learning model, the text-based content into one of multiple classes based on the calculated frequency distribution of the plurality of character category sequences, wherein a total number of the plurality of predefined character categories is smaller than a total number of possible unique characters such that dimension of input to the machine learning model is reduced as compared with input based on a frequency distribution of character sequences for the text-based content.
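
The operations recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the five-category scheme (`U`, `L`, `D`, `S`, `P`), the sequence length `n = 2`, the function names, and the nearest-centroid classifier standing in for the "machine learning model" are all assumptions chosen for illustration, since the claim fixes none of them.

```python
from collections import Counter
from itertools import product
from typing import Dict, List
import math

# Hypothetical category scheme (an assumption, not taken from the claim):
# U = uppercase, L = lowercase, D = digit, S = whitespace, P = punctuation/other.
CATEGORIES = "ULDSP"


def char_category(ch: str) -> str:
    """Map one character to one of the predefined character categories."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "D"
    if ch.isspace():
        return "S"
    return "P"


def category_sequences(text: str, n: int = 2) -> List[str]:
    """Slide a window of n characters over the text and emit, for each
    window, the sequence of category identifiers of its characters."""
    cats = "".join(char_category(c) for c in text)
    return [cats[i:i + n] for i in range(len(cats) - n + 1)]


def frequency_distribution(seqs: List[str], n: int = 2) -> List[float]:
    """Relative frequency of each possible category sequence among all
    sequences, returned as a fixed-length vector of len(CATEGORIES)**n."""
    counts = Counter(seqs)
    total = len(seqs)
    vocab = ["".join(p) for p in product(CATEGORIES, repeat=n)]
    return [counts[v] / total if total else 0.0 for v in vocab]


def featurize(text: str, n: int = 2) -> List[float]:
    """Text -> category sequences -> frequency distribution."""
    return frequency_distribution(category_sequences(text, n), n)


class NearestCentroid:
    """Stand-in classifier (an assumption): predicts the class whose mean
    feature vector is closest in Euclidean distance."""

    def fit(self, X: List[List[float]], y: List[str]) -> "NearestCentroid":
        sums: Dict[str, List[float]] = {}
        counts = Counter(y)
        for vec, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, v in enumerate(vec):
                acc[i] += v
        self.centroids = {
            label: [s / counts[label] for s in acc]
            for label, acc in sums.items()
        }
        return self

    def predict(self, vec: List[float]) -> str:
        return min(
            self.centroids,
            key=lambda label: math.dist(vec, self.centroids[label]),
        )
```

Under these assumptions, `featurize` always returns a 25-element vector (5 categories, sequences of length 2), whereas a frequency distribution over raw two-character sequences drawn from the 95 printable ASCII characters alone would need 95 × 95 = 9,025 dimensions. This is the reduction in input dimension that the final "wherein" clause describes. A classifier would be trained by calling `NearestCentroid().fit(...)` on feature vectors of labeled sample texts and then `predict(featurize(new_text))`.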