US 12,242,807 B2
Tokenizing alphanumeric text through use of finite state machines
Siarhei Alonichau, Seattle, WA (US); and Junaid Ahmed, Bellevue, WA (US)
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Mar. 2, 2021, as Appl. No. 17/190,344.
Prior Publication US 2022/0284190 A1, Sep. 8, 2022
Int. Cl. G06F 40/284 (2020.01); G06F 40/237 (2020.01); G06F 40/289 (2020.01)
CPC G06F 40/284 (2020.01)
10 Claims
 
1. A computing system comprising:
a processor; and
memory storing instructions that, when executed by the processor, cause the processor to perform acts comprising:
receiving a request to tokenize alphanumeric text that includes a word; and
tokenizing the alphanumeric text such that a sequence of numeric identifiers that represents the alphanumeric text is output, wherein tokenizing the alphanumeric text comprises:
providing the alphanumeric text to a computer-implemented finite state machine, where the finite state machine includes a final state;
generating at least one token for the word based upon a value assigned to the final state of the computer-implemented finite state machine, where each token in the at least one token is included in a predefined vocabulary employed by the computing system when tokenizing text; and
outputting at least one numeric identifier as a representation of the word based upon the at least one token,
wherein the at least one numeric identifier is identified from a predefined set of numeric identifiers, and further wherein a class is assigned to the alphanumeric text by a computer-implemented text classifier based upon the output sequence of numeric identifiers.
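
The tokenizing acts recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a trie-shaped finite state machine in which each final state carries the numeric identifier of a vocabulary token, and it assumes greedy longest-match segmentation with "##"-prefixed continuation tokens. The vocabulary, the identifiers, and the names State, build_fsm, longest_match, and tokenize are hypothetical and chosen only for illustration. The text classifier that consumes the identifier sequence is not shown.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical predefined vocabulary: token -> numeric identifier (assumed values).
VOCAB: Dict[str, int] = {
    "[UNK]": 0, "token": 1, "##ize": 2, "##r": 3, "text": 4, "##s": 5,
}


class State:
    """One FSM state; `value` is set only when the state is final."""

    def __init__(self) -> None:
        self.transitions: Dict[str, "State"] = {}
        self.value: Optional[int] = None


def build_fsm(entries: Dict[str, int]) -> State:
    """Build a trie-shaped FSM; the final state of each entry stores its identifier."""
    start = State()
    for text, token_id in entries.items():
        state = start
        for ch in text:
            state = state.transitions.setdefault(ch, State())
        state.value = token_id
    return start


# Assumption: separate machines for word-initial and continuation ("##") tokens.
INITIAL_FSM = build_fsm(
    {t: i for t, i in VOCAB.items() if not t.startswith("##") and t != "[UNK]"}
)
CONTINUATION_FSM = build_fsm(
    {t[2:]: i for t, i in VOCAB.items() if t.startswith("##")}
)


def longest_match(fsm: State, word: str, pos: int) -> Optional[Tuple[int, int]]:
    """Run the FSM from `pos`; return (end, value) of the last final state reached."""
    state: Optional[State] = fsm
    best = None
    for i in range(pos, len(word)):
        state = state.transitions.get(word[i])
        if state is None:
            break
        if state.value is not None:
            best = (i + 1, state.value)
    return best


def tokenize(text: str) -> List[int]:
    """Map whitespace-separated words to a sequence of numeric identifiers."""
    ids: List[int] = []
    for word in text.lower().split():
        pos, word_ids = 0, []
        while pos < len(word):
            fsm = INITIAL_FSM if pos == 0 else CONTINUATION_FSM
            match = longest_match(fsm, word, pos)
            if match is None:
                # No vocabulary token fits: represent the whole word as unknown.
                word_ids = [VOCAB["[UNK]"]]
                break
            pos, token_id = match
            word_ids.append(token_id)
        ids.extend(word_ids)
    return ids
```

With this toy vocabulary, tokenize("tokenizer texts") returns [1, 2, 3, 4, 5], and a word that cannot be segmented, such as "xyz", is represented by the single identifier 0. The resulting list is the kind of sequence that a downstream classifier would take as input.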