| CPC G06F 40/284 (2020.01) [G06F 16/322 (2019.01); G06F 40/40 (2020.01)] | 20 Claims |

|
1. A computer-implemented method comprising:
performing, by one or more processors of a processing system, tokenization of a string of text, comprising:
analyzing a set of nodes of a vocabulary trie structure to identify one or more links between nodes of the vocabulary trie structure corresponding to one or more characters of the string;
identifying a fail link between a pair of nodes in the set of nodes;
storing, based on the fail link, a first token representing a word or wordpiece comprised of a pair of characters of the string;
storing a second token representing a word or wordpiece comprised of another character of the string; and
concatenating the first token and the second token to form an array of tokens; and
providing, by the one or more processors, the array of tokens to a neural network for natural language processing.
|
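The tokenization steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes a WordPiece-style vocabulary in which continuation pieces carry a `##` prefix, and it keeps those pieces under a separate suffix root, a simplification chosen for this example. The names `Node`, `build_trie`, `tokenize_word`, and the `[UNK]` fallback are hypothetical. Each node holds a fail link plus the tokens to emit when that link is followed, in the manner of Aho-Corasick failure links, so the string is scanned once from left to right.

```python
from collections import deque

UNK = "[UNK]"  # assumed placeholder for words that cannot be tokenized


class Node:
    """One trie node: child links, optional vocab token, fail link, tokens to emit."""
    __slots__ = ("children", "token", "fail", "pops")

    def __init__(self):
        self.children = {}   # character -> child Node
        self.token = None    # vocab entry ending here, if any
        self.fail = None     # node to jump to when no child link matches
        self.pops = []       # tokens stored when the fail link is followed


def build_trie(vocab, suffix_marker="##"):
    """Build the trie and precompute fail links in breadth-first order."""
    root, suffix_root = Node(), Node()
    for token in vocab:
        if token.startswith(suffix_marker):
            node, chars = suffix_root, token[len(suffix_marker):]
        else:
            node, chars = root, token
        if not chars:
            continue  # skip empty pieces in this sketch
        for ch in chars:
            node = node.children.setdefault(ch, Node())
        node.token = token

    queue = deque([root, suffix_root])
    while queue:
        parent = queue.popleft()
        for ch, node in parent.children.items():
            if node.token is not None:
                # A complete piece: emit it and restart from the suffix root.
                node.fail, node.pops = suffix_root, [node.token]
            else:
                # Walk the parent's fail chain until some node has a link for ch.
                z, skipped = parent.fail, []
                while z is not None and ch not in z.children:
                    skipped.extend(z.pops)
                    z = z.fail
                if z is not None:
                    node.fail = z.children[ch]
                    node.pops = parent.pops + skipped
            queue.append(node)
    return root, suffix_root


def tokenize_word(word, root, suffix_root):
    """Greedy longest-match tokenization of one word using the fail links."""
    tokens, node = [], root
    for ch in word:
        while ch not in node.children:
            if node.fail is None:
                return [UNK]
            tokens.extend(node.pops)
            node = node.fail
        node = node.children[ch]
    # Flush whatever is still pending once the characters are exhausted.
    while node is not root and node is not suffix_root:
        if node.fail is None:
            return [UNK]
        tokens.extend(node.pops)
        node = node.fail
    return tokens


root, suffix_root = build_trie(["ab", "##c"])
print(tokenize_word("abc", root, suffix_root))  # ['ab', '##c']
```

In the usage lines, the fail link at the node for `ab` causes the two-character token `ab` to be stored first, the remaining character yields `##c`, and the two are returned as one list. Mapping that list to integer ids and passing it to a neural network is outside the scope of this sketch.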