US 12,456,012 B2
Inference methods for word or wordpiece tokenization
Xinying Song, Bellevue, WA (US); and Yang Song, Bellevue, WA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jun. 5, 2023, as Appl. No. 18/205,609.
Application 18/205,609 is a continuation of application No. 17/798,638, granted, now Pat. No. 11,763,083, previously published as PCT/US2020/033419, filed on May 18, 2020.
Prior Publication US 2024/0054288 A1, Feb. 15, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 40/284 (2020.01); G06F 16/31 (2019.01); G06F 40/30 (2020.01); G06F 40/40 (2020.01)
CPC G06F 40/284 (2020.01) [G06F 16/322 (2019.01); G06F 40/40 (2020.01)]
20 Claims
 
1. A computer-implemented method comprising:
performing, by one or more processors of a processing system, tokenization of a string of text, comprising:
analyzing a set of nodes of a vocabulary trie structure to identify one or more links between nodes of the vocabulary trie structure corresponding to one or more characters of the string;
identifying a fail link between a pair of nodes in the set of nodes;
storing, based on the fail link, a first token representing a word or wordpiece comprised of a pair of characters of the string;
storing a second token representing a word or wordpiece comprised of another character of the string; and
concatenating the first token and the second token to form an array of tokens; and
providing, by the one or more processors, the array of tokens to a neural network for natural language processing.
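
The tokenization steps recited in claim 1 can be illustrated with a minimal Python sketch. It is one possible reading of the claim language, not the patented implementation. The names `Node`, `build_tries`, and `tokenize_word`, the "##" suffix marker, the `<unk>` fallback, the use of a separate trie for suffix wordpieces, and the toy vocabulary are all assumptions made for illustration. Each node carries a fail link and a list of tokens to store when that link is followed, so the input is scanned once, left to right.

```python
from collections import deque


class Node:
    """Trie node; all field names are hypothetical."""

    def __init__(self):
        self.children = {}  # character -> child Node
        self.token = None   # vocabulary entry ending at this node, if any
        self.fail = None    # fail link; stays None for both roots
        self.pops = []      # tokens stored when the fail link is followed


def build_tries(vocab, suffix_marker="##"):
    """Build a word-initial trie and a suffix trie, then add fail links."""
    root, suffix_root = Node(), Node()
    for tok in vocab:
        if tok.startswith(suffix_marker):
            node, text = suffix_root, tok[len(suffix_marker):]
        else:
            node, text = root, tok
        for ch in text:
            node = node.children.setdefault(ch, Node())
        node.token = tok

    # Breadth-first pass so shallower nodes are finished before deeper ones.
    queue = deque([root, suffix_root])
    while queue:
        u = queue.popleft()
        for ch, v in u.children.items():
            if v.token is not None:
                # A complete entry: store it, then continue with suffixes.
                v.fail, v.pops = suffix_root, [v.token]
            else:
                z, skipped = u.fail, []
                while z is not None and ch not in z.children:
                    skipped += z.pops
                    z = z.fail
                if z is not None:
                    v.fail, v.pops = z.children[ch], u.pops + skipped
            queue.append(v)
    return root, suffix_root


def tokenize_word(word, root, suffix_root, unk="<unk>"):
    """Tokenize one whitespace-free word in a single left-to-right pass."""
    tokens, node, i = [], root, 0
    text = word + " "  # sentinel; assumes no vocabulary entry contains a space
    while i < len(text):
        ch = text[i]
        if ch in node.children:      # follow a trie link for this character
            node, i = node.children[ch], i + 1
        elif node.fail is not None:  # follow the fail link, storing its tokens
            tokens.extend(node.pops)
            node = node.fail
        else:
            break
    if i < len(word) or not (node is root or node is suffix_root):
        return [unk]
    return tokens


root, suffix_root = build_tries(["ab", "abcz", "##c"])
print(tokenize_word("abc", root, suffix_root))  # ['ab', '##c']
```

In this toy run the node reached by "abc" is not a vocabulary entry, so its fail link stores the two-character token "ab" and moves to the suffix node for "c", which stores "##c". The resulting list is the array of tokens that would then be passed on, for example as vocabulary ids, to a neural network.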