US 12,087,279 B2
Regularizing word segmentation
Bhuvana Ramabhadran, Mt. Kisco, NY (US); Hainan Xu, Mountain View, CA (US); Kartik Audhkhasi, Mountain View, CA (US); and Yinghui Huang, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 23, 2022, as Appl. No. 17/656,225.
Claims priority of provisional application 63/166,958, filed on Mar. 26, 2021.
Prior Publication US 2022/0310061 A1, Sep. 29, 2022
Int. Cl. G10L 15/02 (2006.01); G06F 40/284 (2020.01); G06N 3/04 (2023.01); G10L 15/04 (2013.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 25/30 (2013.01)
CPC G10L 15/04 (2013.01) [G06F 40/284 (2020.01); G06N 3/04 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 25/30 (2013.01); G10L 15/02 (2013.01)]
18 Claims
 
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving an input word to be segmented into a plurality of subword units; and
executing a subword segmentation routine to segment the input word into a plurality of subword units by:
accessing a trained vocabulary set of subword units; and
selecting the plurality of subword units from the input word by greedily finding a longest subword unit from the input word that is present in the trained vocabulary set until an end of the input word is reached, wherein selecting the plurality of subword units comprises, for each corresponding position of a plurality of different positions of the input word:
identifying all possible candidate subword units from the input word at the corresponding position that are present in the trained vocabulary set; and
randomly sampling from all of the possible candidate subword units by assigning a 1−p probability to a longest one of the possible candidate subword units and dividing a rest of the p probability evenly among all of the possible candidate subword units from the input word at the corresponding position.
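
The selection step recited above can be illustrated with a minimal Python sketch. It is one possible reading of the claim language, not the patented implementation. With n candidates at a position, each candidate receives p/n and the longest candidate additionally receives 1−p, so the weights sum to 1. The function names (candidates_at, sampling_weights, segment), the toy vocabulary, and the single-character fallback used when no vocabulary entry matches are all assumptions introduced for illustration.

```python
import random


def candidates_at(word, start, vocab):
    """Return every vocabulary subword that begins at `start`, shortest first."""
    return [
        word[start:end]
        for end in range(start + 1, len(word) + 1)
        if word[start:end] in vocab
    ]


def sampling_weights(num_candidates, p):
    """Split p evenly over all candidates; the longest (last) also gets 1 - p."""
    weights = [p / num_candidates] * num_candidates
    weights[-1] += 1.0 - p
    return weights


def segment(word, vocab, p=0.1, rng=None):
    """Segment `word` left to right, sampling one candidate at each position."""
    rng = rng or random.Random()
    pieces = []
    pos = 0
    while pos < len(word):
        cands = candidates_at(word, pos, vocab)
        if not cands:
            # Assumption (not in the claim): fall back to a single character
            # when no vocabulary entry starts at this position.
            cands = [word[pos]]
        weights = sampling_weights(len(cands), p)
        piece = rng.choices(cands, weights=weights, k=1)[0]
        pieces.append(piece)
        pos += len(piece)  # continue after the chosen subword
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"u", "n", "h", "a", "p", "y", "un", "happy"}
```

With p set to 0 the sketch reduces to plain greedy longest-match, so segment("unhappy", TOY_VOCAB, p=0.0) returns ['un', 'happy']. Larger values of p make shorter candidates more likely, which yields varied segmentations of the same word.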