US 11,947,909 B2
Training a language detection model for language autodetection from non-character sub-token signals
Andrew Stuart Glass, Seattle, WA (US); Margaret Hope Magnus, Everett, WA (US); and Roland Radtke, Brier, WA (US)
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Apr. 17, 2023, as Appl. No. 18/301,341.
Application 18/301,341 is a continuation of application No. 17/839,330, filed on Jun. 13, 2022, granted, now 11,630,951.
Application 17/839,330 is a continuation of application No. 17/024,428, filed on Sep. 17, 2020, granted, now 11,361,158, issued on Jun. 14, 2022.
Prior Publication US 2023/0252235 A1, Aug. 10, 2023
Int. Cl. G06F 40/263 (2020.01)
CPC G06F 40/263 (2020.01) 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for training a language detection model to detect a language of text, the method comprising:
a first training phase, comprising:
identifying a list of tokens from a corpus, the identifying comprising:
extracting affixes from each word in the corpus and storing each unique affix as a token in the list of tokens; and
identifying word stems after the extracting the affixes and storing each unique word stem as a token in the list of tokens; and
a second training phase, comprising:
assigning weights from the corpus to the list of tokens by training the list of tokens against the corpus using a weighting engine.
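The two training phases recited in claim 1 can be sketched in code. The sketch below is a hypothetical illustration only, not the patented implementation: the affix inventory (`AFFIXES`), the stripping heuristic in `split_affixes`, and the frequency-based weighting used as a stand-in for the claimed "weighting engine" are all assumptions introduced for this example.

```python
from collections import Counter

# Assumed affix inventory for illustration; the patent does not specify one.
AFFIXES = {"prefixes": ["un", "re"], "suffixes": ["ing", "ed", "s"]}

def split_affixes(word):
    """Strip at most one known prefix and one known suffix (hypothetical
    heuristic), returning (list_of_affixes, remaining_stem)."""
    affixes = []
    for p in AFFIXES["prefixes"]:
        if word.startswith(p) and len(word) > len(p) + 2:
            affixes.append(p)
            word = word[len(p):]
            break
    for s in AFFIXES["suffixes"]:
        if word.endswith(s) and len(word) > len(s) + 2:
            affixes.append(s)
            word = word[:-len(s)]
            break
    return affixes, word

def first_phase(corpus):
    """First training phase: identify the list of tokens by extracting
    affixes and then identifying word stems, keeping each unique one."""
    tokens = set()
    for word in corpus:
        affixes, stem = split_affixes(word)
        tokens.update(affixes)  # each unique affix becomes a token
        tokens.add(stem)        # each unique word stem becomes a token
    return sorted(tokens)

def second_phase(tokens, corpus):
    """Second training phase: assign weights to the token list from the
    corpus. Relative frequency is used here as a simple stand-in for the
    claimed weighting engine."""
    counts = Counter()
    for word in corpus:
        affixes, stem = split_affixes(word)
        for t in affixes + [stem]:
            counts[t] += 1
    total = sum(counts.values())
    return {t: counts[t] / total for t in tokens}

corpus = ["walking", "walked", "walks", "rewalking"]
tokens = first_phase(corpus)          # ['ed', 'ing', 're', 's', 'walk']
weights = second_phase(tokens, corpus)
```

In this toy run, the stem "walk" receives the largest weight because it occurs in every corpus word, while each affix token is weighted by how often it is extracted; a production weighting engine would instead learn weights discriminative of language identity.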