US 12,353,839 B2
On-device streaming inverse text normalization (ITN)
Yashesh Gaur, Redmond, WA (US); Nicholas Kibre, Redwood City, CA (US); Issac J. Alphonso, San Jose, CA (US); Jian Xue, Bellevue, WA (US); Jinyu Li, Sammamish, WA (US); Piyush Behre, Santa Clara, CA (US); and Shuangyu Chang, Davis, CA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Mar. 11, 2022, as Appl. No. 17/693,267.
Prior Publication US 2023/0289536 A1, Sep. 14, 2023
Int. Cl. G06F 40/00 (2020.01); G06F 40/284 (2020.01); G06F 40/56 (2020.01); G10L 15/08 (2006.01)
CPC G06F 40/56 (2020.01) [G06F 40/284 (2020.01); G10L 15/08 (2013.01)]
18 Claims
 
1. A system comprising:
a processor; and
a computer-readable medium storing instructions that are operative upon execution by the processor to:
receive a stream of tokens, each token representing an element of human speech;
chunk the stream of tokens;
tag, by a tagger, the stream of tokens with one or more tags of a plurality of tags to produce a tagged stream of tokens by chunks in a streaming manner, each tag of the plurality of tags representing a different normalization category of a plurality of normalization categories, wherein the tagger comprises a neural network using self-attention to compute representations of input and output;
detect, by each category-specific natural language converter of a plurality of category-specific natural language converters, each of the plurality of category-specific natural language converters comprising a weighted finite state transducer (WFST), from the tagged stream of tokens, a tag representing a normalization category of the plurality of normalization categories upon which each category-specific natural language converter is trained to operate, wherein each category-specific natural language converter is trained for a single normalization category of the plurality of normalization categories by each respective trainer of a plurality of trainers;
upon detecting a first tag representing a first normalization category, convert, by a first language converter of the plurality of category-specific natural language converters, a first token of the tagged stream of tokens, from a first lexical language form to a first natural language form, wherein the first language converter is trained to operate upon the first normalization category, and wherein the first token is associated with the first tag;
upon detecting a second tag representing a second normalization category, convert, in parallel with converting by the first language converter, by a second language converter of the plurality of category-specific natural language converters, a second token of the tagged stream of tokens from a second lexical language form to a second natural language form, wherein the second language converter is trained to operate upon the second normalization category, and wherein the second token is associated with the second tag; and
based on at least the first natural language form, output a natural language representation of the stream of tokens.
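
The data flow recited in claim 1 (chunking, tagging, category-specific conversion in parallel, output) can be illustrated with a minimal Python sketch. This is not the patented implementation. The lookup-table tagger stands in for the self-attention neural tagger, and the dictionary-based converters stand in for the trained WFSTs. All names here (`chunk`, `tag`, `group_spans`, `CONVERTERS`, `normalize_stream`, and the `CARDINAL` and `ORDINAL` categories) are hypothetical and chosen only for illustration. The sketch also assumes a tagged span never straddles a chunk boundary, which a real streaming system would have to handle.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical normalization categories; "O" marks tokens needing no conversion.
O, CARDINAL, ORDINAL = "O", "CARDINAL", "ORDINAL"

# Toy vocabularies (assumptions for illustration only).
NUM = {"five": 5, "twenty": 20}
ORD = {"first": "1st", "second": "2nd", "third": "3rd"}


def chunk(tokens: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split a token stream into fixed-size chunks (the last may be shorter)."""
    buf: List[str] = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Stand-in tagger: a table lookup instead of a self-attention network."""
    return [(t, CARDINAL if t in NUM else ORDINAL if t in ORD else O)
            for t in tokens]


def group_spans(tagged: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Merge consecutive tokens that share a tag into (tag, words) spans."""
    return [(label, [tok for tok, _ in run])
            for label, run in groupby(tagged, key=lambda pair: pair[1])]


# Stand-in converters, one per category, each handling only its own tag.
CONVERTERS: Dict[str, Callable[[List[str]], str]] = {
    CARDINAL: lambda words: str(sum(NUM[w] for w in words)),
    ORDINAL: lambda words: " ".join(ORD[w] for w in words),
}


def normalize_stream(tokens: Iterable[str], chunk_size: int = 4) -> str:
    """Chunk, tag, convert tagged spans concurrently, and join the output."""
    out: List[str] = []
    with ThreadPoolExecutor() as pool:
        for piece in chunk(tokens, chunk_size):
            spans = group_spans(tag(piece))
            # Submit every convertible span before collecting any result,
            # so converters for different categories can run concurrently.
            futures = [pool.submit(CONVERTERS[label], words)
                       if label in CONVERTERS else None
                       for label, words in spans]
            for (label, words), fut in zip(spans, futures):
                out.append(fut.result() if fut else " ".join(words))
    return " ".join(out)
```

For example, under these assumptions, `normalize_stream("meet me on the third floor at gate twenty five".split(), chunk_size=5)` returns `"meet me on the 3rd floor at gate 25"`. Collecting results in span order keeps the output aligned with the input stream even though the converters run concurrently.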