US 12,087,306 B1
Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion
Duc Hoang Le, Sunnyvale, CA (US); FNU Mahaveer, Foster City, CA (US); Gil Keren, San Francisco, CA (US); Christian Fuegen, Kingston Upon Thames (GB); and Yatharth Saraf, Redwood City, CA (US)
Assigned to Meta Platforms, Inc., Menlo Park, CA (US)
Filed by Meta Platforms, Inc., Menlo Park, CA (US)
Filed on Nov. 24, 2021, as Appl. No. 17/535,005.
Int. Cl. G10L 15/16 (2006.01); G10L 15/28 (2013.01)
CPC G10L 15/28 (2013.01) [G10L 15/16 (2013.01)]
20 Claims
 
1. A method comprising, by a computing system:
receiving an utterance spoken by a user, the utterance comprising a word in a custom vocabulary list of the user;
generating a previous token to represent a previous audio portion of the utterance; and
generating a current token to represent a current audio portion of the utterance by:
generating a bias embedding by using the previous token to query a trie of wordpieces representing the custom vocabulary list, wherein the trie is based on biasing words;
generating, based on the bias embedding and the current audio portion, first probabilities of respective first candidate tokens likely uttered in the current audio portion;
generating, based on the previous token and the bias embedding, second probabilities of respective second candidate tokens likely uttered after the previous token; and
generating, based on the first probabilities of the respective first candidate tokens and the second probabilities of the respective second candidate tokens, the current token to represent the current audio portion of the utterance.
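The steps recited in claim 1 can be illustrated with a toy sketch. This is not the patented implementation: the multi-hot "bias embedding", the additive `bias_bonus`, the log-linear `weight`, the hand-written score lists, and all names (`WordpieceTrie`, `bias_vector`, `biased_probs`, `pick_token`) are assumptions made purely for illustration. In a real system the two probability sets would come from neural networks rather than fixed lists.

```python
import math


class WordpieceTrie:
    """Toy trie whose edges are wordpieces of the biasing words (hypothetical helper)."""

    def __init__(self):
        self.children = {}

    def insert(self, pieces):
        node = self
        for piece in pieces:
            node = node.children.setdefault(piece, WordpieceTrie())

    def next_pieces(self, prefix):
        """Return the wordpieces that may follow `prefix`, or an empty set if off-trie."""
        node = self
        for piece in prefix:
            node = node.children.get(piece)
            if node is None:
                return set()
        return set(node.children)


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bias_vector(trie, previous_tokens, vocab):
    # Assumption: the "bias embedding" is simplified to a multi-hot vector
    # marking which vocabulary entries the trie allows next.
    allowed = trie.next_pieces(previous_tokens)
    return [1.0 if tok in allowed else 0.0 for tok in vocab]


def biased_probs(raw_scores, bias, bias_bonus=1.0):
    # Assumption: biasing is modeled as an additive bonus before the softmax.
    return softmax([s + bias_bonus * b for s, b in zip(raw_scores, bias)])


def pick_token(first_probs, second_probs, vocab, weight=0.5):
    # Assumption: the two probability sets are combined log-linearly.
    scores = [math.log(p) + weight * math.log(q)
              for p, q in zip(first_probs, second_probs)]
    best = max(range(len(vocab)), key=scores.__getitem__)
    return vocab[best]


vocab = ["_jo", "_yo", "han", "hn"]
trie = WordpieceTrie()
trie.insert(["_jo", "han"])            # custom vocabulary word "johan"

previous_token = "_jo"
acoustic_scores = [0.1, 0.1, 1.0, 1.2]  # made-up scores for the current audio portion
lm_scores = [0.0, 0.0, 0.5, 0.9]        # made-up scores given the previous token

bias = bias_vector(trie, [previous_token], vocab)
first = biased_probs(acoustic_scores, bias)   # "first probabilities"
second = biased_probs(lm_scores, bias)        # "second probabilities"
current_token = pick_token(first, second, vocab)

# The same pipeline with the bias switched off, for comparison.
no_bias = [0.0] * len(vocab)
unbiased_token = pick_token(biased_probs(acoustic_scores, no_bias),
                            biased_probs(lm_scores, no_bias), vocab)
```

With these made-up scores, the trie lookup after `_jo` marks only `han` as a valid continuation, so the biased pipeline selects `han`, whereas the unbiased one selects the acoustically similar `hn`.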