US 11,676,625 B2
Unified endpointer using multitask and multidomain learning
Shuo-Yiin Chang, Sunnyvale, CA (US); Bo Li, Fremont, CA (US); Gabor Simko, Santa Clara, CA (US); Maria Carolina Parada San Martin, Boulder, CO (US); and Sean Matthew Shannon, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 20, 2021, as Appl. No. 17/152,918.
Application 17/152,918 is a continuation of application No. 16/711,172, filed on Dec. 11, 2019, granted, now 10,929,754.
Application 16/711,172 is a continuation in part of application No. 16/001,140, filed on Jun. 6, 2018, granted, now 10,593,352, issued on Mar. 17, 2020.
Claims priority of provisional application 62/515,771, filed on Jun. 6, 2017.
Prior Publication US 2021/0142174 A1, May 13, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/16 (2006.01); G10L 25/78 (2013.01); G06N 3/08 (2023.01); G06N 20/20 (2019.01); G06N 5/046 (2023.01); G06F 18/214 (2023.01); G06N 3/045 (2023.01)
CPC G10L 25/78 (2013.01) [G06F 18/214 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06N 5/046 (2013.01); G06N 20/20 (2019.01); G10L 15/16 (2013.01)]
22 Claims
 
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:
receiving, as input to a multidomain endpointer model, a sequence of audio features representing an utterance captured by a microphone of a user device, the multidomain endpointer model comprising a shared neural network trained on:
a first training set of short-form speech utterances; and
a second training set of long-form speech utterances;
generating, as output from the multidomain endpointer model, a sequence of predicted end-of-query (EOQ) speech labels comprising a predicted EOQ speech label, a predicted EOQ initial silence label, a predicted EOQ intermediate silence label, and a predicted EOQ final silence label; and
when the predicted EOQ final silence label is output from the multidomain endpointer model, obtaining a hard microphone closing decision that causes the user device to endpoint the utterance by deactivating the microphone.
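The final two steps of claim 1 can be illustrated with a minimal sketch. It assumes the per-frame labels have already been produced by some model; the shared neural network itself is not modeled here. The names `EOQLabel`, `Microphone`, and `endpoint_utterance` are hypothetical and do not come from the patent. The sketch only shows the claimed control flow: scan the predicted label sequence and, on the first final-silence label, make the hard decision to close the microphone.

```python
from enum import Enum
from typing import Iterable, Optional


class EOQLabel(Enum):
    """The four label classes named in claim 1 (identifiers are illustrative)."""
    SPEECH = "speech"
    INITIAL_SILENCE = "initial_silence"
    INTERMEDIATE_SILENCE = "intermediate_silence"
    FINAL_SILENCE = "final_silence"


class Microphone:
    """Hypothetical stand-in for the user device's microphone."""

    def __init__(self) -> None:
        self.active = True

    def deactivate(self) -> None:
        self.active = False


def endpoint_utterance(labels: Iterable[EOQLabel], mic: Microphone) -> Optional[int]:
    """Close the microphone at the first predicted final-silence label.

    Returns the index of the frame that triggered the decision, or None
    if no final-silence label was seen (the microphone then stays open).
    """
    for frame_index, label in enumerate(labels):
        if label is EOQLabel.FINAL_SILENCE:
            mic.deactivate()  # hard microphone closing decision
            return frame_index
    return None


# Example: a pause in the middle of the utterance does not end it.
mic = Microphone()
predicted = [
    EOQLabel.INITIAL_SILENCE,
    EOQLabel.SPEECH,
    EOQLabel.INTERMEDIATE_SILENCE,
    EOQLabel.SPEECH,
    EOQLabel.FINAL_SILENCE,
]
closed_at = endpoint_utterance(predicted, mic)
```

In this example the intermediate-silence frame leaves the microphone open, and the decision is taken only at the last frame, which is the distinction the four-class labeling is meant to support.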