US 11,948,570 B2
Key phrase spotting
Wei Li, Mountain View, CA (US); Rohit Prakash Prabhavalkar, Santa Clara, CA (US); Kanury Kanishka Rao, Santa Clara, CA (US); Yanzhang He, Mountain View, CA (US); Ian C. McGraw, Menlo Park, CA (US); and Anton Bakhtin, New York, NY (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 9, 2022, as Appl. No. 17/654,195.
Application 17/654,195 is a continuation of application No. 16/527,487, filed on Jul. 31, 2019, granted, now Pat. No. 11,295,739.
Claims priority of provisional application 62/721,799, filed on Aug. 23, 2018.
Prior Publication US 2022/0199084 A1, Jun. 23, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/06 (2013.01); G10L 15/02 (2006.01); G10L 15/16 (2006.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01); G10L 19/00 (2013.01); G10L 15/08 (2006.01); G10L 15/14 (2006.01)
CPC G10L 15/22 (2013.01) [G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/18 (2013.01); G10L 19/00 (2013.01); G10L 2015/025 (2013.01); G10L 2015/088 (2013.01); G10L 15/142 (2013.01); G10L 2015/223 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving, as output from a key phrase encoder network, a key phrase encoding for a sequence of target sub-word units representing a key phrase;
for each corresponding frame of multiple frames representing an incoming audio signal:
processing, by an acoustic encoder network, the corresponding frame to generate an encoder output that represents an acoustic encoding of the corresponding frame;
generating, using an attention mechanism, a context vector for the corresponding frame based on the key phrase encoding output from the key phrase encoder network;
generating, using a prediction network that receives the context vector generated for the corresponding frame and a non-blank label previously output by a final softmax layer as input, an output vector for the corresponding frame; and
predicting, using the output vector generated for the corresponding frame and the encoder output generated for the corresponding frame, a sub-word unit; and
determining whether the incoming audio signal encodes an utterance of the key phrase based on the sub-word units predicted for the multiple frames.
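The claim describes an attention-biased, RNN-T-style streaming recognizer: a key phrase encoder summarizes the target sub-word sequence, an acoustic encoder runs frame by frame, an attention mechanism injects the key phrase encoding into the prediction network, and a joint layer with a final softmax emits sub-word units whose sequence is checked against the key phrase. The PyTorch sketch below maps one hypothetical module to each claim element; the layer sizes, the dot-product attention form, the use of the acoustic state as the attention query, and the substring detection test at the end are all illustrative assumptions, not the patent's implementation.

# Illustrative PyTorch sketch of claim 1; every name and size is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

BLANK = 0  # reserved blank label, RNN-T style (assumption)

class KeyPhraseSpotter(nn.Module):
    def __init__(self, num_subwords=256, feat_dim=80, hidden=320):
        super().__init__()
        # Key phrase encoder network: per-token encodings of the target phrase.
        self.kp_embed = nn.Embedding(num_subwords, hidden)
        self.kp_encoder = nn.LSTM(hidden, hidden, batch_first=True)
        # Acoustic encoder network: one recurrent step per incoming frame.
        self.ac_encoder = nn.LSTMCell(feat_dim, hidden)
        # Attention mechanism over the key phrase encoding.
        self.attn_query = nn.Linear(hidden, hidden)
        # Prediction network: consumes the context vector and the label
        # previously output by the final softmax layer (non-blank only).
        self.label_embed = nn.Embedding(num_subwords, hidden)
        self.prediction = nn.LSTMCell(2 * hidden, hidden)
        # Joint layer feeding the final softmax over sub-word units.
        self.joint = nn.Linear(2 * hidden, num_subwords)

    def encode_key_phrase(self, subword_ids):           # (1, T_kp) int64
        return self.kp_encoder(self.kp_embed(subword_ids))[0]  # (1, T_kp, H)

    def step(self, frame, kp_enc, state):
        ac_h, ac_c, pr_h, pr_c, prev_label = state
        # Encoder output: acoustic encoding of the corresponding frame.
        ac_h, ac_c = self.ac_encoder(frame, (ac_h, ac_c))
        # Context vector for this frame via dot-product attention over the
        # key phrase encoding (the query choice is an assumption; the claim
        # only requires the context to be based on the key phrase encoding).
        q = self.attn_query(ac_h).unsqueeze(1)          # (1, 1, H)
        w = F.softmax(torch.bmm(q, kp_enc.transpose(1, 2)), dim=-1)
        context = torch.bmm(w, kp_enc).squeeze(1)       # (1, H)
        # Prediction network input: context vector + previous non-blank label.
        pr_in = torch.cat([context, self.label_embed(prev_label)], dim=-1)
        pr_h, pr_c = self.prediction(pr_in, (pr_h, pr_c))
        # Predict a sub-word unit from the output vector and encoder output.
        label = self.joint(torch.cat([pr_h, ac_h], dim=-1)).argmax(dim=-1)
        if label.item() != BLANK:
            prev_label = label                          # keep last non-blank label
        return label, (ac_h, ac_c, pr_h, pr_c, prev_label)

def spot(model, frames, key_phrase_ids):
    kp_enc = model.encode_key_phrase(key_phrase_ids)
    z = torch.zeros(1, model.ac_encoder.hidden_size)
    state = (z.clone(), z.clone(), z.clone(), z.clone(),
             torch.zeros(1, dtype=torch.long))
    predicted = []
    for frame in frames:                                # one step per audio frame
        label, state = model.step(frame, kp_enc, state)
        if label.item() != BLANK:
            predicted.append(label.item())
    # Determine whether the audio encodes the key phrase: here, a simple
    # substring test over the predicted sub-word units (assumption).
    target = key_phrase_ids.squeeze(0).tolist()
    return any(predicted[i:i + len(target)] == target
               for i in range(len(predicted) - len(target) + 1))

model = KeyPhraseSpotter()
frames = [torch.randn(1, 80) for _ in range(50)]        # dummy audio features
key_phrase = torch.tensor([[12, 7, 33]])                # hypothetical sub-word ids
print(spot(model, frames, key_phrase))

In this sketch the decision step is a plain substring match over the non-blank predictions; a production spotter would more likely threshold a key phrase posterior, but the claim only requires that the determination be based on the sub-word units predicted for the frames.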