US 11,996,085 B2
Enhanced speech endpointing
Petar Aleksic, Jersey City, NJ (US); Glen Shires, Danville, CA (US); and Michael Buchanan, London (GB)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 8, 2020, as Appl. No. 17/115,403.
Application 17/115,403 is a continuation of application No. 15/711,260, filed on Sep. 21, 2017, granted, now 10,885,898.
Application 15/711,260 is a continuation of application No. 15/192,431, filed on Jun. 24, 2016, abandoned.
Application 15/192,431 is a continuation of application No. 14/844,563, filed on Sep. 3, 2015, granted, now 10,339,917, issued on Jul. 2, 2019.
Prior Publication US 2021/0090554 A1, Mar. 25, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/04 (2013.01); G06F 3/16 (2006.01); G10L 15/05 (2013.01); G10L 15/08 (2006.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01); G10L 25/78 (2013.01)
CPC G10L 15/05 (2013.01) [G06F 3/167 (2013.01); G10L 15/04 (2013.01); G10L 15/1815 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 25/78 (2013.01); G10L 2015/088 (2013.01); G10L 2025/783 (2013.01)] 16 Claims
OG exemplary drawing
 
1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising:
receiving audio data of an utterance spoken by a user, the audio data captured by a client device;
generating, using an automated speech recognizer (ASR), a first intermediate speech recognition result by performing speech recognition on the audio data of the utterance, the ASR configured to endpoint utterances by terminating performance of speech recognition on received audio data based on detecting non-speech for at least an end of speech (EOS) timeout duration;
while receiving the audio data of the utterance and before detecting non-speech for at least the EOS timeout duration:
determining, using the ASR, a confidence level associated with the first intermediate speech recognition result generated by the ASR, the confidence level corresponding to a confidence of an accuracy of the first intermediate speech recognition result;
determining an expected speech recognition result based on context data of the client device;
based on the confidence level associated with the first intermediate speech recognition result generated by the ASR, determining that the first intermediate speech recognition result partially matches the expected speech recognition result; and
extending the EOS timeout duration by a predetermined amount of time based on determining that the first intermediate speech recognition result partially matches the expected speech recognition result;
receiving additional audio data of the utterance spoken by the user;
generating, using the ASR, a second intermediate speech recognition result by performing speech recognition on the additional audio data of the utterance spoken by the user; and
while receiving the additional audio data and before detecting non-speech for at least the extended EOS timeout duration:
determining that the second intermediate speech recognition result matches the expected speech recognition result; and
based on determining that the second intermediate speech recognition result matches the expected speech recognition result:
terminating performance of any speech recognition subsequent to generating the second intermediate speech recognition result by truncating any additional audio data received after generating the second intermediate speech recognition result; and
deactivating a microphone of the client device.
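The claimed endpointing flow can be sketched in code. This is a minimal, hypothetical illustration, not an implementation from the patent: all names (Endpointer, expected_result, the timeout values) are invented for this sketch. It shows the two claimed decisions: while audio is still arriving, a confident partial match against a context-derived expected result extends the end-of-speech (EOS) timeout by a predetermined amount, and a full match terminates recognition and deactivates the microphone.

```python
from dataclasses import dataclass, field

@dataclass
class Endpointer:
    """Illustrative endpointer reacting to intermediate ASR results."""
    expected_result: str           # expected transcription from client context
    eos_timeout: float = 0.5       # base EOS timeout (seconds of non-speech)
    extension: float = 0.4         # predetermined extension on a partial match
    confidence_floor: float = 0.8  # only act on results at/above this confidence
    mic_open: bool = True
    recognizing: bool = True
    events: list = field(default_factory=list)

    def on_intermediate_result(self, text: str, confidence: float) -> None:
        """Handle one intermediate speech recognition result from the ASR."""
        if not self.recognizing:
            return
        if text == self.expected_result:
            # Full match: terminate recognition of any subsequent audio
            # and deactivate the client device's microphone.
            self.recognizing = False
            self.mic_open = False
            self.events.append("terminated")
        elif confidence >= self.confidence_floor and self.expected_result.startswith(text):
            # Confident partial match: the user is likely mid-utterance,
            # so extend the EOS timeout by the predetermined amount.
            self.eos_timeout += self.extension
            self.events.append("extended")

# Usage: an expected result such as "call mom" might be derived from context
# data of the client device (e.g., a contacts screen being displayed).
ep = Endpointer(expected_result="call mom")
ep.on_intermediate_result("call", confidence=0.9)       # partial match: extend
ep.on_intermediate_result("call mom", confidence=0.95)  # full match: stop
```

In a real system the timeout would drive a non-speech timer in the audio pipeline; here it is only a number, which keeps the control logic of the claim visible.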