| CPC G10L 15/05 (2013.01) [G10L 13/00 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); B65G 1/127 (2013.01); G06Q 10/087 (2013.01); G06Q 30/0185 (2013.01); G06Q 30/02 (2013.01); G10L 2015/227 (2013.01); H04M 3/4938 (2013.01); H04R 27/00 (2013.01)] | 18 Claims |

|
1. A method implemented by one or more processors, the method comprising:
processing an audio data stream capturing a spoken utterance of a user, where the audio data stream is captured via one or more microphones of a client device;
detecting a candidate endpoint in the audio data stream; and
determining whether the candidate endpoint is an actual endpoint based on:
a text representation of a portion of the spoken utterance immediately preceding the candidate endpoint, and
a user-specific measure that is based on the text representation and one or more historical interactions with the user, where each of the one or more historical interactions is based on processing a previous audio data stream capturing the user speaking a previous instance of the spoken utterance, where the previous instance of the user speaking the spoken utterance captured in the previous audio data stream is the same spoken utterance spoken by the user captured in the audio data stream, where the historical interactions are specific to the text representation and the user, and where the historical interactions each indicate whether a previous instance of the text representation was a previous endpoint for the user.
|
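
By way of non-limiting illustration, the determining step of claim 1 may be sketched as follows. The claim does not specify how the user-specific measure is computed, so this sketch assumes a smoothed fraction of previous instances in which the same text representation was an endpoint for the same user. The names `EndpointHistory`, `normalize`, and `is_actual_endpoint`, the `prior` and `prior_weight` smoothing parameters, and the `threshold` value are all hypothetical and chosen only for illustration. Audio processing, speech recognition, and candidate endpoint detection are assumed to occur upstream and are not shown.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so repeated utterances share one key."""
    return " ".join(text.lower().split())


class EndpointHistory:
    """Historical interactions, specific to a user and a text representation."""

    def __init__(self) -> None:
        # (user_id, normalized text) -> [times it was an endpoint, times seen]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, user_id: str, text: str, was_endpoint: bool) -> None:
        """Store whether a previous instance of `text` was an endpoint."""
        entry = self._counts[(user_id, normalize(text))]
        entry[0] += int(was_endpoint)
        entry[1] += 1

    def measure(self, user_id: str, text: str,
                prior: float = 0.5, prior_weight: float = 2.0) -> float:
        """User-specific measure: smoothed fraction of previous instances of
        `text` that were endpoints for this user. The smoothing is an
        assumption of this sketch, not something the claim recites."""
        hits, total = self._counts.get((user_id, normalize(text)), (0, 0))
        return (hits + prior * prior_weight) / (total + prior_weight)


def is_actual_endpoint(history: EndpointHistory, user_id: str,
                       preceding_text: str, threshold: float = 0.5) -> bool:
    """Decide whether a candidate endpoint is an actual endpoint, based on the
    text immediately preceding it and the user-specific measure."""
    return history.measure(user_id, preceding_text) >= threshold
```

For example, if a hypothetical user has three recorded interactions in which "turn on the lights" was an endpoint and three in which "turn on the" was not, the measure under these assumptions is 0.8 for the former and 0.2 for the latter, so only a candidate endpoint following "turn on the lights" is treated as an actual endpoint. With no history for a given user and text, the measure falls back to the assumed prior of 0.5.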