US 11,935,519 B2
Preserving speech hypotheses across computing devices and/or dialog sessions
Matthew Sharifi, Kilchberg (CH); and Victor Carbune, Zurich (CH)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 15, 2020, as Appl. No. 16/949,151.
Prior Publication US 2022/0122589 A1, Apr. 21, 2022
Int. Cl. G10L 15/00 (2013.01); G10L 15/14 (2006.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01)
CPC G10L 15/14 (2013.01) [G10L 15/22 (2013.01); G10L 15/26 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A method implemented by one or more processors of a computing device of a user, the method comprising:
receiving, via one or more microphones of the computing device of the user, first audio data corresponding to a first spoken utterance of the user;
processing, using a corresponding on-device automatic speech recognition (ASR) model that is stored in on-device memory of the computing device, the first audio data corresponding to the first spoken utterance to generate, for a given first part of the first spoken utterance, a plurality of first speech hypotheses based on first values generated using the corresponding on-device ASR model that is stored in the on-device memory of the computing device;
selecting, from among the plurality of first speech hypotheses, a given first speech hypothesis, the given first speech hypothesis being predicted to correspond to the given first part of the first spoken utterance based on the first values;
causing the given first speech hypothesis to be incorporated as a first portion of a transcription, the transcription being visually rendered at a user interface of the computing device of the user;
determining that the first spoken utterance is complete;
in response to determining that the first spoken utterance is complete, storing one or more first alternate speech hypotheses in the on-device memory of the computing device, the one or more first alternate speech hypotheses including a subset of the plurality of first speech hypotheses that excludes at least the given first speech hypothesis;
receiving, via one or more of the microphones of the computing device, second audio data corresponding to a second spoken utterance of the user; and
in response to receiving the second audio data:
loading one or more of the first alternate speech hypotheses from the on-device memory of the computing device;
processing, using the corresponding on-device ASR model that is stored in on-device memory of the computing device, the second audio data corresponding to the second spoken utterance to generate, for a given second part of the second spoken utterance, a plurality of second speech hypotheses based on second values generated using the corresponding on-device ASR model that is stored in the on-device memory of the computing device;
selecting, from among the plurality of second speech hypotheses, a given second speech hypothesis, the given second speech hypothesis being predicted to correspond to the given second part of the second spoken utterance based on the second values;
causing the given second speech hypothesis to be incorporated as a second portion of the transcription;
determining, based on the given second speech hypothesis that is incorporated into the transcription as the second portion of the transcription, whether to modify the first portion of the transcription; and
in response to determining to modify the first portion of the transcription:
modifying the first portion of the transcription, that was initially predicted to correspond to the given first part of the first spoken utterance, to include a given first alternate speech hypothesis, from among the one or more first alternate speech hypotheses, that is subsequently predicted to correspond to the given first part of the first spoken utterance.
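The flow recited in claim 1 can be sketched as follows: select the highest-valued hypothesis for each part of the transcription, retain the non-selected hypotheses as stored alternates, and, once a later part supplies context, re-score the earlier part and swap in an alternate if it is now the better prediction. This is a minimal illustrative sketch only; the class names, the scores, and the toy bigram table are hypothetical stand-ins, not the patent's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    text: str
    score: float  # stand-in for the ASR model's "first values" / "second values"

# Toy contextual bonuses standing in for whatever signal the later utterance
# provides when deciding whether to modify the earlier portion (hypothetical data).
BIGRAM_BONUS = {
    ("they're", "going"): 0.5,
    ("there", "going"): 0.1,
    ("their", "going"): 0.0,
}

class TranscriptionSession:
    """Sketch of the claimed flow: pick a best hypothesis per utterance part,
    keep the excluded alternates in (simulated) on-device memory, and revisit
    an earlier selection once later context arrives."""

    def __init__(self):
        self.parts = []       # selected Hypothesis per part (the transcription)
        self.alternates = []  # stored alternate hypotheses per part

    def add_part(self, hypotheses):
        # Select the hypothesis predicted to correspond to this part (highest value).
        best = max(hypotheses, key=lambda h: h.score)
        self.parts.append(best)
        # Store the subset that excludes the selected hypothesis, per the claim.
        self.alternates.append([h for h in hypotheses if h is not best])
        return best

    def maybe_revise(self, index, next_text):
        """Re-score an earlier part in light of a later part's text; swap in a
        stored alternate if it is now predicted to correspond to that part."""
        def contextual(h):
            return h.score + BIGRAM_BONUS.get((h.text, next_text), 0.0)

        current = self.parts[index]
        best_alt = max(self.alternates[index], key=contextual, default=None)
        if best_alt is not None and contextual(best_alt) > contextual(current):
            self.alternates[index].remove(best_alt)
            self.alternates[index].append(current)  # old pick becomes an alternate
            self.parts[index] = best_alt
        return self.parts[index]

    def transcription(self):
        return " ".join(h.text for h in self.parts)

session = TranscriptionSession()
# First spoken utterance: the ASR model initially favors "their".
session.add_part([Hypothesis("their", 0.6),
                  Hypothesis("they're", 0.55),
                  Hypothesis("there", 0.3)])
# Second spoken utterance supplies context that resolves the ambiguity.
second = session.add_part([Hypothesis("going", 0.9),
                           Hypothesis("growing", 0.2)])
session.maybe_revise(0, second.text)
print(session.transcription())  # -> "they're going"
```

Note the deliberate asymmetry in the sketch: alternates are only persisted after a part is finalized (mirroring "in response to determining that the first spoken utterance is complete"), and revision happens only after the second portion is incorporated, matching the order of operations in the claim.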