US 12,136,410 B2
Speaker embeddings for improved automatic speech recognition
Fadi Biadsy, Mountain View, CA (US); Dirk Ryan Padfield, Seattle, WA (US); and Victoria Zayats, Seattle, WA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on May 3, 2022, as Appl. No. 17/661,832.
Prior Publication US 2023/0360632 A1, Nov. 9, 2023
Int. Cl. G10L 13/08 (2013.01); G10L 13/04 (2013.01); G10L 15/06 (2013.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01); G10L 25/18 (2013.01)
CPC G10L 13/08 (2013.01) [G10L 13/04 (2013.01); G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 25/18 (2013.01)]
20 Claims
1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising:
receiving a reference audio signal corresponding to reference speech spoken by a target speaker with atypical speech;
generating, by a speaker embedding network configured to receive the reference audio signal as input, a speaker embedding for the target speaker, the speaker embedding conveying speaker characteristics of the target speaker;
determining, using the speaker embedding network, a personalization embedding for the target speaker based on the speaker embedding, the personalization embedding corresponding to a respective style cluster of speaker embeddings extracted from training utterances spoken by speakers that convey speaker characteristics similar to the speaker characteristics conveyed by the speaker embedding;
receiving a speech conversion request comprising input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech; and
biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, a speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker,
wherein biasing the speech conversion model comprises using the personalization embedding determined for the target speaker to bias the speech conversion model for a type of the atypical speech associated with the target speaker.
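
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not specify how a style cluster is selected or how the speech conversion model is biased, so the sketch assumes two things: the personalization embedding is the centroid of the style cluster closest to the speaker embedding by cosine similarity, and biasing is done by appending that embedding to each input feature frame. The names `cosine_similarity`, `select_personalization_embedding`, and `bias_input_frames`, the cluster labels, and all numeric values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_personalization_embedding(speaker_embedding, style_clusters):
    """Return (label, centroid) of the style cluster most similar to the speaker embedding.

    Assumption: each style cluster is summarized by a centroid vector and the
    closest centroid serves as the personalization embedding.
    """
    if not style_clusters:
        raise ValueError("style_clusters must not be empty")
    return max(
        style_clusters.items(),
        key=lambda item: cosine_similarity(speaker_embedding, item[1]),
    )


def bias_input_frames(frames, personalization_embedding):
    """Append the personalization embedding to every input feature frame.

    Assumption: concatenation is only one possible conditioning mechanism;
    the claim does not say how the model is biased.
    """
    return [list(frame) + list(personalization_embedding) for frame in frames]


# Hypothetical centroids of two style clusters learned from training utterances.
style_clusters = {"cluster_a": [1.0, 0.0], "cluster_b": [0.0, 1.0]}

# Hypothetical speaker embedding produced from the reference audio signal.
speaker_embedding = [0.9, 0.2]

label, personalization = select_personalization_embedding(
    speaker_embedding, style_clusters
)

# Hypothetical feature frames of the utterance to be converted.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
biased_frames = bias_input_frames(frames, personalization)
```

In this sketch the biased frames would then be passed to a speech conversion model, which is omitted here because the claim does not describe its architecture.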