US 12,087,268 B1
Identity transfer models for generating audio/video content
Wenbin Ouyang, Redmond, WA (US); and Naveen Sudhakaran Nair, Issaquah, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Dec. 3, 2021, as Appl. No. 17/541,996.
Int. Cl. G10L 13/02 (2013.01); G06N 3/08 (2023.01); G10L 17/18 (2013.01); G10L 21/013 (2013.01); G10L 21/10 (2013.01)

CPC G10L 13/02 (2013.01) [G06N 3/08 (2013.01); G10L 17/18 (2013.01); G10L 21/013 (2013.01); G10L 21/10 (2013.01)]

16 Claims

1. A computer-implemented method, comprising:

obtaining a plurality of audio samples of songs associated with respective identities;

determining, based on the plurality of audio samples of songs and using a first encoder, multiple voice identity embeddings for the respective identities, including a first voice identity embedding associated with the first identity;

recording a first audio data sample from a user;

determining a second voice identity embedding for the user based on the first audio data;

storing, in an identity store, the multiple voice identity embeddings associated with the respective identities, including the first voice embedding for the first identity;

obtaining a request, from the user, to generate synthesized song audio, wherein the request specifies:

audio song content associated with a second identity; and

the first identity, wherein the first identity corresponds to the user, and the first voice identity embedding in the identity store is based on the first audio data;

determining a spectrogram of the audio song content;

determining, based on the spectrogram and using a second encoder, a content embedding associated with the audio song content; and

generating, using a second decoder corresponding to the second encoder and based on the first voice identity embedding and the content embedding, the synthesized song audio.
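
The data flow recited in claim 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the claim does not disclose model architectures, so `identity_encoder`, `content_encoder`, and `decoder` are hypothetical toy stand-ins for the first encoder, the second encoder, and the decoder, and `tone` is a hypothetical helper that replaces recorded audio with a synthetic sine wave. Only the order of steps (embed identities, store them, compute a spectrogram, embed the content, decode with the requested identity) follows the claim.

```python
import cmath
import math


def spectrogram(samples, frame_len=8, hop=4):
    """Magnitude spectrogram: naive DFT over Hann-windowed frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        windowed = [
            samples[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len))
            for n in range(frame_len)
        ]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(windowed)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames


def identity_encoder(samples, dim=4):
    """Toy stand-in for the first encoder (assumption, not the patented model):
    mean absolute amplitude over `dim` equal chunks of the signal."""
    chunk = len(samples) // dim
    return [
        sum(abs(x) for x in samples[i * chunk:(i + 1) * chunk]) / chunk
        for i in range(dim)
    ]


def content_encoder(spec):
    """Toy stand-in for the second encoder (assumption): normalize each
    frame to unit sum, which discards overall loudness."""
    out = []
    for frame in spec:
        total = sum(frame) or 1.0
        out.append([v / total for v in frame])
    return out


def decoder(identity_embedding, content_embedding):
    """Toy stand-in for the decoder (assumption): scale every content
    frame by the mean of the identity embedding."""
    gain = sum(identity_embedding) / len(identity_embedding)
    return [[gain * v for v in frame] for frame in content_embedding]


def tone(freq, n=64, rate=64.0, amp=1.0):
    """Hypothetical helper: a sine wave standing in for recorded audio."""
    return [amp * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


# Identity store: one voice identity embedding per identity.
identity_store = {
    "singer": identity_encoder(tone(8, amp=1.0)),  # from song samples
    "user": identity_encoder(tone(4, amp=0.5)),    # from the user's recording
}

# Request: song content from the second identity, rendered as the first identity.
request = {"content": tone(8), "identity": "user"}
content = content_encoder(spectrogram(request["content"]))
synthesized = decoder(identity_store[request["identity"]], content)
```

In this sketch `synthesized` is a list of spectrogram-like frames rather than a waveform. A real system would presumably use trained neural encoders and a decoder followed by a vocoder, none of which is specified in the claim text above.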