US 11,875,822 B1
Performance characteristic transfer for localized content
Rohun Tripathi, Seattle, WA (US); Angshuman Saha, Cupertino, CA (US); and Naveen Sudhakaran Nair, Issaquah, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on May 19, 2022, as Appl. No. 17/748,990.
Int. Cl. G11B 27/031 (2006.01); G06V 20/40 (2022.01); G06V 10/82 (2022.01); G06V 10/774 (2022.01); G10L 25/57 (2013.01); G10L 17/02 (2013.01); G06F 40/58 (2020.01); G10L 17/18 (2013.01); G10L 17/04 (2013.01); G10L 21/013 (2013.01); G10L 17/06 (2013.01)
CPC G11B 27/031 (2013.01) [G06F 40/58 (2020.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/06 (2013.01); G10L 17/18 (2013.01); G10L 21/013 (2013.01); G10L 25/57 (2013.01); G10L 2021/0135 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method comprising:
receiving first audio data associated with a first performance in a first language;
dividing the first audio data into a plurality of audio clips;
determining, using a processor, a voice identity score for an audio clip of the plurality of audio clips by using an autoencoder configured to minimize a reconstruction loss between a first output of the autoencoder and a randomly selected audio clip of the plurality of audio clips;
determining a voice identity embedding for the first performance based on the voice identity score;
generating second audio data by translating the first audio data from the first language into a second language, wherein translating the first audio data comprises using a machine learning model that implements the voice identity embedding of the autoencoder to preserve the voice identity;
determining a performance characteristic score of the second audio data by using a twin neural network trained using contrastive learning based on a first training set and a second training set, wherein:
  the first training set comprises third audio data including positive examples; and
  the second training set comprises fourth audio data including negative examples, the negative examples identified based on the fourth audio data being temporally misaligned or the fourth audio data having an altered performance characteristic;
determining, using the processor, a loss based on a second output of the machine learning model and the performance characteristic score; and
iteratively re-generating the second audio data, using the machine learning model, based on the loss.
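
The voice identity step of claim 1 trains an autoencoder to reconstruct a different, randomly selected clip from the same performance, so minimizing the reconstruction loss pushes the bottleneck toward what the clips share (the speaker's identity) rather than clip-specific content. The following is a minimal PyTorch sketch of that training setup, not the patented implementation; VoiceAE, its layer sizes, the mel-feature shapes, and the pooled reconstruction target are all illustrative assumptions.

    import random
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VoiceAE(nn.Module):
        # Encoder/decoder sizes are illustrative assumptions.
        def __init__(self, feat: int = 64, dim: int = 128):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(feat, 256), nn.ReLU(),
                                     nn.Linear(256, dim))
            self.dec = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat))

        def forward(self, clip):
            z = self.enc(clip.mean(dim=-1))   # pool frames -> identity code
            return self.dec(z), z

    # Ten mel-feature clips (64 bins x 200 frames) from one performance.
    clips = [torch.randn(64, 200) for _ in range(10)]
    model = VoiceAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(200):
        src = random.choice(clips)            # clip fed to the encoder
        tgt = random.choice(clips)            # randomly selected target clip
        recon, z = model(src.unsqueeze(0))
        # Reconstruction loss against the randomly selected clip's features.
        loss = F.mse_loss(recon, tgt.mean(dim=-1).unsqueeze(0))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The per-clip reconstruction loss can serve as a voice identity score,
    # and the mean bottleneck code as the performance's identity embedding.
    with torch.no_grad():
        scores = [F.mse_loss(model(c.unsqueeze(0))[0],
                             c.mean(dim=-1).unsqueeze(0)).item() for c in clips]
        embedding = torch.stack([model(c.unsqueeze(0))[1]
                                 for c in clips]).mean(dim=0)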
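
The performance characteristic scorer is a twin (Siamese) neural network trained with contrastive learning, where positive pairs are aligned clips and negative pairs are manufactured by temporally misaligning the audio or altering a performance characteristic. A hedged sketch under those assumptions follows; PerfEncoder, contrastive_loss, and make_negative are hypothetical names, and the misalignment here is a simple circular shift (an altered-characteristic negative, such as a pitch shift, could be substituted the same way).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PerfEncoder(nn.Module):
        """One tower of the twin network; both inputs share these weights."""
        def __init__(self, n_mels: int = 64, dim: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),        # pool over time
            )
            self.proj = nn.Linear(128, dim)

        def forward(self, x):
            # x: (batch, n_mels, frames) -> (batch, dim), L2-normalized
            h = self.net(x).squeeze(-1)
            return F.normalize(self.proj(h), dim=-1)

    def contrastive_loss(za, zb, label, margin: float = 0.5):
        """Pairwise contrastive loss: label 1 = positive pair, 0 = negative."""
        d = (za - zb).norm(dim=-1)
        return (label * d.pow(2)
                + (1 - label) * (margin - d).clamp(min=0).pow(2)).mean()

    def make_negative(clip, shift_frames: int = 40):
        """Negative example via temporal misalignment (circular shift)."""
        return torch.roll(clip, shifts=shift_frames, dims=-1)

    encoder = PerfEncoder()
    anchor = torch.randn(8, 64, 200)                # batch of mel clips
    pos = anchor + 0.01 * torch.randn_like(anchor)  # aligned positive examples
    neg = make_negative(anchor)                     # misaligned negative examples
    loss = (contrastive_loss(encoder(anchor), encoder(pos), torch.ones(8))
            + contrastive_loss(encoder(anchor), encoder(neg), torch.zeros(8)))
    loss.backward()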
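
The final two steps close the loop: a loss derived from the performance characteristic score feeds back into the translation model, which re-generates the dubbed audio. One plausible shape for that outer loop, with every callable hypothetical and toy stand-ins so the sketch runs end to end:

    import torch

    def localize(first_audio, translator, perf_scorer, steps: int = 5):
        """Re-generate the dub, conditioning each pass on the latest loss."""
        second_audio = translator(first_audio, feedback=None)      # initial dub
        for _ in range(steps):
            score = perf_scorer(first_audio, second_audio)         # twin-network score
            loss = 1.0 - score                                     # lower is better
            second_audio = translator(first_audio, feedback=loss)  # re-generate
        return second_audio

    # Toy stand-ins for the translator and scorer.
    dub = localize(torch.randn(16000),
                   translator=lambda x, feedback=None: x.flip(0),
                   perf_scorer=lambda a, b: float(torch.cosine_similarity(
                       a.unsqueeze(0), b.unsqueeze(0)).clamp(0, 1)))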