US 12,131,745 B1
System and method for automatic alignment of phonetic content for real-time accent conversion
Lukas Pfeifenberger, Salzburg (AT); and Shawn Zhang, Palo Alto, CA (US)
Assigned to SANAS.AI INC., Palo Alto, CA (US)
Filed by Sanas.ai Inc., Palo Alto, CA (US)
Filed on Jun. 26, 2024, as Appl. No. 18/754,280.
Claims priority of provisional application 63/510,487, filed on Jun. 27, 2023.
Int. Cl. G10L 21/007 (2013.01); G06F 3/16 (2006.01); G10L 13/00 (2006.01); G10L 13/033 (2013.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/26 (2006.01); G10L 21/003 (2013.01); G10L 21/01 (2013.01); G10L 21/013 (2013.01)
CPC G10L 21/007 (2013.01) [G06F 3/162 (2013.01); G10L 13/00 (2013.01); G10L 13/033 (2013.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 15/26 (2013.01); G10L 21/003 (2013.01); G10L 21/013 (2013.01); G10L 21/01 (2013.01); G10L 2021/0135 (2013.01)]
20 Claims
 
1. An accent conversion system, comprising an audio interface, memory having instructions stored thereon, and one or more processors coupled to the memory and configured to execute the instructions to:
obtain input audio data via the audio interface;
generate from the input audio data first phonetic embedding vectors for phonetic content representing a source accent;
apply a trained accent conversion neural network to the first phonetic embedding vectors to generate second phonetic embedding vectors corresponding to first phonetic characteristics of speech data in a target accent;
determine a differentiable alignment by jointly maximizing a cosine distance between the first phonetic embedding vectors and the second phonetic embedding vectors; and
align the speech data to the phonetic content based on the differentiable alignment to generate and provide output audio data corresponding to the aligned speech data and representing the target accent.
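
The alignment steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the claim does not say how the differentiable alignment is computed, so the sketch assumes a common soft-attention formulation in which each source frame receives a softmax-weighted distribution over target frames, scored by cosine similarity. The function names, the temperature parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-9)  # epsilon guards against zero vectors


def softmax(scores, temperature):
    """Numerically stable softmax; a lower temperature gives a sharper alignment."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_alignment(source_frames, target_frames, temperature=0.1):
    """Row i is a probability distribution over target frames for source frame i.

    Assumption: cosine similarity is used as the alignment score. Every step
    is smooth, so gradients could flow through it in an autodiff framework.
    """
    return [
        softmax([cosine_similarity(s, t) for t in target_frames], temperature)
        for s in source_frames
    ]


def align(target_frames, weights):
    """Re-time target frames onto the source timeline by weighted averaging."""
    dim = len(target_frames[0])
    return [
        [sum(w * t[d] for w, t in zip(row, target_frames)) for d in range(dim)]
        for row in weights
    ]


# Hypothetical toy data: 2 source-accent frames, 3 target-accent frames.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.0, 2.0], [3.0, 0.0], [0.1, 1.0]]

weights = soft_alignment(source, target)
aligned = align(target, weights)  # one aligned target frame per source frame
```

In this toy case the first source frame places most of its weight on the second target frame, and the second source frame on the first, because those pairs point in the same direction. A real system would operate on learned phonetic embeddings and would pass the aligned frames to a synthesis stage, neither of which is shown here.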