CPC G10L 21/007 (2013.01) [G06F 3/162 (2013.01); G10L 13/00 (2013.01); G10L 13/033 (2013.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 15/26 (2013.01); G10L 21/003 (2013.01); G10L 21/013 (2013.01); G10L 21/01 (2013.01); G10L 2021/0135 (2013.01)] | 20 Claims |
1. An accent conversion system, comprising an audio interface, memory having instructions stored thereon, and one or more processors coupled to the memory and configured to execute the instructions to:
obtain input audio data via the audio interface;
generate from the input audio data first phonetic embedding vectors for phonetic content representing a source accent;
apply a trained accent conversion neural network to the first phonetic embedding vectors to generate second phonetic embedding vectors corresponding to first phonetic characteristics of speech data in a target accent;
determine a differentiable alignment by jointly maximizing a cosine distance between the first phonetic embedding vectors and the second phonetic embedding vectors; and
align the speech data to the phonetic content based on the differentiable alignment to generate and provide output audio data corresponding to the aligned speech data and representing the target accent.
|
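The alignment steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not say how the differentiable alignment is computed, so the sketch substitutes one common technique: a temperature-scaled softmax over pairwise cosine similarities, which yields a soft (and therefore differentiable) weighting of target-accent frames for each source position. The function names, the `temperature` parameter, and the toy vectors are all hypothetical and are not taken from the patent. The trained accent conversion neural network is not modeled; the second phonetic embedding vectors are simply given as input.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; a lower temperature gives a sharper distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_alignment(first_vectors, second_vectors, temperature=0.1):
    """Assumed stand-in for the claimed differentiable alignment.

    Row i holds non-negative weights, summing to 1, over the second
    (target-accent) embedding vectors for first (source-accent) vector i.
    """
    return [
        softmax([cosine_similarity(s, t) for t in second_vectors], temperature)
        for s in first_vectors
    ]


def align_frames(alignment, speech_frames):
    """Blend target-accent speech frames into one frame per source position."""
    dim = len(speech_frames[0])
    return [
        [sum(w * frame[d] for w, frame in zip(row, speech_frames)) for d in range(dim)]
        for row in alignment
    ]


# Toy data (hypothetical): two source-accent vectors, three target-accent
# vectors, and one scalar speech feature per target frame.
first = [[1.0, 0.0], [0.0, 1.0]]
second = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
frames = [[1.0], [2.0], [3.0]]

weights = soft_alignment(first, second)
aligned = align_frames(weights, frames)
```

Because every operation here is smooth in its inputs, the same computation written in an automatic-differentiation framework would let gradients flow through the alignment weights, which is the property the word "differentiable" points to. Whether the claimed system uses a softmax, a temperature, or a similarity rather than a distance objective is not stated in the claim.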