US 11,875,775 B2
Voice conversion system and training method therefor
Huapeng Sima, Nanjing (CN); Zhiqiang Mao, Nanjing (CN); and Xuefei Gong, Nanjing (CN)
Assigned to Nanjing Silicon Intelligence Technology Co., Ltd., Nanjing (CN)
Appl. No. 17/430,793
Filed by NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD., Nanjing (CN)
PCT Filed Apr. 20, 2021, PCT No. PCT/CN2021/088507 § 371(c)(1), (2) Date Aug. 13, 2021, PCT Pub. No. WO2022/083083, PCT Pub. Date Apr. 28, 2022.
Claims priority of application No. 202011129857.5 (CN), filed on Oct. 21, 2020.
Prior Publication US 2022/0310063 A1, Sep. 29, 2022
Int. Cl. G10L 25/24 (2013.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01)

CPC G10L 15/063 (2013.01) [G10L 15/16 (2013.01); G10L 25/24 (2013.01)]

19 Claims

1. A voice conversion system, comprising:

a speaker-independent automatic speech recognition model, comprising at least a bottleneck layer, configured to: convert a mel-scale frequency cepstral coefficients feature of an inputted source speech into a bottleneck feature of the source speech through the bottleneck layer, and output the bottleneck feature of the source speech to an Attention voice conversion network through the bottleneck layer;

where a training method for the speaker-independent automatic speech recognition model comprises:

inputting a number of a character encoding to which a word in a multi-speaker speech recognition training corpus is converted, together with a mel-scale frequency cepstral coefficients feature of the multi-speaker speech recognition training corpus, to the speaker-independent automatic speech recognition model; executing a backward propagation algorithm; and performing iterative optimization until the speaker-independent automatic speech recognition model is converged;

the Attention voice conversion network configured to convert the bottleneck feature of the source speech into a mel-scale frequency cepstral coefficients feature in conformity with a target speech; and

a neural network vocoder configured to convert the mel-scale frequency cepstral coefficients feature in conformity with the target speech into audio and output the audio.
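
The data flow recited in claim 1 (MFCC feature of the source speech, then bottleneck feature, then MFCC feature in conformity with the target speech, then audio) can be illustrated with a minimal Python sketch. This is not the patented implementation. Every class name, the feature sizes (`N_MFCC`, `N_BOTTLENECK`, `HOP`), the random untrained weights, and the single-head dot-product self-attention are assumptions made only for illustration; the claim states just that the conversion network is an Attention network. The vocoder here is a non-neural placeholder that only shows the feature-to-waveform interface.

```python
import math
import random
from typing import List

Frame = List[float]      # one feature vector for one time step
FrameSeq = List[Frame]   # time-ordered list of frames

N_MFCC = 13        # assumed MFCC dimension (not stated in the claim)
N_BOTTLENECK = 8   # assumed bottleneck width (not stated in the claim)
HOP = 4            # assumed samples emitted per frame by the toy vocoder


def make_matrix(rows: int, cols: int, rng: random.Random) -> List[Frame]:
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(frame: Frame, weights: List[Frame]) -> Frame:
    """Multiply one frame by a weight matrix (one row per output value)."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def softmax(scores: Frame) -> Frame:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToySiAsr:
    """Hypothetical stand-in for the speaker-independent ASR model.
    Only the path from the MFCC input to the bottleneck layer is modelled."""

    def __init__(self, rng: random.Random) -> None:
        self.w = make_matrix(N_BOTTLENECK, N_MFCC, rng)

    def bottleneck(self, mfcc: FrameSeq) -> FrameSeq:
        return [[math.tanh(v) for v in linear(f, self.w)] for f in mfcc]


class ToyAttentionConverter:
    """Hypothetical conversion network: single-head scaled dot-product
    self-attention over bottleneck frames, then a projection to MFCC size."""

    def __init__(self, rng: random.Random) -> None:
        self.w_out = make_matrix(N_MFCC, N_BOTTLENECK, rng)

    def convert(self, bn: FrameSeq) -> FrameSeq:
        scale = math.sqrt(N_BOTTLENECK)
        out = []
        for query in bn:
            scores = [sum(q * k for q, k in zip(query, key)) / scale for key in bn]
            weights = softmax(scores)
            # Attention-weighted average of all bottleneck frames.
            context = [sum(w * f[i] for w, f in zip(weights, bn))
                       for i in range(N_BOTTLENECK)]
            out.append(linear(context, self.w_out))
        return out


class ToyVocoder:
    """Placeholder for the neural network vocoder (this one is not neural):
    each frame becomes HOP constant samples in the open interval (-1, 1)."""

    def synthesize(self, mfcc: FrameSeq) -> Frame:
        audio: Frame = []
        for frame in mfcc:
            audio.extend([math.tanh(sum(frame))] * HOP)
        return audio


def convert_voice(source_mfcc: FrameSeq, asr: ToySiAsr,
                  converter: ToyAttentionConverter, vocoder: ToyVocoder) -> Frame:
    bn = asr.bottleneck(source_mfcc)       # MFCC -> bottleneck feature
    target_mfcc = converter.convert(bn)    # bottleneck -> target-style MFCC
    return vocoder.synthesize(target_mfcc) # target-style MFCC -> audio samples


rng = random.Random(0)
source_mfcc = [[rng.gauss(0.0, 1.0) for _ in range(N_MFCC)] for _ in range(5)]
asr, converter, vocoder = ToySiAsr(rng), ToyAttentionConverter(rng), ToyVocoder()
audio = convert_voice(source_mfcc, asr, converter, vocoder)
```

Because the weights are untrained, the resulting samples carry no speech content; the sketch only shows how the three claimed components hand features to one another and how the frame count is preserved up to the vocoder.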