| CPC G10L 21/007 (2013.01) [G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/1807 (2013.01); G10L 2015/0631 (2013.01)] | 7 Claims |

1. A speech conversion method comprising:
acquiring a source speech to be converted and a target speech sample of a target speaker;
recognizing a style category of the target speech sample by an audio feature encoding module, and extracting a target audio feature from the target speech sample according to the style category of the target speech sample, wherein the target audio feature includes a textual feature, a prosodic feature and a timbre feature of the target speech sample;
extracting a source audio feature from the source speech by the audio feature encoding module, wherein the source audio feature includes a textual feature, a prosodic feature and a timbre feature of the source speech;
acquiring a first style feature of the target speech sample by a style feature encoding module and determining a second style feature of the target speech sample according to the first style feature, wherein the first style feature is used to indicate a static voice characteristic of the target speech sample, and the second style feature is used to indicate predicted values for a feature bias amount and a gain amount of the first style feature within a preset duration;
fusing and mapping the source audio feature of the source speech, the target audio feature of the target speech sample, and the second style feature of the target speech sample to obtain a joint encoding feature; and
decoding the joint encoding feature, on which a standard streaming operation is performed, to obtain a target speech feature corresponding to a speaking style of the target speaker, and converting the source speech based on the target speech feature to obtain a target speech,
wherein, before the extracting of the source audio feature from the source speech by the audio feature encoding module, the method further comprises:
training a first clustering model by using first training samples, wherein the first training samples include speech samples of a plurality of speakers, and the speech samples of the plurality of speakers correspond to different style types, and wherein the first clustering model is configured for clustering the first training samples and determining, according to a result of the clustering, category labels corresponding to the first training samples;
training a second clustering model by using second training samples, wherein the second training samples include speech samples of a plurality of speakers, and the speech samples of the plurality of speakers correspond to different style types, and wherein the second clustering model is configured for clustering the second training samples and determining, according to a result of the clustering, category labels corresponding to the second training samples, the first clustering model and the second clustering model using feature extractors of different structures to perform clustering along different dimensions;
inputting third training samples into the trained first clustering model, the trained second clustering model, and an initial audio feature encoding module, wherein the third training samples include speech samples of a plurality of speakers; and
training the initial audio feature encoding module to convergence according to a loss function for the initial audio feature encoding module, the loss function being computed from actual category labels output from the first clustering model and the second clustering model and predicted category labels output from the initial audio feature encoding module, to obtain the audio feature encoding module, wherein the audio feature encoding module is configured for performing audio feature extraction based on the style type of speech.
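
The conversion flow recited in claim 1 can be illustrated with a minimal sketch, assuming PyTorch-style modules. The module names (AudioFeatureEncoder, StyleEncoder), the 80-bin mel input, and all dimensions are illustrative assumptions rather than elements of the claim; the sketch only shows style-conditioned audio feature extraction, the static style feature with predicted bias/gain, fusion into a joint encoding feature, and decoding into a target speech feature.

```python
# Illustrative sketch of the claim-1 conversion flow; all module names,
# feature shapes, and dimensions are assumptions, not elements of the claims.
import torch
import torch.nn as nn


class AudioFeatureEncoder(nn.Module):
    """Recognizes a style category and extracts textual, prosodic and timbre features."""

    def __init__(self, dim=256, num_styles=8):
        super().__init__()
        self.backbone = nn.GRU(80, dim, batch_first=True)      # 80-bin mel input assumed
        self.style_head = nn.Linear(dim, num_styles)            # style-category classifier
        self.feature_heads = nn.ModuleDict(
            {name: nn.Linear(dim, dim) for name in ("textual", "prosodic", "timbre")}
        )

    def forward(self, mel):
        hidden, _ = self.backbone(mel)                           # (B, T, dim)
        style_logits = self.style_head(hidden.mean(dim=1))       # style category of the sample
        feats = {name: head(hidden) for name, head in self.feature_heads.items()}
        return style_logits, feats


class StyleEncoder(nn.Module):
    """Encodes the static (first) style feature and predicts its bias/gain (second) style feature."""

    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.GRU(80, dim, batch_first=True)
        self.bias_gain = nn.Linear(dim, 2 * dim)                 # predicted bias and gain amounts

    def forward(self, mel):
        hidden, _ = self.backbone(mel)
        first = hidden.mean(dim=1)                               # static voice characteristic
        bias, gain = self.bias_gain(first).chunk(2, dim=-1)      # second style feature
        return first, bias, gain


def convert(source_mel, target_mel, audio_enc, style_enc, joint_proj, decoder):
    """Fuse source/target audio features with the second style feature and decode."""
    _, src_feats = audio_enc(source_mel)
    _, tgt_feats = audio_enc(target_mel)
    _, bias, gain = style_enc(target_mel)

    timbre = tgt_feats["timbre"].mean(dim=1, keepdim=True)       # summarize target timbre
    fused = torch.cat(
        [src_feats["textual"], src_feats["prosodic"], timbre.expand_as(src_feats["textual"])],
        dim=-1,
    )
    joint = joint_proj(fused) * gain.unsqueeze(1) + bias.unsqueeze(1)   # joint encoding feature
    return decoder(joint)                                        # target speech feature (e.g. mel)
```

Here `joint_proj` would be, for example, `nn.Linear(3 * 256, 256)`, and `decoder` any sequence model producing acoustic frames for a vocoder; the standard streaming operation recited in the claim is abstracted away in this sketch.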
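
The clustering-based training of the audio feature encoding module can likewise be sketched under stated assumptions: scikit-learn KMeans stands in for both clustering models, two hand-written feature extractors of different structures supply the different clustering dimensions, and a two-head classifier with a cross-entropy loss stands in for the loss function; none of these choices are fixed by the claims.

```python
# Illustrative sketch of the clustering-supervised training in claim 1; KMeans,
# the two feature extractors, and all hyperparameters are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


def extractor_a(waveforms):
    # First feature-extractor structure: magnitude-spectrum statistics
    # (assumes fixed-length waveforms).
    return np.stack([np.abs(np.fft.rfft(w))[:128] for w in waveforms])


def extractor_b(waveforms):
    # Second feature-extractor structure: frame energy mean/std
    # (assumes length divisible by the 160-sample frame).
    frames = [w.reshape(-1, 160) for w in waveforms]
    return np.stack([np.concatenate([f.mean(axis=0), f.std(axis=0)]) for f in frames])


def make_pseudo_labels(waveforms, n_styles=8):
    """Train the two clustering models and return their category labels."""
    labels_a = KMeans(n_clusters=n_styles, n_init=10).fit(extractor_a(waveforms)).labels_
    labels_b = KMeans(n_clusters=n_styles, n_init=10).fit(extractor_b(waveforms)).labels_
    return labels_a, labels_b


class InitialAudioFeatureEncoder(nn.Module):
    """Predicts a category label against each clustering model's label set."""

    def __init__(self, in_dim=128, hidden=256, n_styles=8):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, n_styles)
        self.head_b = nn.Linear(hidden, n_styles)

    def forward(self, x):
        h = self.body(x)
        return self.head_a(h), self.head_b(h)


def train_encoder(waveforms, labels_a, labels_b, epochs=100):
    """Train the initial audio feature encoding module against both label sets."""
    encoder = InitialAudioFeatureEncoder()
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    x = torch.tensor(extractor_a(waveforms), dtype=torch.float32)
    ya = torch.tensor(np.asarray(labels_a), dtype=torch.long)
    yb = torch.tensor(np.asarray(labels_b), dtype=torch.long)
    for _ in range(epochs):
        logits_a, logits_b = encoder(x)
        loss = ce(logits_a, ya) + ce(logits_b, yb)   # actual vs. predicted category labels
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder
```

In a fuller implementation the third training samples would be raw speech rather than the hand-crafted features used here, and convergence would be judged on the loss rather than a fixed epoch count.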