| CPC G10L 13/047 (2013.01) [G10L 15/16 (2013.01)] | 16 Claims |

|
1. A speech synthesis method, the speech synthesis method comprising:
acquiring a second set of speech data and a target text;
acquiring, using a text-to-speech synthesis model trained based on text data corresponding to a first set of speech data and at least a portion of the first set of speech data, a first set of information, wherein the first set of information includes a first set of embedding information comprising the second set of speech data;
acquiring, using the text-to-speech synthesis model, a second set of information, wherein:
the second set of information includes a second set of embedding information, comprising embeddings of the second set of speech data,
the second set of embedding information is acquired by deploying an attention mechanism using query components generated based on a sequence of the target text, and
acquiring the second set of information comprises:
encoding the target text, and
extracting the query components from the encoded target text, wherein:
the query components are generated based on a sequence of the encoded target text, and
the sequence is generated based on the first set of information and the encoded target text;
acquiring audio data, using the text-to-speech synthesis model, wherein the audio data:
corresponds to the target text, and
reflects characteristics of speech of a speaker of the second set of speech data, as a sound spectrum visualization generated based on the first set of information and the second set of information; and
deriving, using the text-to-speech synthesis model, a speech recording corresponding to the audio data.
|
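The attention step recited in claim 1 can be illustrated with a minimal sketch. The claim does not name a particular attention variant, so scaled dot-product attention is assumed here. It is also an assumption that the "sequence" is formed by concatenating the first set of embedding information onto each step of the encoded target text. All names (`build_sequence`, `attend`, the toy vectors) are hypothetical and do not come from the patent.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def build_sequence(encoded_text, first_embedding):
    # Assumption: the sequence is the encoded target text with the
    # first set of embedding information appended to every step.
    return [step + first_embedding for step in encoded_text]

def attend(queries, keys, values):
    # Assumed scaled dot-product attention: each query yields a
    # weighted average of the value vectors.
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy inputs.
encoded_text = [[1.0, 0.0], [0.0, 1.0]]            # encoded target text, 2 steps
first_embedding = [0.5]                            # first set of embedding information
speech_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # derived from second set of speech data
speech_values = [[0.5, 0.1], [0.3, 0.9]]           # embeddings of second set of speech data

# Query components come from the sequence built on the encoded text.
queries = build_sequence(encoded_text, first_embedding)
second_embedding = attend(queries, speech_keys, speech_values)
```

In this sketch `second_embedding` has one vector per step of the target text, so its length follows the text rather than the reference speech. That is one plausible reading of how the second set of embedding information differs from the first.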