| CPC G10L 13/08 (2013.01) [G10L 25/30 (2013.01)] | 20 Claims |

|
1. A method, comprising:
obtaining target text, wherein a phoneme of the target text comprises a first phoneme and a second phoneme that is adjacent to the first phoneme;
performing feature extraction on the first phoneme and the second phoneme to obtain a first audio feature of the first phoneme and a second audio feature of the second phoneme;
obtaining, by using a target recurrent neural network (RNN) and based on the first audio feature, first speech data corresponding to the first phoneme, and obtaining, by using the target RNN and based on the second audio feature, second speech data corresponding to the second phoneme, wherein the first speech data and the second speech data are concurrently obtained; and
obtaining, by using a vocoder and based on the first speech data and the second speech data, audio corresponding to the first phoneme and audio corresponding to the second phoneme.
|
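
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the phoneme inventory `PHONEME_IDS`, the functions `extract_feature`, `rnn_decode`, `vocode`, and `synthesize`, and all numeric constants are hypothetical names and values chosen for illustration. The recurrent cell uses fixed weights rather than a trained network, the vocoder is a toy sine renderer, and concurrency is shown with a thread pool, whereas a practical system might instead batch both phonemes on an accelerator.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy phoneme inventory (assumption, not from the claim).
PHONEME_IDS = {"HH": 0, "AY": 1, "AH": 2, "L": 3, "OW": 4}


def extract_feature(phoneme, dim=4):
    """Stand-in for feature extraction: map a phoneme to a fixed-size vector."""
    idx = PHONEME_IDS[phoneme]
    return [math.sin((idx + 1) * (k + 1)) for k in range(dim)]


def rnn_decode(feature, steps=3):
    """Toy recurrent cell with fixed weights: h_t = tanh(0.5 * h_{t-1} + x).

    Returns one frame of "speech data" per step.
    """
    hidden = [0.0] * len(feature)
    frames = []
    for _ in range(steps):
        hidden = [math.tanh(0.5 * h + x) for h, x in zip(hidden, feature)]
        frames.append(list(hidden))
    return frames


def vocode(frames, samples_per_frame=8):
    """Toy vocoder: render each frame as one sine cycle scaled by the frame mean."""
    audio = []
    for frame in frames:
        amplitude = sum(frame) / len(frame)
        for n in range(samples_per_frame):
            audio.append(amplitude * math.sin(2 * math.pi * n / samples_per_frame))
    return audio


def synthesize(first_phoneme, second_phoneme):
    """Run the pipeline for two adjacent phonemes."""
    features = [extract_feature(first_phoneme), extract_feature(second_phoneme)]
    # Neither decode depends on the other's output, so both calls to the
    # same recurrent cell can be submitted concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_speech, second_speech = pool.map(rnn_decode, features)
    return vocode(first_speech), vocode(second_speech)


if __name__ == "__main__":
    first_audio, second_audio = synthesize("HH", "AY")
    print(len(first_audio), len(second_audio))
```

The point of the sketch is the data dependency: because the speech data for the second phoneme is computed from its own audio feature rather than from the output for the first phoneme, the two decodes need not run one after the other.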