CPC G10L 13/027 (2013.01) [G10L 13/10 (2013.01)] | 22 Claims |
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:
receiving a text utterance having one or more words, each word having one or more syllables, each syllable having one or more phonemes;
receiving as input to a vocoder model:
prosodic features output from a prosody model that represent an intended prosody for the text utterance, the prosodic features comprising a duration, pitch contour, and energy contour for the text utterance; and
a linguistic specification of the text utterance, the linguistic specification comprising sentence-level linguistic features for the text utterance, word-level linguistic features for each word of the text utterance, syllable-level linguistic features for each syllable of the text utterance, and phoneme-level linguistic features for each phoneme of the text utterance;
predicting, as output from the vocoder model, vocoder parameters based on the prosodic features output from the prosody model and the linguistic specification of the text utterance;
splitting the predicted vocoder parameters output from the vocoder model into Mel-cepstrum coefficients, aperiodicity components, and voicing components;
separately denormalizing the Mel-cepstrum coefficients, the aperiodicity components, and the voicing components;
concatenating the prosodic features output from the prosody model, the denormalized Mel-cepstrum coefficients, the denormalized aperiodicity components, and the denormalized voicing components into a vocoder vector; and
providing the vocoder vector to a parametric vocoder, the parametric vocoder configured to generate a synthesized speech representation of the text utterance and having the intended prosody.
|
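The splitting, separate denormalizing, and concatenating operations recited in claim 1 can be illustrated with a minimal per-frame sketch. Everything specific here is an assumption made for illustration and is not taken from the claim: the stream widths (`N_MCEP`, `N_AP`, `N_VOICING`), the use of mean/standard-deviation statistics for denormalization, the hypothetical `Stats` type, the helper names, and the ordering of components within the vocoder vector. The prosody model, the vocoder model, and the parametric vocoder are not modeled.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical stream widths; the claim does not specify any dimensions.
N_MCEP, N_AP, N_VOICING = 3, 2, 1


@dataclass
class Stats:
    """Hypothetical per-stream statistics (assumes z-score normalization)."""
    mean: float
    std: float


def split_params(
    frame: Sequence[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Split one frame of predicted vocoder parameters into three streams."""
    if len(frame) != N_MCEP + N_AP + N_VOICING:
        raise ValueError("unexpected frame width")
    mcep = list(frame[:N_MCEP])
    ap = list(frame[N_MCEP:N_MCEP + N_AP])
    voicing = list(frame[N_MCEP + N_AP:])
    return mcep, ap, voicing


def denormalize(values: Sequence[float], stats: Stats) -> List[float]:
    """Undo the assumed z-score normalization for a single stream."""
    return [v * stats.std + stats.mean for v in values]


def build_vocoder_vector(
    prosody: Sequence[float],
    frame: Sequence[float],
    mcep_stats: Stats,
    ap_stats: Stats,
    voicing_stats: Stats,
) -> List[float]:
    """Concatenate prosodic features with the separately denormalized streams."""
    mcep, ap, voicing = split_params(frame)
    return (
        list(prosody)
        + denormalize(mcep, mcep_stats)
        + denormalize(ap, ap_stats)
        + denormalize(voicing, voicing_stats)
    )


# Usage with made-up numbers: two prosodic values (e.g. pitch and energy
# for the frame) followed by one normalized frame of six parameters.
vector = build_vocoder_vector(
    prosody=[120.0, 0.5],
    frame=[0.0, 1.0, -1.0, 0.5, -0.5, 1.0],
    mcep_stats=Stats(mean=1.0, std=2.0),
    ap_stats=Stats(mean=-3.0, std=4.0),
    voicing_stats=Stats(mean=0.5, std=0.5),
)
print(vector)
```

Each stream gets its own statistics in this sketch because the claim recites denormalizing the three components separately; a real system would apply the same steps to every frame before handing the resulting vectors to the parametric vocoder.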