US 12,125,469 B2
Predicting parametric vocoder parameters from prosodic features
Rakesh Iyer, Mountain View, CA (US); and Vincent Wan, London (GB)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 17, 2023, as Appl. No. 18/488,735.
Application 18/488,735 is a continuation of application No. 17/647,246, filed on Jan. 6, 2022, granted, now 11,830,474.
Application 17/647,246 is a continuation of application No. 17/033,783, filed on Sep. 26, 2020, granted, now 11,232,780, issued on Jan. 25, 2022.
Claims priority of provisional application 63/069,431, filed on Aug. 24, 2020.
Prior Publication US 2024/0046915 A1, Feb. 8, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 13/10 (2013.01); G10L 13/027 (2013.01)

CPC G10L 13/027 (2013.01) [G10L 13/10 (2013.01)]

22 Claims

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:

receiving a text utterance having one or more words, each word having one or more syllables, each syllable having one or more phonemes;

receiving as input to a vocoder model:

prosodic features output from a prosody model that represent an intended prosody for the text utterance, the prosodic features comprising a duration, pitch contour, and energy contour for the text utterance; and

a linguistic specification of the text utterance, the linguistic specification comprising sentence-level linguistic features for the text utterance, word-level linguistic features for each word of the text utterance, syllable-level linguistic features for each syllable of the text utterance, and phoneme-level linguistic features for each phoneme of the text utterance;

predicting, as output from the vocoder model, vocoder parameters based on the prosodic features output from the prosody model and the linguistic specification of the text utterance;

splitting the predicted vocoder parameters output from the vocoder model into Mel-cepstrum coefficients, aperiodicity components, and voicing components;

separately denormalizing the Mel-cepstrum coefficients, the aperiodicity components, and the voicing components;

concatenating the prosodic features output from the prosody model, the denormalized Mel-cepstrum coefficients, the denormalized aperiodicity components, and the denormalized voicing components into a vocoder vector; and

providing the vocoder vector to a parametric vocoder, the parametric vocoder configured to generate a synthesized speech representation of the text utterance and having the intended prosody.
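
The splitting, denormalizing, and concatenating steps recited above can be illustrated with a minimal sketch for a single frame. This is not the patented implementation. The component sizes (`N_MCEP`, `N_AP`, `N_VOICING`), the function names, the statistics layout, and the choice of mean/standard-deviation denormalization are all assumptions made for illustration; the claim does not specify any of them.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical component sizes; the claim does not fix any dimensions.
N_MCEP = 3       # Mel-cepstrum coefficients per frame
N_AP = 2         # aperiodicity components per frame
N_VOICING = 1    # voicing components per frame

# Assumed layout: component name -> (per-dimension mean, per-dimension std).
Stats = Dict[str, Tuple[Sequence[float], Sequence[float]]]


def split_params(frame: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Split one predicted frame into Mel-cepstrum, aperiodicity, and voicing parts."""
    expected = N_MCEP + N_AP + N_VOICING
    if len(frame) != expected:
        raise ValueError(f"expected {expected} values, got {len(frame)}")
    mcep = list(frame[:N_MCEP])
    ap = list(frame[N_MCEP:N_MCEP + N_AP])
    voicing = list(frame[N_MCEP + N_AP:])
    return mcep, ap, voicing


def denormalize(values: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> List[float]:
    """Undo an assumed mean/std normalization: x = z * std + mean."""
    return [z * s + m for z, m, s in zip(values, mean, std)]


def build_vocoder_vector(prosody: Sequence[float], frame: Sequence[float], stats: Stats) -> List[float]:
    """Denormalize each component separately, then concatenate with the prosodic features."""
    mcep, ap, voicing = split_params(frame)
    mcep = denormalize(mcep, *stats["mcep"])
    ap = denormalize(ap, *stats["ap"])
    voicing = denormalize(voicing, *stats["voicing"])
    return list(prosody) + mcep + ap + voicing


# Illustrative values only (made-up numbers, not data from the patent).
prosody = [0.25, 120.0, 0.5]                      # duration, pitch, energy
predicted = [1.0, 0.0, -1.0, 0.5, 2.0, 1.0]       # normalized vocoder-model output
stats = {
    "mcep": ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),
    "ap": ([0.0, 0.0], [4.0, 0.5]),
    "voicing": ([0.5], [0.5]),
}
vocoder_vector = build_vocoder_vector(prosody, predicted, stats)
```

Keeping separate statistics per component mirrors the claim's requirement that the three components be denormalized separately. The prosodic features are passed through unchanged here because the claim recites denormalization only for the vocoder-model outputs.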