US 12,322,374 B2
Phrase-based end-to-end text-to-speech (TTS) synthesis
Ran Zhang, Redmond, WA (US); Jian Luan, Beijing (CN); and Yahuan Cong, Redmond, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 17/919,982
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Mar. 19, 2021, PCT No. PCT/US2021/023054
§ 371(c)(1), (2) Date Oct. 19, 2022,
PCT Pub. No. WO2021/242366, PCT Pub. Date Dec. 2, 2021.
Claims priority of application No. 202010460593.5 (CN), filed on May 26, 2020.
Prior Publication US 2023/0169953 A1, Jun. 1, 2023
Int. Cl. G10L 13/08 (2013.01); G10L 13/04 (2013.01)
CPC G10L 13/08 (2013.01) [G10L 13/04 (2013.01)]
13 Claims
 
1. A method for phrase-based end-to-end text-to-speech (TTS) synthesis, comprising:
  obtaining a text;
  identifying a target phrase in the text;
  determining a phrase context of the target phrase;
  obtaining a reference audio;
  generating an acoustic feature corresponding to the target phrase based at least on the target phrase, the phrase context and the reference audio, wherein the generating an acoustic feature comprises:
    generating a context embedding representation of the phrase context;
    generating an acoustic embedding representation of the reference audio; and
    generating the acoustic feature through an acoustic model conditioned by the context embedding representation and the acoustic embedding representation; and
  generating a speech waveform corresponding to the target phrase based on the acoustic feature.
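
For illustration only, the data flow recited in claim 1 can be sketched in Python. This is a toy sketch under stated assumptions, not the patented implementation: every function name, the punctuation-based phrase splitting, the choice of neighbouring phrases as the phrase context, the embedding sizes, and the arithmetic standing in for the encoders, the acoustic model and the vocoder are hypothetical placeholders. Only the order of the steps and the conditioning of the acoustic model on the two embeddings follow the claim.

```python
import math
import re
from typing import List, Sequence

EMBED_DIM = 4  # hypothetical embedding size
FRAME_DIM = 3  # hypothetical number of values per acoustic frame


def split_phrases(text: str) -> List[str]:
    """Toy phrase identification: split the text on punctuation (assumption)."""
    return [p.strip() for p in re.split(r"[,.;!?]", text) if p.strip()]


def phrase_context(phrases: Sequence[str], index: int) -> List[str]:
    """Assumed phrase context: the previous and next phrase ('' if absent)."""
    prev = phrases[index - 1] if index > 0 else ""
    nxt = phrases[index + 1] if index + 1 < len(phrases) else ""
    return [prev, nxt]


def embed_context(context: Sequence[str]) -> List[float]:
    """Stand-in context encoder: deterministic character statistics."""
    vec = [0.0] * EMBED_DIM
    for i, ch in enumerate(" ".join(context)):
        vec[i % EMBED_DIM] += ord(ch) / 1000.0
    return vec


def embed_reference_audio(samples: Sequence[float]) -> List[float]:
    """Stand-in acoustic encoder: mean, energy, peak and length of the audio."""
    n = max(len(samples), 1)
    mean = sum(samples) / n
    energy = sum(s * s for s in samples) / n
    peak = max((abs(s) for s in samples), default=0.0)
    return [mean, energy, peak, float(len(samples))]


def acoustic_model(phrase: str, ctx_emb: Sequence[float],
                   ac_emb: Sequence[float]) -> List[List[float]]:
    """Toy acoustic model: one frame per character, conditioned on both embeddings."""
    cond = sum(ctx_emb) + sum(ac_emb)
    return [
        [math.tanh(ord(ch) / 128.0 + 0.01 * cond * (k + 1)) for k in range(FRAME_DIM)]
        for ch in phrase
    ]


def vocoder(frames: Sequence[Sequence[float]], samples_per_frame: int = 4) -> List[float]:
    """Toy vocoder: each frame becomes one short sinusoid cycle scaled by its mean."""
    wave: List[float] = []
    for frame in frames:
        level = sum(frame) / len(frame)
        wave.extend(
            level * math.sin(2 * math.pi * i / samples_per_frame)
            for i in range(samples_per_frame)
        )
    return wave


def synthesize_phrase(text: str, index: int,
                      reference_audio: Sequence[float]) -> List[float]:
    """Run the claimed steps in order for the phrase at position `index`."""
    phrases = split_phrases(text)                      # identify phrases in the text
    target = phrases[index]                            # target phrase
    context = phrase_context(phrases, index)           # phrase context
    ctx_emb = embed_context(context)                   # context embedding
    ac_emb = embed_reference_audio(reference_audio)    # acoustic embedding
    features = acoustic_model(target, ctx_emb, ac_emb) # conditioned acoustic feature
    return vocoder(features)                           # speech waveform


if __name__ == "__main__":
    wave = synthesize_phrase("Hello there, how are you. Fine!", 1, [0.0, 0.5, -0.5])
    print(len(wave), "samples generated for the target phrase")
```

In a practical system each placeholder would presumably be a trained neural component; the sketch only shows that the target phrase, its context and the reference audio all feed the acoustic feature before waveform generation.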