US 12,322,374 B2
Phrase-based end-to-end text-to-speech (TTS) synthesis
Ran Zhang, Redmond, WA (US); Jian Luan, Beijing (CN); and Yahuan Cong, Redmond, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 17/919,982
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Mar. 19, 2021; PCT No. PCT/US2021/023054; § 371(c)(1), (2) Date Oct. 19, 2022; PCT Pub. No. WO2021/242366; PCT Pub. Date Dec. 2, 2021.
Claims priority of application No. 202010460593.5 (CN), filed on May 26, 2020.
Prior Publication US 2023/0169953 A1, Jun. 1, 2023
Int. Cl. G10L 13/08 (2013.01); G10L 13/04 (2013.01)

CPC G10L 13/08 (2013.01) [G10L 13/04 (2013.01)]

13 Claims

1. A method for phrase-based end-to-end text-to-speech (TTS) synthesis, comprising:

obtaining a text;

identifying a target phrase in the text;

determining a phrase context of the target phrase;

obtaining a reference audio;

generating an acoustic feature corresponding to the target phrase based at least on the target phrase, the phrase context and the reference audio, wherein the generating an acoustic feature comprises:

generating a context embedding representation of the phrase context;

generating an acoustic embedding representation of the reference audio; and

generating the acoustic feature through an acoustic model conditioned by the context embedding representation and the acoustic embedding representation; and

generating a speech waveform corresponding to the target phrase based on the acoustic feature.
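
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: every function name, the punctuation-based phrase splitting, the 4-dimensional toy embeddings, the additive conditioning, and the sinusoidal "vocoder" are assumptions made for illustration, standing in for the trained neural components that a real system would use. The sketch shows only how the target phrase, its phrase context, and the reference audio feed the acoustic model, whose output then drives waveform generation.

```python
import math
import re
from typing import Dict, List

EMB_DIM = 4            # toy embedding size (assumption)
SAMPLES_PER_FRAME = 8  # toy vocoder resolution (assumption)


def split_phrases(text: str) -> List[str]:
    """Split text at punctuation; a stand-in for real phrase identification."""
    return [p.strip() for p in re.split(r"[,.;!?]", text) if p.strip()]


def phrase_context(phrases: List[str], idx: int) -> Dict[str, str]:
    """Use the neighbouring phrases as the context of phrases[idx]."""
    return {
        "previous": phrases[idx - 1] if idx > 0 else "",
        "next": phrases[idx + 1] if idx + 1 < len(phrases) else "",
    }


def embed_text(s: str) -> List[float]:
    """Toy deterministic text embedding built from bucketed character codes."""
    vec = [0.0] * EMB_DIM
    for i, ch in enumerate(s):
        vec[i % EMB_DIM] += ord(ch) / 1000.0
    n = max(len(s), 1)
    return [v / n for v in vec]


def context_embedding(ctx: Dict[str, str]) -> List[float]:
    """Context embedding: average of the neighbouring phrases' embeddings."""
    prev, nxt = embed_text(ctx["previous"]), embed_text(ctx["next"])
    return [(a + b) / 2 for a, b in zip(prev, nxt)]


def acoustic_embedding(audio: List[float]) -> List[float]:
    """Acoustic embedding: mean, RMS, peak, and zero-crossing rate of the audio."""
    if not audio:
        raise ValueError("reference audio must not be empty")
    n = len(audio)
    mean = sum(audio) / n
    rms = math.sqrt(sum(x * x for x in audio) / n)
    peak = max(abs(x) for x in audio)
    zcr = sum(1 for a, b in zip(audio, audio[1:]) if a * b < 0) / max(n - 1, 1)
    return [mean, rms, peak, zcr]


def acoustic_model(phrase: str, ctx_emb: List[float],
                   ac_emb: List[float]) -> List[List[float]]:
    """One toy frame per character, conditioned additively on both embeddings."""
    return [[ord(ch) / 1000.0 + c + a for c, a in zip(ctx_emb, ac_emb)]
            for ch in phrase]


def generate_acoustic_feature(phrase: str, ctx: Dict[str, str],
                              audio: List[float]) -> List[List[float]]:
    """Embed the context and the reference audio, then run the acoustic model."""
    return acoustic_model(phrase, context_embedding(ctx), acoustic_embedding(audio))


def vocoder(features: List[List[float]]) -> List[float]:
    """Toy waveform generator: one sine cycle per frame, scaled by frame mean."""
    wave = []
    for frame in features:
        amp = min(1.0, abs(sum(frame) / len(frame)))
        for k in range(SAMPLES_PER_FRAME):
            wave.append(amp * math.sin(2 * math.pi * k / SAMPLES_PER_FRAME))
    return wave


def synthesize_phrase(text: str, idx: int, audio: List[float]) -> List[float]:
    """Run the claimed steps in order for the idx-th phrase of the text."""
    phrases = split_phrases(text)
    target = phrases[idx]
    ctx = phrase_context(phrases, idx)
    return vocoder(generate_acoustic_feature(target, ctx, audio))
```

In this sketch, calling `synthesize_phrase(text, 1, reference_audio)` on a three-phrase text treats the second phrase as the target phrase and its two neighbours as the phrase context. Changing the reference audio changes the acoustic embedding and therefore the generated features, which is the conditioning relationship the claim describes.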