US 12,014,720 B2
Voice synthesis method, model training method, device and computer device
Xixin Wu, Shenzhen (CN); Mu Wang, Shenzhen (CN); Shiyin Kang, Shenzhen (CN); Dan Su, Shenzhen (CN); and Dong Yu, Shenzhen (CN)
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, Shenzhen (CN)
Filed by Tencent Technology (Shenzhen) Company Limited, Shenzhen (CN)
Filed on Aug. 21, 2020, as Appl. No. 16/999,989.
Application 16/999,989 is a continuation of application No. PCT/CN2019/090493, filed on Jun. 10, 2019.
Claims priority of application No. 201810828220.1 (CN), filed on Jul. 25, 2018.
Prior Publication US 2020/0380949 A1, Dec. 3, 2020
Int. Cl. G10L 13/00 (2006.01); G10L 19/02 (2013.01)
CPC G10L 13/00 (2013.01) [G10L 19/02 (2013.01)]
18 Claims
 
1. A speech synthesis method performed at a computer device having one or more processors and memory storing one or more programs to be executed by the one or more processors, the method comprising:
obtaining linguistic data;
encoding the linguistic data, to obtain encoded linguistic data;
obtaining reference linguistic data and corresponding target reference speech data;
encoding the reference linguistic data, to obtain encoded reference linguistic data;
decoding the encoded reference linguistic data, to obtain synthesized reference speech data;
determining a residual between the target reference speech data and the synthesized reference speech data;
obtaining an embedded vector for speech feature conversion, the embedded vector representing a speaking style feature of a target user and being generated according to the residual between the synthesized reference speech data synthesized from the reference linguistic data different from the linguistic data and the target reference speech data that correspond to the same reference linguistic data; and
decoding the encoded linguistic data by performing the speech feature conversion on the encoded linguistic data according to the embedded vector, to obtain target synthesized speech data corresponding to the linguistic data.
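The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the fixed random weight matrices (W_enc, W_dec, W_res, W_style), the linear encoder and decoder, the mean pooling of the residual, and the additive use of the embedded vector are all assumptions chosen for brevity, and every function name is hypothetical. The sketch also assumes the target reference speech is frame-aligned with the synthesized reference speech so that the residual can be taken element-wise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes, chosen only for illustration.
LING_DIM, ENC_DIM, SPEECH_DIM, EMB_DIM = 8, 16, 4, 3

# Assumption: fixed random weights stand in for trained models.
W_enc = rng.standard_normal((LING_DIM, ENC_DIM)) * 0.1
W_dec = rng.standard_normal((ENC_DIM, SPEECH_DIM)) * 0.1
W_res = rng.standard_normal((SPEECH_DIM, EMB_DIM)) * 0.1
W_style = rng.standard_normal((EMB_DIM, SPEECH_DIM)) * 0.1


def encode(linguistic):
    """Encode linguistic frames of shape (T, LING_DIM) into (T, ENC_DIM)."""
    return np.tanh(linguistic @ W_enc)


def decode(encoded, embedding=None):
    """Decode encoded frames into speech frames of shape (T, SPEECH_DIM).

    When an embedded vector is given, it shifts every output frame
    (assumption: a simple additive style conversion).
    """
    speech = encoded @ W_dec
    if embedding is not None:
        speech = speech + embedding @ W_style  # broadcasts over frames
    return speech


def style_embedding(ref_linguistic, target_ref_speech):
    """Derive the embedded vector from the reference pair."""
    # Synthesize reference speech from the reference linguistic data.
    synthesized_ref = decode(encode(ref_linguistic))
    # Residual between target and synthesized reference speech.
    residual = target_ref_speech - synthesized_ref
    # Assumption: mean-pool over frames, then project to EMB_DIM.
    return residual.mean(axis=0) @ W_res


def synthesize(linguistic, ref_linguistic, target_ref_speech):
    """Produce target synthesized speech for new linguistic data."""
    embedding = style_embedding(ref_linguistic, target_ref_speech)
    return decode(encode(linguistic), embedding)
```

In this sketch the reference pair may have a different number of frames than the linguistic data to be synthesized, since the residual is pooled into a single vector before it is applied. If the target reference speech equals the synthesized reference speech, the residual is zero and the output matches plain decoding without style conversion.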