US 11,908,447 B2
Method and apparatus for synthesizing multi-speaker speech using artificial neural network
Joon Hyuk Chang, Seoul (KR); and Jae Uk Lee, Seoul (KR)
Assigned to IUCF-HYU (INDUSTRY-UNIVERSITY COOPERATION FOUNDATION HANYANG UNIVERSITY), Seoul (KR)
Appl. No. 17/596,037
Filed by IUCF-HYU (Industry-University Cooperation Foundation Hanyang University), Seoul (KR)
PCT Filed Aug. 4, 2021, PCT No. PCT/KR2021/010307
§ 371(c)(1), (2) Date Dec. 2, 2021,
PCT Pub. No. WO2022/031060, PCT Pub. Date Feb. 10, 2022.
Claims priority of application No. 10-2020-0097585 (KR), filed on Aug. 4, 2020.
Prior Publication US 2023/0178066 A1, Jun. 8, 2023
Int. Cl. G10L 13/047 (2013.01); G10L 17/04 (2013.01); G10L 17/06 (2013.01); G10L 17/18 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)
CPC G10L 13/047 (2013.01) [G10L 17/04 (2013.01); G10L 17/06 (2013.01); G10L 17/18 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)]
9 Claims
 
1. A method for synthesizing multi-speaker speech using an artificial neural network, comprising:
generating and storing a speech learning model for a plurality of users by subjecting a synthetic artificial neural network of a speech synthesis model to learning, based on speech data of the plurality of users;
generating speaker vectors for a new user who has not been learned and the plurality of users who have already been learned by using a speaker recognition model;
determining a speaker vector having a most similar relationship with the speaker vector of the new user according to preset criteria out of the speaker vectors of the plurality of users who have already been learned;
generating and learning a speaker embedding of the new user by subjecting the synthetic artificial neural network of the speech synthesis model to learning, by using a value of a speaker embedding of a user for the determined speaker vector as an initial value and based on speaker data of the new user;
wherein the generating a speech learning model for the new user comprises performing the learning of the synthetic artificial neural network of the speech synthesis model only for a preset time that prevents overfitting,
wherein the preset time comprises a range of 10 seconds to 60 seconds, and
wherein the generating speaker vectors comprises generating the speaker vectors using an artificial neural network of the speaker recognition model, by using a speech signal of the user as an input value.
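
The selection and adaptation steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Cosine similarity is assumed here as the "preset criteria," which the claim does not specify, and the names cosine_similarity, most_similar_speaker, init_embedding, fine_tune, and train_step are hypothetical. The speaker vectors are assumed to come from a separate speaker recognition network, and train_step stands in for one update of the synthesis network on the new user's speech data.

```python
import math
import time


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_speaker(new_vector, learned_vectors):
    """Return the ID of the learned speaker closest to new_vector.

    Assumption: cosine similarity plays the role of the "preset criteria".
    """
    return max(
        learned_vectors,
        key=lambda sid: cosine_similarity(new_vector, learned_vectors[sid]),
    )


def init_embedding(new_vector, learned_vectors, learned_embeddings):
    """Copy the embedding of the most similar learned speaker as the initial value."""
    sid = most_similar_speaker(new_vector, learned_vectors)
    return sid, list(learned_embeddings[sid])


def fine_tune(embedding, train_step, time_budget_s=30.0, clock=time.monotonic):
    """Apply train_step repeatedly until the preset time has elapsed.

    The budget is restricted to the 10 to 60 second range recited in the claim.
    train_step is a hypothetical callable that returns the updated embedding.
    """
    if not 10.0 <= time_budget_s <= 60.0:
        raise ValueError("time budget must be between 10 and 60 seconds")
    start = clock()
    steps = 0
    while clock() - start < time_budget_s:
        embedding = train_step(embedding)
        steps += 1
    return embedding, steps
```

In this sketch the only stopping condition is wall-clock time, which mirrors the claim's use of a preset training time to limit overfitting. The clock parameter is included only so the loop can be exercised with a simulated clock.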