US 12,283,267 B2
Speech synthesis apparatus and method thereof
Sang Il Ahn, Cheongju-si (KR); Seung Woo Choi, Seoul (KR); Seung Ju Han, Seoul (KR); Dong Young Kim, Seoul (KR); and Sung Joo Ha, Gyeonggi-do (KR)
Assigned to Hyperconnect LLC, Dallas, TX (US)
Filed by Hyperconnect LLC, Seoul (KR)
Filed on Nov. 16, 2021, as Appl. No. 17/455,211.
Claims priority of application No. 10-2020-0178870 (KR), filed on Dec. 18, 2020.
Prior Publication US 2022/0199068 A1, Jun. 23, 2022
Int. Cl. G10L 13/00 (2006.01); G10L 13/047 (2013.01); G10L 15/16 (2006.01)
CPC G10L 13/047 (2013.01) [G10L 15/16 (2013.01)] 16 Claims
OG exemplary drawing
 
1. A speech synthesis method, the speech synthesis method comprising:
acquiring a second set of speech data and a target text;
acquiring, using a text-to-speech synthesis model trained based on text data corresponding to a first set of speech data and at least a portion of the first set of speech data, a first set of information, wherein the first set of information includes a first set of embedding information corresponding to the second set of speech data;
acquiring, using the text-to-speech synthesis model, a second set of information, wherein:
the second set of information includes a second set of embedding information, comprising embeddings of the second set of speech data,
the second set of embedding information is acquired by deploying an attention mechanism using query components generated based on a sequence of the target text, and
acquiring the second set of information comprises:
encoding the target text, and
extracting the query components from the encoded target text, wherein:
the query components are generated based on a sequence of the encoded target text, and
the sequence is generated based on the first set of information and the encoded target text;
acquiring audio data, using the text-to-speech synthesis model, wherein the audio data:
corresponds to the target text, and
reflects characteristics of speech of a speaker of the second set of speech data, and is a sound spectrum visualization generated based on the first set of information and the second set of information; and
deriving, using the text-to-speech synthesis model, a speech recording corresponding to the audio data.
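The attention step recited in the claim (query components extracted from the encoded target text, attending over embeddings of the reference speech) can be illustrated with a minimal scaled-dot-product-attention sketch. This is not the patented implementation; all names (`attend_speaker_embeddings`, the projection matrices `W_q`, `W_k`, `W_v`) and the choice of scaled dot-product attention are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_speaker_embeddings(encoded_text, speech_embeddings, W_q, W_k, W_v):
    """Sketch of the claimed attention mechanism (illustrative only):
    queries come from the encoded target text; keys and values come
    from embeddings of the second set of speech data, yielding one
    speaker-conditioned embedding per text step."""
    Q = encoded_text @ W_q        # (T_text, d): query components
    K = speech_embeddings @ W_k   # (T_speech, d)
    V = speech_embeddings @ W_v   # (T_speech, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)  # each text step attends over speech frames
    return weights @ V            # (T_text, d): second set of embedding information

# Toy usage with random projections (shapes only, no trained model).
rng = np.random.default_rng(0)
d = 8
encoded = rng.normal(size=(5, d))      # 5 encoded text steps
reference = rng.normal(size=(12, d))   # 12 reference-speech frames
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
embeddings = attend_speaker_embeddings(encoded, reference, W_q, W_k, W_v)
```

In a full system, the resulting per-step embeddings would condition the decoder that produces the sound spectrum visualization (e.g., a mel spectrogram), which a vocoder then converts to the speech recording.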