US 11,868,725 B2
Server, client device, and operation methods thereof for training natural language understanding model
Hejung Yang, Suwon-si (KR); Kwangyoun Kim, Suwon-si (KR); and Sungsoo Kim, Suwon-si (KR)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed on Jan. 4, 2021, as Appl. No. 17/140,619.
Claims priority of provisional application 62/956,500, filed on Jan. 2, 2020.
Claims priority of application No. 10-2020-0019989 (KR), filed on Feb. 18, 2020.
Prior Publication US 2021/0209304 A1, Jul. 8, 2021
Int. Cl. G10L 15/187 (2013.01); G06F 40/295 (2020.01); G10L 13/00 (2006.01)

CPC G06F 40/295 (2020.01) [G10L 13/00 (2013.01); G10L 15/187 (2013.01)]

15 Claims

1. A method, performed by a server, of training a language model by using a text, the method comprising:

receiving, from a client device, an input text input by a user;

identifying a replacement target text to be replaced from among one or more words included in the input text;

generating a replacement text that is predicted to be uttered for the identified replacement target text by the user and has a phonetic similarity with the identified replacement target text;

generating one or more training text candidates by replacing, within the input text, the replacement target text with the generated replacement text; and

training a natural language understanding (NLU) model by using the input text and the one or more training text candidates as training data,

wherein the generating of the replacement text comprises:

converting the replacement target text into a custom wave signal by using a personalized text-to-speech (TTS) model, wherein the personalized TTS model is an artificial intelligence model trained to generate a wave signal from a text by reflecting personalized characteristics including at least one of age, gender, region, dialect, intonation, and pronunciation of the user;

outputting the custom wave signal; and

generating the replacement text by converting the custom wave signal into an output text using an automatic speech recognition (ASR) model.
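The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `generate_replacement_text`, `build_training_candidates`, `toy_tts`, `toy_asr`, and the `CONFUSIONS` table are hypothetical, and the TTS and ASR models are replaced by toy stand-ins so the example runs without any speech library. The selection of replacement target texts is assumed to happen elsewhere and is passed in as a list.

```python
from typing import Callable, Dict, List

Wave = List[float]  # placeholder type for a wave signal (assumption)

# Hypothetical table imitating ASR output for phonetically similar words.
CONFUSIONS: Dict[str, str] = {"Seoul": "soul"}


def toy_tts(text: str) -> Wave:
    """Stand-in for a personalized TTS model: encodes characters as samples."""
    return [float(ord(ch)) for ch in text]


def toy_asr(wave: Wave) -> str:
    """Stand-in for an ASR model: decodes samples, then applies a confusion."""
    decoded = "".join(chr(int(sample)) for sample in wave)
    return CONFUSIONS.get(decoded, decoded)


def generate_replacement_text(
    target: str,
    tts: Callable[[str], Wave],
    asr: Callable[[Wave], str],
) -> str:
    """Round-trip the target text through TTS and then ASR."""
    wave = tts(target)  # text to custom wave signal
    return asr(wave)    # wave signal back to output text


def build_training_candidates(
    input_text: str,
    targets: List[str],
    tts: Callable[[str], Wave],
    asr: Callable[[Wave], str],
) -> List[str]:
    """Replace each target in the input text with its round-tripped form."""
    candidates: List[str] = []
    for target in targets:
        replacement = generate_replacement_text(target, tts, asr)
        # Keep only candidates that actually differ from the input text.
        if replacement != target and target in input_text:
            candidates.append(input_text.replace(target, replacement))
    return candidates


if __name__ == "__main__":
    text = "play songs by Seoul band"
    candidates = build_training_candidates(text, ["Seoul"], toy_tts, toy_asr)
    training_data = [text] + candidates  # input text plus candidates
    print(training_data)
```

In this sketch the training data passed to the NLU model would be the original input text together with the generated candidates. A real system would substitute actual TTS and ASR models for the two stand-in callables.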