US 12,406,653 B2
Customizing text-to-speech language models using adapters for conversational AI systems and applications
Cheng-Ping Hsieh, La Jolla, CA (US); Subhankar Ghosh, Santa Clara, CA (US); and Boris Ginsburg, Sunnyvale, CA (US)
Assigned to NVIDIA CORPORATION, Santa Clara, CA (US)
Filed by NVIDIA CORPORATION, Santa Clara, CA (US)
Filed on Oct. 13, 2022, as Appl. No. 17/965,708.
Prior Publication US 2024/0127788 A1, Apr. 18, 2024
Int. Cl. G10L 13/00 (2006.01); G10L 17/02 (2013.01)

CPC G10L 13/00 (2013.01) [G10L 17/02 (2013.01)]

19 Claims

1. A method comprising:

determining, based at least on identification data corresponding to a speaker, an identity embedding associated with the speaker;

activating, based at least on the identity embedding, one or more adapters, from a plurality of adapters included in a text-to-speech (TTS) machine learning model, that correspond to the speaker, wherein each of the plurality of adapters is trained using speaker-specific training data separately from fixed components of the TTS machine learning model;

processing, using the TTS machine learning model including the one or more activated adapters, a textual input to generate a speech representation corresponding to the speaker; and

causing output of audio corresponding to the speech representation.
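
The four steps of claim 1 can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the patented implementation. The names `SPEAKER_EMBEDDINGS`, `Adapter`, `ToyTTS`, and `speak`, the cosine-similarity threshold used to pick adapters, and the scalar "weights" are all hypothetical, and the character-code "encoder" merely stands in for the fixed components of a real TTS model.

```python
import math

# Hypothetical lookup: identification data (a speaker id) -> identity embedding.
SPEAKER_EMBEDDINGS = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class Adapter:
    """Toy speaker-specific adapter applied as a residual update."""

    def __init__(self, key, scale):
        self.key = key      # embedding of the speaker this adapter belongs to
        self.scale = scale  # stands in for separately trained adapter weights

    def __call__(self, hidden):
        return [h + self.scale * h for h in hidden]


class ToyTTS:
    """Fixed base components plus a pool of adapters (assumed structure)."""

    def __init__(self, adapters):
        self.adapters = adapters
        self.active = []

    def base_encode(self, text):
        # Fixed component: a placeholder for a frozen text encoder.
        return [float(ord(ch)) for ch in text]

    def activate(self, identity_embedding, threshold=0.9):
        # Step 2: activate the adapters that correspond to the speaker.
        self.active = [
            a for a in self.adapters
            if cosine(a.key, identity_embedding) >= threshold
        ]
        return self.active

    def synthesize(self, text):
        # Step 3: run the text through the base model and active adapters.
        hidden = self.base_encode(text)
        for adapter in self.active:
            hidden = adapter(hidden)
        return hidden


def speak(model, speaker_id, text):
    # Step 1: determine the identity embedding from identification data.
    embedding = SPEAKER_EMBEDDINGS[speaker_id]
    model.activate(embedding)
    representation = model.synthesize(text)
    # Step 4: a real system would pass this to a vocoder and audio device.
    return representation


model = ToyTTS([Adapter([1.0, 0.0], 0.5), Adapter([0.0, 1.0], -0.5)])
alice_out = speak(model, "alice", "A")
bob_out = speak(model, "bob", "A")
```

In this sketch the same text yields different representations per speaker because only the activated adapter changes; the base encoder is shared and never modified, mirroring the claim's separation between adapters and fixed components.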