CPC G06F 40/35 (2020.01) [G06F 40/284 (2020.01); G06F 40/289 (2020.01); G06N 20/00 (2019.01); G10L 13/08 (2013.01); G10L 25/63 (2013.01)]
21 Claims

1. A system comprising:
a computing platform having a speech synthesizer, processing hardware, and a system memory storing a software code, the software code including at least one of a trained machine learning (ML) model configured to function as an autoregressive generator or a stochastic model trained using unsupervised or semi-supervised learning;
the processing hardware configured to execute the software code to:
receive dialogue data identifying an utterance for use by a digital character in a conversation;
analyze, using the dialogue data, an emotionality of the utterance at a plurality of structural levels of the utterance;
supplement the utterance with one or more emotional attributions, using the at least one of the trained ML model or the stochastic model and the emotionality of the utterance at the plurality of structural levels, to provide a plurality of candidate emotionally enhanced utterances;
perform an audio validation of at least some of the plurality of candidate emotionally enhanced utterances to provide a validated emotionally enhanced utterance including a non-verbal vocalization, wherein the audio validation identifies the validated emotionally enhanced utterance as having a best audio quality of the at least some of the plurality of candidate emotionally enhanced utterances;
output emotionally attributed dialogue data providing the validated emotionally enhanced utterance for use by the digital character in the conversation; and
synthesize, using the speech synthesizer and the emotionally attributed dialogue data, the validated emotionally enhanced utterance to generate synthesized speech for utterance by the digital character.
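The claimed pipeline (receive dialogue data, analyze emotionality at multiple structural levels, generate candidate emotionally enhanced utterances, audio-validate the candidates, and output the best one) can be illustrated with a minimal sketch. Everything below is hypothetical: the lexicon, the `[laughs]`-style non-verbal vocalizations, the candidate generator, and the audio-quality heuristic are toy stand-ins for the trained ML/stochastic model and the real audio validation the claim recites, not the patented implementation.

```python
# Illustrative sketch of the claim-1 pipeline. All names and heuristics
# are hypothetical stand-ins; a real system would use a trained model
# and synthesized-audio quality measurement.

POSITIVE = {"great", "wonderful", "happy", "love"}
NEGATIVE = {"sad", "terrible", "sorry", "awful"}


def analyze_emotionality(utterance: str) -> dict:
    """Score emotionality at three structural levels: word, clause, utterance."""
    words = utterance.lower().replace(",", "").rstrip(".!?").split()
    word_scores = {
        w: (1 if w in POSITIVE else -1 if w in NEGATIVE else 0) for w in words
    }
    clauses = [c.strip() for c in utterance.split(",")]
    clause_scores = [
        sum(word_scores.get(w.lower().rstrip(".!?"), 0) for w in c.split())
        for c in clauses
    ]
    total = sum(clause_scores)
    label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
    return {"word": word_scores, "clause": clause_scores, "utterance": label}


def supplement(utterance: str, profile: dict) -> list[str]:
    """Stand-in for the trained generator: emit candidate utterances that
    add a non-verbal vocalization matching the utterance-level emotion."""
    vocal = {"positive": "[laughs]", "negative": "[sighs]", "neutral": "[pauses]"}
    tag = vocal[profile["utterance"]]
    return [f"{tag} {utterance}", f"{utterance} {tag}"]


def audio_validate(candidates: list[str]) -> str:
    """Stand-in audio validation: score each candidate and keep the best.
    A real system would synthesize each candidate and measure audio quality."""
    def score(c: str) -> int:
        # Toy heuristic: prefer a leading vocalization.
        return 0 if c.startswith("[") else -1
    return max(candidates, key=score)


def pipeline(dialogue_data: dict) -> dict:
    """End-to-end: dialogue data in, emotionally attributed dialogue data out."""
    utterance = dialogue_data["utterance"]
    profile = analyze_emotionality(utterance)
    candidates = supplement(utterance, profile)
    validated = audio_validate(candidates)
    return {"utterance": validated, "emotion": profile["utterance"]}
```

For example, `pipeline({"utterance": "I am so happy, this is wonderful."})` yields a validated candidate prefixed with `[laughs]` and an utterance-level emotion of `positive`. The final claim step, speech synthesis from the emotionally attributed dialogue data, is omitted here since it depends on an external synthesizer.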