US 12,380,915 B2
Machine learning based emotion prediction and forecasting in conversation
Rosalin Parida, Soro (IN); Bhushan Gurmukhdas Jagyasi, Thane West (IN); Surajit Sen, Bangalore (IN); Aditi Debsharma, Thane (IN); and Gopali Raval Contractor, Mumbai (IN)
Assigned to ACCENTURE GLOBAL SOLUTIONS LIMITED, Dublin (IE)
Filed by Accenture Global Solutions Limited, Dublin (IE)
Filed on Nov. 30, 2022, as Appl. No. 18/071,884.
Prior Publication US 2024/0177729 A1, May 30, 2024
Int. Cl. G10L 25/63 (2013.01); G10L 25/27 (2013.01)
CPC G10L 25/63 (2013.01) [G10L 25/27 (2013.01)] 17 Claims
OG exemplary drawing
 
1. A method for emotion recognition and forecasting in conversations comprising:
obtaining, with a processor circuitry, audio data of a conversation involving a plurality of speakers, the audio data comprising a plurality of utterances of the speakers;
identifying a plurality of turns of the conversation from the plurality of utterances, a turn representing a temporal speech window unit for analyzing emotion features of the speakers;
extracting, with the processor circuitry, audio embedding features from the plurality of turns;
obtaining, with the processor circuitry, a plurality of text segments associated with the audio data;
extracting, with the processor circuitry, text embedding features from the plurality of text segments;
obtaining, with the processor circuitry, speaker embedding features associated with the audio data;
concatenating, with the processor circuitry, the speaker embedding features;
obtaining, with the processor circuitry, a plurality of emotion features corresponding to the plurality of turns, the plurality of emotion features indicating temporal emotion dynamics of the speakers over the turns;
concatenating, with the processor circuitry, the plurality of emotion features; and
executing, with the processor circuitry, a tree-based prediction model to predict emotion features of the plurality of speakers for a subsequent turn of the conversation based on the audio embedding features, text embedding features, the concatenated speaker embedding features, and the concatenated emotion features,
wherein the tree-based prediction model comprises multiple layers of stacked ensemble models, and
wherein the multiple layers of stacked ensemble models comprise a first layer of ensemble models and a second layer of ensemble models, and wherein the executing the tree-based prediction model to predict the emotion features of the plurality of speakers for the subsequent turn of the conversation comprises:
inputting the audio embedding features, the text embedding features, the concatenated speaker embedding features, and the concatenated emotion features to the first layer of ensemble models for each of the first layer of ensemble models to predict intermediate emotion features of the plurality of speakers for the subsequent turn respectively, to obtain first layer emotion prediction results;
concatenating the first layer emotion prediction results with the audio embedding features, the text embedding features, the concatenated speaker embedding features, and the concatenated emotion features as intermediate concatenated embedding features;
inputting the intermediate concatenated embedding features to the second layer of ensemble models for each of the second layer of ensemble models to predict intermediate emotion features of the plurality of speakers for the subsequent turn respectively, to obtain second layer emotion prediction results; and
determining the emotion features of the plurality of speakers for the subsequent turn based on the second layer emotion prediction results.
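
The following is a minimal sketch of the two-layer stacked ensemble recited in claim 1, assuming the audio, text, speaker, and per-turn emotion features have already been extracted and concatenated into fixed-length vectors per turn. The specific estimators (RandomForestRegressor, GradientBoostingRegressor), the two-dimensional valence/arousal emotion target, and the averaging of second-layer outputs are illustrative assumptions, not details taken from the patent.

```python
# Hedged sketch: two-layer stacked ensemble for next-turn emotion prediction.
# Feature dimensions, model choices, and the final averaging step are
# assumptions for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)

# Synthetic training data: one row per turn.
# X = [audio emb | text emb | speaker emb | emotion features of prior turns]
# y = emotion features of the *next* turn (here: valence, arousal).
n_turns, feat_dim, emo_dim = 200, 64, 2
X = rng.normal(size=(n_turns, feat_dim))
y = rng.normal(size=(n_turns, emo_dim))

# First layer: independent ensemble models, each predicting the next-turn
# emotion features from the concatenated embedding features.
first_layer = [
    MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=0)),
    MultiOutputRegressor(GradientBoostingRegressor(random_state=0)),
]
for model in first_layer:
    model.fit(X, y)

# Intermediate concatenated embedding features: first-layer predictions
# concatenated with the original embedding features, as in the claim.
first_preds = np.hstack([m.predict(X) for m in first_layer])
X_intermediate = np.hstack([first_preds, X])

# Second layer: ensemble models trained on the intermediate features.
second_layer = [
    MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=1)),
    MultiOutputRegressor(GradientBoostingRegressor(random_state=1)),
]
for model in second_layer:
    model.fit(X_intermediate, y)

# Next-turn emotion features determined from the second-layer results;
# averaging the second-layer outputs is one plausible choice.
second_preds = np.stack([m.predict(X_intermediate) for m in second_layer])
next_turn_emotion = second_preds.mean(axis=0)
print(next_turn_emotion.shape)  # (n_turns, emo_dim)
```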