CPC G06T 13/40 (2013.01) [G06T 13/205 (2013.01); G06V 40/176 (2022.01); G06V 40/193 (2022.01); G10L 15/1815 (2013.01); G10L 15/22 (2013.01); G10L 25/63 (2013.01)]
20 Claims
1. A method comprising:
receiving, by a processor via an audio-visual input device, audio-visual input data of user communications during a period of time;
utilizing, by the processor, at least one speech recognition model to recognize speech data of the audio-visual input data;
inputting, by the processor, the speech data into at least one natural language understanding model to produce speech recognition data indicative of meaning, intent and sentiment;
determining, by the processor, at least one current emotional complex signature associated with user reactions during a current emotional state of the user during the period of time based at least in part on:
the speech recognition data and at least one of:
at least one time-varying speech emotion metric or
at least one time-varying facial emotion metric;
wherein the at least one time-varying speech emotion metric is determined by:
determining, by the processor, the at least one time-varying speech emotion metric throughout the period of time based at least in part on the speech recognition data; and
wherein the at least one time-varying facial emotion metric is determined by:
utilizing, by the processor, at least one facial emotion recognition model to produce facial action units representative of recognized facial features represented in the audio-visual input data;
determining, by the processor, the at least one time-varying facial emotion metric throughout the period of time based at least in part on the speech recognition data, the facial action units and a facial action coding system;
logging, by the processor, the at least one current emotional complex signature in a memory;
tagging, by the processor, a high-amplitude, high-confidence interaction to indicate at least one changed emotional state where a magnitude of the at least one current emotional complex signature exceeds a predetermined threshold; and
presenting, via at least one output device, by the processor, a virtual representation of a responder to the user in response to the at least one changed emotional state.
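
The claim's speech branch determines a time-varying speech emotion metric from the speech recognition data throughout the period of time. Below is a minimal, non-limiting sketch of one way such a metric could be sampled, by scoring time-stamped transcript tokens against a valence lexicon over a sliding window; the SENTIMENT_LEXICON, window length, step size, and score range are illustrative assumptions and are not recited in the claim.

```python
# Hypothetical sketch: a time-varying speech emotion metric computed by
# scoring windowed transcript tokens against a toy valence lexicon.
# The lexicon, window size, and score range are illustrative assumptions;
# a real system would use a trained sentiment model.
from dataclasses import dataclass

SENTIMENT_LEXICON = {"great": 1.0, "happy": 0.8, "fine": 0.2,
                     "slow": -0.4, "angry": -0.9, "terrible": -1.0}

@dataclass
class TimedWord:
    word: str    # recognized token from the speech recognition data
    t: float     # utterance time in seconds within the period of time

def speech_emotion_metric(words: list[TimedWord],
                          period_end: float,
                          window_s: float = 5.0,
                          step_s: float = 1.0) -> list[tuple[float, float]]:
    """Return (timestamp, valence) samples spanning the period of time."""
    samples = []
    t = 0.0
    while t <= period_end:
        in_window = [w for w in words if t - window_s < w.t <= t]
        scores = [SENTIMENT_LEXICON.get(w.word.lower(), 0.0) for w in in_window]
        valence = sum(scores) / len(scores) if scores else 0.0
        samples.append((t, valence))
        t += step_s
    return samples

if __name__ == "__main__":
    transcript = [TimedWord("great", 1.2), TimedWord("slow", 6.8),
                  TimedWord("angry", 7.5)]
    for ts, v in speech_emotion_metric(transcript, period_end=10.0):
        print(f"t={ts:4.1f}s  valence={v:+.2f}")
```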
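The facial branch maps facial action units (AUs) onto emotion scores using a facial action coding system. A minimal sketch follows, assuming a small AU-to-emotion table drawn from commonly cited EMFACS-style prototype pairings (e.g., AU6 + AU12 for happiness); the prototype table, the [0, 1] intensity scale, and the mean-intensity scoring rule are simplified illustrations, not the claimed method itself.

```python
# Hypothetical sketch: mapping per-frame facial action unit (AU) activations
# to a time-varying facial emotion metric via FACS-style AU combinations.
# The prototype table follows commonly cited EMFACS pairings; intensities
# in [0, 1] and the scoring rule are simplified, illustrative assumptions.

# Emotion -> required AUs (EMFACS-style prototypes).
FACS_PROTOTYPES = {
    "happiness": {6, 12},
    "sadness": {1, 4, 15},
    "surprise": {1, 2, 5, 26},
    "anger": {4, 5, 7, 23},
    "disgust": {9, 15, 16},
}

def facial_emotion_metric(frames: list[dict[int, float]]
                          ) -> list[dict[str, float]]:
    """For each frame of AU activations {au_number: intensity}, score each
    emotion as the mean intensity of its prototype AUs (0 if any is absent)."""
    series = []
    for aus in frames:
        scores = {}
        for emotion, proto in FACS_PROTOTYPES.items():
            if proto <= aus.keys():  # all prototype AUs were detected
                scores[emotion] = sum(aus[a] for a in proto) / len(proto)
            else:
                scores[emotion] = 0.0
        series.append(scores)
    return series

if __name__ == "__main__":
    # Two frames: a smile (AU6 + AU12), then a sadness-like pattern.
    frames = [{6: 0.9, 12: 0.8}, {1: 0.6, 4: 0.7, 15: 0.5}]
    for i, scores in enumerate(facial_emotion_metric(frames)):
        top = max(scores, key=scores.get)
        print(f"frame {i}: {top} ({scores[top]:.2f})")
```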
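Finally, the claim logs the current emotional complex signature and tags a high-amplitude, high-confidence interaction when the signature's magnitude exceeds a predetermined threshold. The sketch below assumes the signature is a simple vector of metric samples, its magnitude is the Euclidean norm, and the threshold and confidence cutoff take arbitrary example values; none of these choices is fixed by the claim.

```python
# Hypothetical sketch: logging an emotional complex signature and tagging a
# changed emotional state when its magnitude exceeds a threshold. Treating
# the signature as a vector and using the Euclidean norm are illustrative
# assumptions; the threshold values below are likewise assumed.
import math
from dataclasses import dataclass, field

@dataclass
class EmotionalComplexSignature:
    t: float                 # timestamp within the period of time
    speech_valence: float    # sample of the speech emotion metric
    facial_score: float      # sample of the facial emotion metric
    confidence: float        # model confidence in [0, 1]

    def magnitude(self) -> float:
        return math.hypot(self.speech_valence, self.facial_score)

@dataclass
class SignatureLog:
    threshold: float = 0.75          # predetermined threshold (assumed value)
    min_confidence: float = 0.8      # "high confidence" cutoff (assumed value)
    entries: list = field(default_factory=list)
    tags: list = field(default_factory=list)

    def log(self, sig: EmotionalComplexSignature) -> None:
        self.entries.append(sig)     # logging step
        if (sig.magnitude() > self.threshold
                and sig.confidence >= self.min_confidence):
            # tag a high-amplitude, high-confidence interaction
            self.tags.append(sig.t)

if __name__ == "__main__":
    log = SignatureLog()
    log.log(EmotionalComplexSignature(3.0, -0.8, 0.6, confidence=0.9))
    log.log(EmotionalComplexSignature(4.0, 0.1, 0.1, confidence=0.95))
    print("changed-emotional-state tags at t =", log.tags)
```

In this sketch, a tag at time t would be the trigger for presenting the virtual representation of a responder via the output device.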