US 11,715,487 B2
Utilizing machine learning models to provide cognitive speaker fractionalization with empathy recognition
Mohit Chawla, New Delhi (IN); Balaji Janarthanam, Chennai (IN); Dinesh Vijayakumar, Bengaluru (IN); Sanjay Tiwari, Bengaluru (IN); Ashwini Purushothaman, Chennai (IN); Bhavika Sehgal, Ludhiana (IN); Rajesh Gala, Mumbai (IN); Saran Prasad, New Delhi (IN); Vinu Varghese, Bangalore (IN); Mohit Mahajan, Gurugram (IN); and Badarayan Panigrahi, Bangalore (IN)
Assigned to Accenture Global Solutions Limited, Dublin (IE)
Filed by Accenture Global Solutions Limited, Dublin (IE)
Filed on Mar. 31, 2021, as Appl. No. 17/218,952.
Prior Publication US 2022/0319535 A1, Oct. 6, 2022
Int. Cl. G10L 25/63 (2013.01); G06F 40/30 (2020.01); G10L 17/16 (2013.01); G10L 21/0272 (2013.01); G10L 17/02 (2013.01); G10L 25/24 (2013.01)
CPC G10L 25/63 (2013.01) [G06F 40/30 (2020.01); G10L 17/02 (2013.01); G10L 17/16 (2013.01); G10L 21/0272 (2013.01); G10L 25/24 (2013.01)]
20 Claims
8. A device, comprising:
one or more memories; and
one or more processors, communicatively coupled to the one or more memories, configured to:
receive audio data identifying a conversation including a plurality of speakers;
process the audio data, with a plurality of clustering models, to identify a plurality of speaker segments,
wherein the plurality of clustering models includes a k-means clustering model, a spectral clustering model, and an agglomerative clustering model;
determine a plurality of diarization error rates for the plurality of speaker segments;
identify a plurality of errors in the plurality of speaker segments based on comparing each of the plurality of diarization error rates to a threshold;
select a rectification model to rectify each of the plurality of errors based on a cause of a corresponding one of the plurality of errors and based on features of a corresponding one of the plurality of speaker segments;
re-segment the audio data with the rectification model to generate re-segmented audio data;
determine a plurality of modified diarization error rates for the plurality of speaker segments based on the re-segmented audio data;
select one of the plurality of speaker segments based on the plurality of modified diarization error rates;
calculate an empathy score based on the one of the plurality of speaker segments and based on an emotion score, an intent score, and a sentiment score determined based on the re-segmented audio data;
retrain the rectification model based on calculating the empathy score and utilizing the empathy score as additional training data; and
perform one or more actions based on the empathy score.
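
The error-rate and scoring steps recited in claim 8 can be illustrated with a minimal sketch. The claim does not say how a diarization error rate is computed, how it is compared to the threshold, or how the emotion, intent, and sentiment scores are combined, so everything below is an assumption made for illustration: the frame-level error definition, the "exceeds the threshold" comparison, the equal-weight average, and all names (`diarization_error_rate`, `flag_errors`, `select_best`, `empathy_score`, and the example labels) are hypothetical and are not taken from the patent.

```python
# Illustrative sketch only; all function names and data are hypothetical.

def diarization_error_rate(reference, hypothesis):
    """Simplified frame-level error: fraction of frames whose speaker label
    differs from the reference. Assumes labels are already aligned and mapped
    to the same speaker names, which real diarization scoring does not assume."""
    if not reference or len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must be non-empty and equal length")
    wrong = sum(r != h for r, h in zip(reference, hypothesis))
    return wrong / len(reference)


def flag_errors(error_rates, threshold):
    """Return the candidates whose error rate exceeds the threshold
    (assumed comparison direction)."""
    return [name for name, rate in error_rates.items() if rate > threshold]


def select_best(error_rates):
    """Pick the candidate with the lowest error rate (assumed selection rule)."""
    return min(error_rates, key=error_rates.get)


def empathy_score(emotion, intent, sentiment, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three component scores (assumed combination)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    parts = (emotion, intent, sentiment)
    return sum(w * p for w, p in zip(weights, parts)) / total


# Hypothetical per-frame speaker labels from the three clustering models.
reference = ["A", "A", "B", "B", "A", "B", "B", "A"]
candidates = {
    "k-means":       ["A", "A", "A", "B", "A", "B", "B", "A"],
    "spectral":      ["A", "B", "A", "A", "A", "B", "A", "A"],
    "agglomerative": ["A", "A", "B", "B", "B", "B", "B", "B"],
}

error_rates = {
    name: diarization_error_rate(reference, labels)
    for name, labels in candidates.items()
}
flagged = flag_errors(error_rates, threshold=0.2)   # candidates needing rectification
best = select_best(error_rates)                     # segmentation to score
score = empathy_score(emotion=0.8, intent=0.6, sentiment=0.4)
```

In this sketch the rectification, re-segmentation, and retraining steps are omitted; in the claimed device the flagged candidates would be passed to a rectification model chosen by error cause and segment features, and the error rates would be recomputed on the re-segmented audio before selection.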