US 11,996,118 B2
Selection of speech segments for training classifiers for detecting emotional valence from input speech signals
Ramesh Kumar Ramakrishnan, Bangalore (IN); Venkata Subramanian Viraraghavan, Bangalore (IN); Rahul Dasharath Gavas, Bangalore (IN); Sachin Patel, Pune (IN); and Gauri Deshpande, Pune (IN)
Assigned to TATA CONSULTANCY SERVICES LIMITED, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Oct. 19, 2021, as Appl. No. 17/504,556.
Claims priority of application No. 202021046176 (IN), filed on Oct. 22, 2020.
Prior Publication US 2022/0130414 A1, Apr. 28, 2022
Int. Cl. G10L 25/63 (2013.01); G06N 20/00 (2019.01); G10L 15/04 (2013.01); G10L 25/27 (2013.01)
CPC G10L 25/63 (2013.01) [G06N 20/00 (2019.01); G10L 15/04 (2013.01); G10L 25/27 (2013.01)] 9 Claims
 
1. A processor implemented method, comprising:
obtaining, via one or more hardware processors, (i) a speech signal corresponding to one or more users, and (ii) a corresponding text transcription of the speech signal;
splitting, via the one or more hardware processors, the speech signal into a plurality of speech segments;
determining, via the one or more hardware processors, one of (i) one or more emotion words-based speech segments, or (ii) one or more non-emotion words-based speech segments from the plurality of speech segments based on (a) the speech signal, (b) the obtained text transcription, and (c) a language specific emotion words-based dictionary;
selecting, via the one or more hardware processors, one or more training segments from the plurality of speech segments based on (i) the one or more emotion words-based speech segments, or (ii) the one or more non-emotion words-based speech segments;
training, via the one or more hardware processors, one or more classifiers using the one or more selected training segments to obtain one or more trained classifiers;
measuring, via the one or more hardware processors, an accuracy of the trained one or more classifiers to determine an optimal trained classifier among the trained one or more classifiers, wherein
the accuracy of the trained one or more classifiers is measured using one or more accuracy measuring techniques, and
the one or more accuracy measuring techniques include an Unweighted Average Recall (UAR), a Weighted Average Recall (WAR), and a Geometric Mean (GM);
detecting, via the one or more hardware processors, an emotional valence of a specific speech signal, wherein the emotional valence of the specific speech signal is detected by using the determined optimal trained classifier, and wherein the specific speech signal is different from the speech signal;
comparing, via the one or more hardware processors, the detected emotional valence of the specific speech signal with a ground truth valence;
calculating, via the one or more hardware processors, an accuracy of the detected emotional valence, wherein the accuracy of the detected emotional valence is calculated based on the comparison; and
categorizing, via the one or more hardware processors, the detected emotional valence of the specific speech signal into one of a Low valence, a Medium valence, or a High valence.
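The splitting, determining, and selecting steps of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes word-level timestamps from a forced aligner or ASR system, a fixed words-per-segment splitting rule, and a toy English emotion-word dictionary (`EMOTION_WORDS`). The claim does not disclose the dictionary contents or the segmentation rule, and every name below is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A speech segment: time span in seconds plus its aligned transcript words."""
    start: float
    end: float
    words: list

# Hypothetical toy dictionary; the claim only says it is language specific.
EMOTION_WORDS = {"happy", "sad", "angry", "love", "hate", "great", "terrible"}

def split_by_word_timestamps(words, timestamps, max_words=5):
    """Split a word-aligned transcript into segments of at most max_words words.

    Assumes one (start, end) timestamp pair per word; the fixed word count
    is an illustrative splitting rule, not one taken from the claim.
    """
    segments = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        spans = timestamps[i:i + max_words]
        segments.append(Segment(spans[0][0], spans[-1][1], chunk))
    return segments

def partition_segments(segments, dictionary=EMOTION_WORDS):
    """Return (emotion-word segments, non-emotion-word segments)."""
    emotion, non_emotion = [], []
    for seg in segments:
        tokens = {w.lower().strip(".,!?") for w in seg.words}
        # A segment counts as emotion-word based if any token is in the dictionary.
        (emotion if tokens & dictionary else non_emotion).append(seg)
    return emotion, non_emotion

def select_training_segments(segments, use_emotion_words=True):
    """Select either the emotion-word or the non-emotion-word segments for training."""
    emotion, non_emotion = partition_segments(segments)
    return emotion if use_emotion_words else non_emotion
```

Keeping both partitions lets the same pipeline train one classifier on emotion-word segments and another on non-emotion-word segments, so the two selections can then be compared by the accuracy-measuring step.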
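The accuracy measures named in claim 1 have standard definitions, sketched below in plain Python under the assumption that each classifier produces one label per utterance. UAR is the plain mean of per-class recalls; WAR weights each class recall by that class's share of the samples, which makes it equal to overall accuracy; GM is the geometric mean of per-class recalls. The helper `pick_optimal` is a hypothetical illustration of choosing the "optimal trained classifier" as the one with the highest score; the claim does not say which measure drives that choice.

```python
import math
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class present in y_true (classes only in y_pred are ignored)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / n for c, n in support.items()}

def uar(y_true, y_pred):
    """Unweighted Average Recall: every class counts equally."""
    r = per_class_recall(y_true, y_pred)
    return sum(r.values()) / len(r)

def war(y_true, y_pred):
    """Weighted Average Recall: class recalls weighted by class support."""
    r = per_class_recall(y_true, y_pred)
    support = Counter(y_true)
    n = len(y_true)
    return sum(r[c] * support[c] / n for c in r)

def gm(y_true, y_pred):
    """Geometric mean of per-class recalls; zero if any class is never recalled."""
    r = per_class_recall(y_true, y_pred)
    return math.prod(r.values()) ** (1 / len(r))

def pick_optimal(predictions_by_classifier, y_true, metric=uar):
    """Return the name of the classifier whose predictions score highest on metric."""
    return max(predictions_by_classifier,
               key=lambda name: metric(y_true, predictions_by_classifier[name]))
```

For example, with four Low and two High utterances where one High is misclassified as Low, the per-class recalls are 1.0 and 0.5, so UAR is 0.75, WAR is 5/6, and GM is about 0.707. UAR and GM penalize the missed minority class more than WAR does, which is why they are commonly reported for imbalanced emotion data.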
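The final steps of claim 1 (comparing detected valence with ground truth, computing accuracy, and categorizing into Low, Medium, or High) can be sketched as below. This assumes, purely for illustration, that the optimal classifier emits a continuous valence score in [0, 1] and that ground truth is given as Low/Medium/High labels; the thresholds `low_max` and `medium_max` are hypothetical, since the claim does not specify how the three categories are delimited.

```python
def categorize_valence(score, low_max=1 / 3, medium_max=2 / 3):
    """Map a valence score in [0, 1] to Low/Medium/High (illustrative thresholds)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"valence score out of range: {score}")
    if score < low_max:
        return "Low"
    if score < medium_max:
        return "Medium"
    return "High"

def valence_accuracy(predicted, ground_truth):
    """Fraction of utterances whose detected valence matches the ground truth."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need equal-length, non-empty label sequences")
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(predicted)
```

Usage: scores of 0.1, 0.5, 0.9, and 0.4 map to Low, Medium, High, and Medium; against ground truth of Low, Medium, High, and High, three of four match, giving an accuracy of 0.75.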