US 11,837,252 B2
Speech emotion recognition method and system based on fused population information
Taihao Li, Zhejiang (CN); Shukai Zheng, Zhejiang (CN); Yulong Liu, Zhejiang (CN); Guanxiong Pei, Zhejiang (CN); and Shijie Ma, Zhejiang (CN)
Assigned to Zhejiang Lab, Zhejiang (CN)
Filed by Zhejiang Lab, Zhejiang (CN)
Filed on Jun. 21, 2022, as Appl. No. 17/845,908.
Application 17/845,908 is a continuation of application No. PCT/CN2022/070728, filed on Jan. 7, 2022.
Claims priority of application No. 202110322720.X (CN), filed on Mar. 26, 2021.
Prior Publication US 2022/0328065 A1, Oct. 13, 2022
Int. Cl. G10L 25/63 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)
CPC G10L 25/63 (2013.01) [G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)] 4 Claims
OG exemplary drawing
 
1. A speech emotion recognition method based on fused population information, comprising the following steps:
S1: acquiring a user's audio data, expressed as Xaudio, through a recording acquisition device;
S2: preprocessing the acquired audio data Xaudio to generate a Mel spectrogram feature, expressed as Xmel;
S3: calculating the energy of the Mel spectrogram in different time frames for the generated Mel spectrogram feature Xmel, and cutting off the leading silent segment and the trailing silent segment by setting a threshold, to obtain a Mel spectrogram feature, expressed as Xinput, with a length of T;
S4: inputting the Mel spectrogram feature Xinput obtained in S3 into a population classification network to obtain population depth feature information, expressed as Hp;
S5: inputting the Mel spectrogram feature Xinput obtained in S3 into a Mel spectrogram preprocessing network to obtain Mel spectrogram depth feature information, expressed as Hm;
S6: fusing the population depth feature information Hp extracted in S4 with the Mel spectrogram depth feature information Hm extracted in S5 through a channel attention network SENet to obtain a fused feature, expressed as Hf; and
S7: inputting the fused feature Hf obtained in S6, through a pooling layer, into an emotion classification network to perform emotion recognition;
wherein the population classification network is composed of a three-layer Long Short-Term Memory (LSTM) network structure, and S4 specifically comprises the following steps:
S4_1: first, segmenting the inputted Mel spectrogram feature Xinput with the length of T into three Mel spectrogram segments of equal length in an overlapped manner, wherein the segmentation method is as follows: 0 to T/2 is segmented as a first segment, T/4 to 3T/4 is segmented as a second segment, and T/2 to T is segmented as a third segment; and
S4_2: inputting the three Mel spectrogram segments obtained in S4_1 into the three-layer LSTM network in turn, taking the last output of the three-layer LSTM network as the final state so as to obtain one hidden feature for each of the three Mel spectrogram segments, and finally averaging the three hidden features to obtain the population depth feature information Hp.
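As a concrete illustration of steps S2 and S3, the sketch below computes a Mel spectrogram with a plain-NumPy STFT and a triangular Mel filterbank, then trims leading and trailing silence by per-frame log energy relative to the loudest frame. All function names, the frame parameters (n_fft, hop, n_mels), and the -40 dB threshold are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(x, sr=16000, n_fft=512, hop=160, n_mels=40):
    """S2 sketch: power spectrogram via framed rFFT, mapped through a
    triangular Mel filterbank. Returns shape (T, n_mels)."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)               # Hann-windowed frames
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (T, n_fft//2 + 1)
    # Filterbank center frequencies spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    return power @ fb.T                                # Xmel, shape (T, n_mels)

def trim_silence(mel, threshold_db=-40.0):
    """S3 sketch: drop leading/trailing frames whose log energy falls below
    a threshold relative to the loudest frame. Returns Xinput."""
    energy = mel.sum(axis=1)
    db = 10.0 * np.log10(energy / energy.max() + 1e-10)
    keep = np.where(db > threshold_db)[0]
    return mel[keep[0]:keep[-1] + 1]
```

In practice a library routine (e.g. a dedicated audio package) would replace the hand-rolled filterbank; the point here is only the shape of the S2/S3 pipeline: waveform → (T, n_mels) feature → energy-thresholded trim.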
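Steps S4_1 and S4_2 can be sketched as follows. The overlapped split into [0, T/2), [T/4, 3T/4), [T/2, T) follows the claim; the LSTM here is a minimal, randomly initialized single-layer forward pass standing in for the patent's trained three-layer network, so its class name, weight layout, and hidden size are all assumptions for illustration only.

```python
import numpy as np

def segment_overlapped(mel):
    """S4_1: split a (T, n_mels) Mel feature into three equal-length,
    overlapped segments covering [0, T/2), [T/4, 3T/4), [T/2, T)."""
    T = mel.shape[0]
    half, quarter = T // 2, T // 4
    return [mel[:half], mel[quarter:quarter + half], mel[T - half:]]

class TinyLSTM:
    """Minimal single-layer LSTM forward pass (stand-in for the patent's
    three-layer LSTM; weights are random, purely for shape illustration)."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (4 * d_hid, d_in + d_hid))
        self.b = np.zeros(4 * d_hid)
        self.d_hid = d_hid

    def last_hidden(self, x):                       # x: (T_seg, d_in)
        h = np.zeros(self.d_hid)
        c = np.zeros(self.d_hid)
        sig = lambda z: 1.0 / (1.0 + np.exp(-z))
        for t in range(x.shape[0]):
            z = self.W @ np.concatenate([x[t], h]) + self.b
            i, f, g, o = np.split(z, 4)             # input/forget/cell/output gates
            c = sig(f) * c + sig(i) * np.tanh(g)
            h = sig(o) * np.tanh(c)
        return h                                     # S4_2: final state only

def population_feature(mel, d_hid=32):
    """S4_2: encode each segment, keep the last hidden state, average."""
    lstm = TinyLSTM(mel.shape[1], d_hid)
    hiddens = [lstm.last_hidden(seg) for seg in segment_overlapped(mel)]
    return np.mean(hiddens, axis=0)                  # Hp, shape (d_hid,)
```

Note that with this scheme each segment has length T/2 and every adjacent pair overlaps by T/4, so the three last-state features see partially shared context before being averaged into Hp.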
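Step S6's fusion through a channel attention network (SENet) might look like the following squeeze-and-excitation-style gating over the concatenated features: for time-pooled vectors the "squeeze" is trivial, so the sketch reduces to a two-layer bottleneck excitation with sigmoid rescaling. The random weights, the reduction ratio, and the choice to concatenate Hp and Hm as the channel axis are assumptions; the patent does not specify these dimensions.

```python
import numpy as np

def se_fuse(Hp, Hm, reduction=4, seed=0):
    """S6 sketch: SENet-style channel attention over the concatenation of the
    population feature Hp and the Mel depth feature Hm. Weights are random,
    for illustration only; a real model would learn W1 and W2."""
    h = np.concatenate([Hp, Hm])                     # treat entries as channels
    d = h.shape[0]
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (d // reduction, d))   # excitation: bottleneck down
    W2 = rng.normal(0.0, 0.1, (d, d // reduction))   # excitation: project back up
    relu = lambda z: np.maximum(z, 0.0)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    w = sig(W2 @ relu(W1 @ h))                       # per-channel attention in (0, 1)
    return w * h                                     # Hf: attention-rescaled features
```

The sigmoid gate keeps every attention weight in (0, 1), so the fused feature Hf is an element-wise re-weighting of [Hp; Hm] that can emphasize whichever channels (population or spectrogram) are informative, before Hf passes through the pooling layer in S7.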