US 11,837,252 B2
Speech emotion recognition method and system based on fused population information
Taihao Li, Zhejiang (CN); Shukai Zheng, Zhejiang (CN); Yulong Liu, Zhejiang (CN); Guanxiong Pei, Zhejiang (CN); and Shijie Ma, Zhejiang (CN)
Assigned to Zhejiang Lab, Zhejiang (CN)
Filed by Zhejiang Lab, Zhejiang (CN)
Filed on Jun. 21, 2022, as Appl. No. 17/845,908.
Application 17/845,908 is a continuation of application No. PCT/CN2022/070728, filed on Jan. 7, 2022.
Claims priority of application No. 202110322720.X (CN), filed on Mar. 26, 2021.
Prior Publication US 2022/0328065 A1, Oct. 13, 2022
Int. Cl. G10L 25/63 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)
CPC G10L 25/63 (2013.01) [G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)] 4 Claims
OG exemplary drawing
 
1. A speech emotion recognition method based on fused population information, comprising the following steps:
S1: acquiring a user's audio data, expressed as Xaudio, through a recording acquisition device;
S2: preprocessing the acquired audio data Xaudio to generate a Mel spectrogram feature, expressed as Xmel;
S3: calculating the energy of the Mel spectrogram in different time frames for the generated Mel spectrogram feature Xmel, and cutting off the leading silent segment and the trailing silent segment by setting a threshold, to obtain a Mel spectrogram feature, expressed as Xinput, with a length of T;
S4: inputting the Mel spectrogram feature Xinput obtained in S3 into a population classification network to obtain population depth feature information, expressed as Hp;
S5: inputting the Mel spectrogram feature Xinput obtained in S3 into a Mel spectrogram preprocessing network to obtain Mel spectrogram depth feature information, expressed as Hm;
S6: fusing the population depth feature information Hp extracted in S4 with the Mel spectrogram depth feature information Hm extracted in S5 through a channel attention network SENet to obtain a fused feature, expressed as Hf; and
S7: inputting the fused feature Hf obtained in S6, through a pooling layer, into an emotion classification network to perform emotion recognition;
wherein the population classification network is composed of a three-layer Long Short-Term Memory (LSTM) network structure, and S4 specifically comprises the following steps:
S4_1: first, segmenting the inputted Mel spectrogram feature Xinput with the length of T into three Mel spectrogram segments of equal length in an overlapped manner, wherein the segmentation method is as follows: 0 to T/2 is segmented as a first segment, T/4 to 3T/4 is segmented as a second segment, and T/2 to T is segmented as a third segment; and
S4_2: inputting the three Mel spectrogram segments obtained in S4_1 into the three-layer LSTM network in turn, taking the last output of the three-layer LSTM network as the final state so as to obtain one hidden feature for each of the three Mel spectrogram segments, and finally averaging the three hidden features to obtain the population depth feature information Hp.
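As a concrete illustration of steps S2 and S3, the sketch below computes a Mel spectrogram with a plain-NumPy STFT and a triangular Mel filterbank, then trims leading and trailing silence by per-frame log energy relative to the loudest frame. All function names, the frame parameters (n_fft, hop, n_mels), and the -40 dB threshold are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(x, sr=16000, n_fft=512, hop=160, n_mels=40):
    """S2 sketch: power spectrogram via framed rFFT, mapped through a
    triangular Mel filterbank. Returns shape (T, n_mels)."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)               # Hann-windowed frames
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (T, n_fft//2 + 1)
    # Filterbank center frequencies spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    return power @ fb.T                                # Xmel, shape (T, n_mels)

def trim_silence(mel, threshold_db=-40.0):
    """S3 sketch: drop leading/trailing frames whose log energy falls below
    a threshold relative to the loudest frame. Returns Xinput."""
    energy = mel.sum(axis=1)
    db = 10.0 * np.log10(energy / energy.max() + 1e-10)
    keep = np.where(db > threshold_db)[0]
    return mel[keep[0]:keep[-1] + 1]
```

In practice a library routine (e.g. a dedicated audio package) would replace the hand-rolled filterbank; the point here is only the shape of the S2/S3 pipeline: waveform → (T, n_mels) feature → energy-thresholded trim.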
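Steps S4_1 and S4_2 can be sketched as follows. The overlapped split into [0, T/2), [T/4, 3T/4), [T/2, T) follows the claim; the LSTM here is a minimal, randomly initialized single-layer forward pass standing in for the patent's trained three-layer network, so its class name, weight layout, and hidden size are all assumptions for illustration only.

```python
import numpy as np

def segment_overlapped(mel):
    """S4_1: split a (T, n_mels) Mel feature into three equal-length,
    overlapped segments covering [0, T/2), [T/4, 3T/4), [T/2, T)."""
    T = mel.shape[0]
    half, quarter = T // 2, T // 4
    return [mel[:half], mel[quarter:quarter + half], mel[T - half:]]

class TinyLSTM:
    """Minimal single-layer LSTM forward pass (stand-in for the patent's
    three-layer LSTM; weights are random, purely for shape illustration)."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (4 * d_hid, d_in + d_hid))
        self.b = np.zeros(4 * d_hid)
        self.d_hid = d_hid

    def last_hidden(self, x):                       # x: (T_seg, d_in)
        h = np.zeros(self.d_hid)
        c = np.zeros(self.d_hid)
        sig = lambda z: 1.0 / (1.0 + np.exp(-z))
        for t in range(x.shape[0]):
            z = self.W @ np.concatenate([x[t], h]) + self.b
            i, f, g, o = np.split(z, 4)             # input/forget/cell/output gates
            c = sig(f) * c + sig(i) * np.tanh(g)
            h = sig(o) * np.tanh(c)
        return h                                     # S4_2: final state only

def population_feature(mel, d_hid=32):
    """S4_2: encode each segment, keep the last hidden state, average."""
    lstm = TinyLSTM(mel.shape[1], d_hid)
    hiddens = [lstm.last_hidden(seg) for seg in segment_overlapped(mel)]
    return np.mean(hiddens, axis=0)                  # Hp, shape (d_hid,)
```

Note that with this scheme each segment has length T/2 and every adjacent pair overlaps by T/4, so the three last-state features see partially shared context before being averaged into Hp.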
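Step S6's fusion through a channel attention network (SENet) might look like the following squeeze-and-excitation-style gating over the concatenated features: for time-pooled vectors the "squeeze" is trivial, so the sketch reduces to a two-layer bottleneck excitation with sigmoid rescaling. The random weights, the reduction ratio, and the choice to concatenate Hp and Hm as the channel axis are assumptions; the patent does not specify these dimensions.

```python
import numpy as np

def se_fuse(Hp, Hm, reduction=4, seed=0):
    """S6 sketch: SENet-style channel attention over the concatenation of the
    population feature Hp and the Mel depth feature Hm. Weights are random,
    for illustration only; a real model would learn W1 and W2."""
    h = np.concatenate([Hp, Hm])                     # treat entries as channels
    d = h.shape[0]
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (d // reduction, d))   # excitation: bottleneck down
    W2 = rng.normal(0.0, 0.1, (d, d // reduction))   # excitation: project back up
    relu = lambda z: np.maximum(z, 0.0)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    w = sig(W2 @ relu(W1 @ h))                       # per-channel attention in (0, 1)
    return w * h                                     # Hf: attention-rescaled features
```

The sigmoid gate keeps every attention weight in (0, 1), so the fused feature Hf is an element-wise re-weighting of [Hp; Hm] that can emphasize whichever channels (population or spectrogram) are informative, before Hf passes through the pooling layer in S7.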