CPC G10L 25/63 (2013.01) [G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/30 (2013.01)] | 4 Claims |
1. A speech emotion recognition method based on fused population information, comprising the following steps:
S1: acquiring a user's audio data, expressed as Xaudio, through a recording acquisition device;
S2: preprocessing the acquired audio data Xaudio to generate a Mel spectrogram feature, expressed as Xmel;
S3: calculating energy of Mel spectrograms in different time frames for the generated Mel spectrogram feature Xmel, cutting off a front mute segment and a rear mute segment by setting a threshold to obtain a Mel spectrogram feature, expressed as Xinput, with a length of T;
S4: inputting the Mel spectrogram feature Xinput obtained in S3 into a population classification network to obtain population depth feature information, expressed as Hp;
S5: inputting the Mel spectrogram feature Xinput obtained in S3 into a Mel spectrogram preprocessing network to obtain Mel spectrogram depth feature information, expressed as Hm;
S6: fusing the population depth feature information Hp extracted in S4 with the Mel spectrogram depth feature information Hm extracted in S5 through a channel attention network SENet to obtain a fused feature, expressed as Hf; and
S7: inputting the fused feature Hf in S6 into the population classification network through a pooling layer to perform emotion recognition;
the population classification network is composed of a three-layer Long Short Term Memory (LSTM) network structure, and the S4 specifically comprises the following steps:
S4_1: first, segmenting the inputted Mel spectrogram feature Xinput with the length of T into three Mel spectrogram segments, each of length T/2, in an overlapped manner, wherein the segmentation method is as follows: 0 to T/2 is segmented as a first segment, T/4 to 3T/4 is segmented as a second segment, and T/2 to T is segmented as a third segment; and
S4_2: inputting the three Mel spectrogram segments segmented in S4_1 into the three-layer LSTM network in turn, then taking the last output from the three-layer LSTM network as a final state, obtaining three hidden features for the three Mel spectrogram segments at last, and finally averaging the three hidden features to obtain the final population feature information Hp.
|
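The silence trimming of S3 and the overlapped segmentation and averaging of S4 can be illustrated with a minimal NumPy sketch. It is not the claimed implementation. The function names, the (frames, Mel bins) array layout, the per-frame energy taken as the sum over Mel bins, and the segment boundaries (half-length segments with a quarter-length hop, one split consistent with three equal overlapping segments spanning 0 to T) are all assumptions made for illustration. The `encode` argument is a hypothetical stand-in for the three-layer LSTM's final state.

```python
import numpy as np

def trim_silence(x_mel, threshold):
    """Cut leading and trailing low-energy frames (sketch of S3).

    x_mel: array of shape (n_frames, n_mels), assumed to hold power values.
    """
    # Assumption: frame energy is the sum of the Mel bins in that frame.
    energy = x_mel.sum(axis=1)
    voiced = np.flatnonzero(energy >= threshold)
    if voiced.size == 0:
        return x_mel[:0]  # nothing above the threshold
    # Keep everything between the first and last frame above the threshold.
    return x_mel[voiced[0]:voiced[-1] + 1]

def split_overlapping(x_input):
    """Split T frames into three equal-length overlapping segments (sketch of S4_1)."""
    t = x_input.shape[0]
    half, quarter = t // 2, t // 4
    # Assumed boundaries: [0, T/2), [T/4, 3T/4), [T/2, T).
    starts = (0, quarter, t - half)
    return [x_input[s:s + half] for s in starts]

def population_feature(x_input, encode):
    """Average the per-segment hidden features (sketch of S4_2).

    encode: hypothetical callable mapping a segment to a feature vector,
    standing in for the final state of the three-layer LSTM.
    """
    hidden = [encode(seg) for seg in split_overlapping(x_input)]
    return np.mean(hidden, axis=0)
```

For example, `population_feature(trim_silence(x_mel, 0.5), encode)` would return one vector playing the role of Hp for a single utterance, given a suitable `encode`.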