CPC A61B 5/165 (2013.01) [A61B 5/4803 (2013.01); A61B 5/7275 (2013.01); G06F 18/253 (2023.01); G06N 3/08 (2013.01); G06T 7/0012 (2013.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01); G10L 25/30 (2013.01); G10L 25/57 (2013.01); G10L 25/63 (2013.01); G10L 25/66 (2013.01); G06T 2207/10016 (2013.01)] | 10 Claims |
1. An automatic depression detection method using audio and video, characterized in that the method comprises the steps of:
S1, acquiring original data comprising two modalities, a long-term audio file and a long-term video file, from an audio-video file;
S2, sampling the long-term audio file at a predetermined sampling rate and dividing it into a plurality of audio segments, and likewise sampling the long-term video file at a predetermined sampling rate and dividing it into a plurality of video segments;
S3, inputting each audio segment into an audio feature extraction network to obtain in-depth audio features, the audio feature extraction network comprising a dilated (expanded) convolution layer and a temporal pooling layer;
wherein inputting each audio segment into the audio feature extraction network to obtain the in-depth audio features comprises: first, performing dilated convolution on the input audio three times, wherein the quantity of convolution kernels is set to 256, the convolution kernel size is set to 2, the dilation rate is set to 2, the quantity of convolution layers is set to 4, the quantity of input channels is 1, the quantity of output channels is 256, and the data length is 256; then, performing down-sampling through the temporal pooling layer, wherein the quantity of channels and the data length are each set to 128, so that the in-depth audio features contain temporal dynamic information; and
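The audio branch described above can be sketched in plain NumPy. The dimensions (1 input channel, 256 kernels of size 2, dilation 2, 4 layers, length 256 pooled to 128 channels × 128 steps) follow the claim; the causal left-padding, ReLU nonlinearity, stride-2 average pooling, and the learned 256→128 channel projection are assumptions, and the random weights stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def dilated_conv1d(x, w, dilation=2):
    # x: (in_ch, T); w: (out_ch, in_ch, k). Causal left-padding keeps length T.
    out_ch, in_ch, k = w.shape
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), ((k - 1) * dilation, 0)))
    y = np.zeros((out_ch, T))
    for j in range(k):
        # y[:, t] += w[:, :, j] @ xp[:, t + j*dilation] for every t at once
        y += np.einsum('oi,it->ot', w[:, :, j], xp[:, j * dilation : j * dilation + T])
    return np.maximum(y, 0.0)  # ReLU nonlinearity (assumed)

x = rng.standard_normal((1, 256))                      # one audio segment: 1 channel, length 256
h = dilated_conv1d(x, rng.standard_normal((256, 1, 2)) * 0.1)
for _ in range(3):                                     # 4 dilated layers in total, per the claim
    h = dilated_conv1d(h, rng.standard_normal((256, 256, 2)) * 0.05)

# temporal pooling: stride-2 average over time, then a 256 -> 128 channel projection
h = h.reshape(256, 128, 2).mean(axis=2)                # data length 256 -> 128
proj = rng.standard_normal((128, 256)) * 0.05
audio_feat = proj @ h                                  # channels 256 -> 128
print(audio_feat.shape)                                # (128, 128)
```

The dilated (expanded) convolution enlarges the receptive field over time without extra parameters, which is why the pooled output can still carry temporal dynamic information.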
inputting each video segment into a video feature extraction network to obtain in-depth video features, the video feature extraction network comprising a 3D convolution layer and a bidirectional long short-term memory network module;
wherein inputting each video segment into the video feature extraction network to obtain the in-depth video features comprises: first, performing 3D convolution on the input video frames, wherein the quantity of convolution kernels is set to 8, the convolution kernel size is set to 3×3×3, and the stride is set to (2, 2, 2); then, inputting the output of the 3D convolution layer into the bidirectional long short-term memory network with 64 output nodes to capture a temporal representation of the video;
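The video branch can be sketched similarly. Only the 8 kernels of size 3×3×3 with stride (2, 2, 2) and the 64-unit bidirectional LSTM come from the claim; the 16-frame 32×32 grayscale segment, the valid (unpadded) convolution, the flattening of each temporal step into one vector, and the weight sharing between the two LSTM directions (kept only for brevity) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3d(x, w, stride=2):
    # x: (D, H, W) single-channel volume; w: (out_ch, 3, 3, 3); valid convolution
    out_ch, kd, kh, kw = w.shape
    D, H, W = x.shape
    od, oh, ow = (D - kd) // stride + 1, (H - kh) // stride + 1, (W - kw) // stride + 1
    y = np.zeros((out_ch, od, oh, ow))
    for d in range(od):
        for i in range(oh):
            for j in range(ow):
                patch = x[d*stride:d*stride+kd, i*stride:i*stride+kh, j*stride:j*stride+kw]
                y[:, d, i, j] = np.tensordot(w, patch, axes=3)
    return np.maximum(y, 0.0)

def lstm(seq, Wx, Wh, hidden):
    # seq: (T, F); minimal LSTM returning the hidden state at every step
    h, c, out = np.zeros(hidden), np.zeros(hidden), []
    for x in seq:
        z = Wx @ x + Wh @ h                              # gates stacked as i, f, o, g
        i, f, o = (1 / (1 + np.exp(-z[k*hidden:(k+1)*hidden])) for k in range(3))
        g = np.tanh(z[3*hidden:])
        c = f * c + i * g
        h = o * np.tanh(c)
        out.append(h)
    return np.array(out)

frames = rng.standard_normal((16, 32, 32))               # a video segment: 16 grayscale 32x32 frames
feat = conv3d(frames, rng.standard_normal((8, 3, 3, 3)) * 0.1)   # (8, 7, 15, 15)
seq = feat.transpose(1, 0, 2, 3).reshape(7, -1)          # one feature vector per temporal step
F, Hd = seq.shape[1], 64
Wx = rng.standard_normal((4 * Hd, F)) * 0.01
Wh = rng.standard_normal((4 * Hd, Hd)) * 0.01
fwd = lstm(seq, Wx, Wh, Hd)                              # forward pass over time
bwd = lstm(seq[::-1], Wx, Wh, Hd)[::-1]                  # backward pass, re-reversed
video_feat = np.concatenate([fwd, bwd], axis=1)          # (7, 128) temporal representation
print(video_feat.shape)
```

The stride-2 3D convolution halves the spatial and temporal resolution, and the bidirectional LSTM then summarizes the remaining steps from both directions, which is one way to "capture a temporal representation of the video."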
S4, processing the in-depth audio features and the in-depth video features with a multi-head attention mechanism to obtain attention audio features and attention video features;
S5, aggregating the attention audio features and the attention video features into audio-video features through a feature aggregation model; and
S6, inputting the audio-video features into a decision network to predict a depression level of an individual in the audio-video file.
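Steps S5 and S6 can be sketched end to end. The claim does not specify the aggregation model or the decision network, so this sketch assumes temporal mean pooling plus concatenation for aggregation, a two-layer MLP with a softmax over 4 hypothetical severity levels for the decision network, and random weights in place of trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def decision_network(av_feat, n_levels=4):
    # two-layer MLP producing a distribution over depression severity levels (assumed head)
    W1 = rng.standard_normal((64, av_feat.size)) * 0.05
    W2 = rng.standard_normal((n_levels, 64)) * 0.05
    h = np.maximum(W1 @ av_feat, 0.0)
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()

attn_audio = rng.standard_normal((128, 128))    # attention audio features (time, channels)
attn_video = rng.standard_normal((7, 128))      # attention video features

# S5: feature aggregation - temporal mean per modality, then concatenation
av_feat = np.concatenate([attn_audio.mean(axis=0), attn_video.mean(axis=0)])  # (256,)

# S6: predict the depression level of the individual
p = decision_network(av_feat)
print(p.argmax(), p)
```

Mean pooling removes the differing segment lengths of the two modalities, so the decision network sees one fixed-size audio-video vector regardless of how many segments step S2 produced.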