US 11,810,359 B2
Video semantic segmentation method based on active learning
Xin Yang, Liaoning (CN); Xiaopeng Wei, Liaoning (CN); Yu Qiao, Liaoning (CN); Qiang Zhang, Liaoning (CN); Baocai Yin, Liaoning (CN); Haiyin Piao, Liaoning (CN); and Zhenjun Du, Liaoning (CN)
Assigned to DALIAN UNIVERSITY OF TECHNOLOGY, Liaoning (CN)
Filed by DALIAN UNIVERSITY OF TECHNOLOGY, Liaoning (CN)
Filed on Dec. 21, 2021, as Appl. No. 17/557,933.
Claims priority of application No. 202110012126.0 (CN), filed on Jan. 6, 2021.
Prior Publication US 2022/0215662 A1, Jul. 7, 2022
Int. Cl. G06V 10/00 (2022.01); G06V 20/40 (2022.01); G06V 10/46 (2022.01); G06V 10/82 (2022.01); G06T 3/40 (2006.01); G06T 7/215 (2017.01); G06T 9/00 (2006.01); G06V 10/72 (2022.01); G06V 10/764 (2022.01); G06V 10/778 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06T 7/10 (2017.01); G06F 18/21 (2023.01); G06F 18/214 (2023.01)
CPC G06V 20/49 (2022.01) [G06F 18/217 (2023.01); G06F 18/2155 (2023.01); G06T 3/4007 (2013.01); G06T 3/4046 (2013.01); G06T 7/10 (2017.01); G06T 7/215 (2017.01); G06T 9/002 (2013.01); G06V 10/46 (2022.01); G06V 10/72 (2022.01); G06V 10/764 (2022.01); G06V 10/776 (2022.01); G06V 10/778 (2022.01); G06V 10/7753 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06T 2207/10016 (2013.01); G06T 2207/10024 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01)] 1 Claim
 
1. A video semantic segmentation method based on active learning, comprising an image semantic segmentation module, a data selection module based on the active learning and a label propagation module; wherein the image semantic segmentation module is responsible for segmenting image results and extracting high-level features required by the data selection module based on active learning; the data selection module based on active learning selects a data subset with rich information at an image level, and selects pixel blocks to be labeled at a pixel level; the label propagation module realizes migration from image to video tasks and completes the segmentation result of a video quickly to obtain weakly-supervised data;
(1) Image Semantic Segmentation Module
the image semantic segmentation module is composed of an improved fully convolutional network; the backbone network architecture adopts the MobileNetV2 structure to extract the features of RGB images; after obtaining high-level feature information, a decoder converts the number of feature channels into the number of categories to achieve per-pixel classification; and finally, a semantic label image with classification information of the same size as the RGB images is obtained by upsampling;
(1.1) Input of the Image Semantic Segmentation Module:
the semantic segmentation network places no size limit on the input RGB images, but the selection strategy at the pixel level requires images of a fixed size, so the input training data is resized; the input training data is divided into two parts: one part comprises the RGB images, denoted as x, and the other part comprises the corresponding semantic labels, denoted as y; the input data is adjusted in the following way:
X=B(x)  (1)
Y=N(y)  (2)
wherein B(x) denotes bilinear interpolation applied to the RGB images, and N(y) denotes nearest neighbor interpolation applied to the semantic labels;
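the resizing in formulas (1) and (2) can be sketched as follows in PyTorch; the helper name resize_pair and the target size of 512×1024 (chosen so that the high-level feature of (1.2) becomes 16×32 at an output stride of 32) are illustrative assumptions, not part of the claim:

    import torch
    import torch.nn.functional as F

    def resize_pair(x, y, size=(512, 1024)):
        # X = B(x): bilinear interpolation of the RGB batch (N, 3, H, W), formula (1)
        X = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
        # Y = N(y): nearest neighbor interpolation of the label batch (N, H, W), formula (2)
        Y = F.interpolate(y.unsqueeze(1).float(), size=size, mode="nearest")
        return X, Y.squeeze(1).long()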
(1.2) Feature Extraction Encoder Module:
the RGB images are fed into the semantic segmentation network; firstly, the number of channels is converted from 3 to 32 through an initial convolution layer, whose feature is denoted as Finit; then, a high-level feature with height and width of 16 and 32 is obtained by seven residual convolutions; bottleneck residual blocks of MobileNetV2 are used, and the final number of channels is 320; therefore, the size of the high-level feature (HLF) is 16×32×320; the concatenation of the initial feature Finit and the features that pass through the first 3 bottleneck residual blocks is used as a low-level feature (LLF); LLF is expressed as:
LLF=[Finit,BN_1(x),BN_2(x),BN_3(x)]  (3)
wherein BN_1(x), BN_2(x) and BN_3(x) represent the features that pass through the first 3 residual blocks, respectively; [ ] is the concatenation operation;
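a possible realization of the encoder and of formula (3) is sketched below; it borrows the MobileNetV2 bottlenecks from torchvision as a stand-in for the backbone, and resamples the block outputs to the spatial size of Finit before concatenation, since the claim does not state how features of different resolutions are aligned:

    import torch
    import torch.nn.functional as F
    from torchvision.models import mobilenet_v2

    class Encoder(torch.nn.Module):
        def __init__(self):
            super().__init__()
            feats = mobilenet_v2(weights=None).features
            self.stem = feats[0]        # initial convolution, 3 -> 32 channels (Finit)
            self.blocks = feats[1:18]   # bottleneck residual blocks; the last outputs 320 channels

        def forward(self, x):
            f_init = self.stem(x)
            low = [f_init]
            f = f_init
            for i, block in enumerate(self.blocks):
                f = block(f)
                if i < 3:               # BN_1(x), BN_2(x), BN_3(x)
                    low.append(F.interpolate(f, size=f_init.shape[2:],
                                             mode="bilinear", align_corners=False))
            hlf = f                      # HLF: 16x32x320 for a 512x1024 input
            llf = torch.cat(low, dim=1)  # LLF = [Finit, BN_1(x), BN_2(x), BN_3(x)], formula (3)
            return hlf, llf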
(1.3) Decoder Module:
the above high-level feature HLF is sampled by atrous convolutions with different sampling rates through an atrous spatial pyramid pooling (ASPP) module; the sampled feature is fused with the low-level feature LLF and input into the decoder module, which decodes the number of channels so that the final number of channels equals the number of object categories in the image; the whole process is described as follows:
Fdecode=DEC(FASPP,LLF)  (4)
where FASPP is the fused feature output by the ASPP; DEC represents the decoder module designed by the method; FASPP passes through a convolution layer to make its dimension the same as that of the LLF feature; the two features are concatenated along the channel dimension and pass through a deconvolution layer to obtain Fdecode; Fdecode is then input into a bilinear upsampling layer, so that the feature is converted to the same size as the original RGB image; each pixel on the image corresponds to a predicted category result Fclass;
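the decoder of formula (4) could be sketched as follows; torchvision's ASPP block stands in for the atrous spatial pyramid pooling, and the channel counts (256 for ASPP, 96 for LLF, i.e. 32+16+24+24 under the MobileNetV2 channel layout above), the atrous rates, and num_classes=19 are illustrative assumptions:

    import torch
    import torch.nn.functional as F
    from torchvision.models.segmentation.deeplabv3 import ASPP

    class Decoder(torch.nn.Module):
        def __init__(self, hlf_channels=320, llf_channels=96, num_classes=19):
            super().__init__()
            self.aspp = ASPP(hlf_channels, atrous_rates=[6, 12, 18], out_channels=256)
            self.reduce = torch.nn.Conv2d(256, llf_channels, kernel_size=1)  # match the LLF dimension
            self.deconv = torch.nn.ConvTranspose2d(2 * llf_channels, num_classes,
                                                   kernel_size=4, stride=2, padding=1)

        def forward(self, hlf, llf, out_size):
            f_aspp = self.aspp(hlf)                                          # FASPP
            f_aspp = F.interpolate(self.reduce(f_aspp), size=llf.shape[2:],
                                   mode="bilinear", align_corners=False)
            f_decode = self.deconv(torch.cat([f_aspp, llf], dim=1))          # Fdecode = DEC(FASPP, LLF)
            logits = F.interpolate(f_decode, size=out_size,
                                   mode="bilinear", align_corners=False)     # upsample to image size
            return logits, f_decode                                          # Fclass = logits.argmax(dim=1)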
(2) Data Selection Module Based on the Active Learning
(2.1) Image-Level Data Selection Module:
after the RGB image passes through the image semantic segmentation module, a final predicted result Fclass is obtained, and the intermediate feature Fdecode extracted from the decoder module by the method is used as the input of the image-level data selection module; Fdecode is input into a designed matcher rating network; firstly, a convolution kernel is applied to the input feature, and a global pooling layer reduces the last two dimensions to obtain a vector Vclass with the same size as the number of categories; Vclass is fed into three fully connected layers, and the number of channels is decreased successively from the number of categories to 16, 8 and 1 to finally obtain a value S; the closer S is to 0, the better the performance of the image semantic segmentation module on the selected image; otherwise, the effect is worse;
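a minimal sketch of such a rating network is given below, assuming PyTorch; the 1×1 convolution, the ReLU activations between the fully connected layers, and global average pooling as the pooling operation are assumptions, since the claim only fixes the channel progression (number of categories, 16, 8, 1):

    import torch

    class RatingNet(torch.nn.Module):
        def __init__(self, in_channels, num_classes):
            super().__init__()
            # convolution kernel bringing Fdecode to num_classes channels
            self.conv = torch.nn.Conv2d(in_channels, num_classes, kernel_size=1)
            # three fully connected layers: num_classes -> 16 -> 8 -> 1
            self.fc = torch.nn.Sequential(
                torch.nn.Linear(num_classes, 16), torch.nn.ReLU(inplace=True),
                torch.nn.Linear(16, 8), torch.nn.ReLU(inplace=True),
                torch.nn.Linear(8, 1),
            )

        def forward(self, f_decode):
            # global pooling over the last two (spatial) dimensions gives Vclass
            v_class = self.conv(f_decode).mean(dim=(2, 3))
            s = self.fc(v_class)   # value S: the closer to 0, the better the segmentation
            return v_class, s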
the loss of the image semantic segmentation network during training is calculated with a cross entropy function, which is expressed as formula (5):
Lseg = −Σ_{c=1}^{M} yc log(pc)  (5)
wherein M represents the number of categories; yc is an indicator variable, which is 1 if the observed sample belongs to category c and 0 otherwise; pc represents the predicted probability that the observed sample belongs to category c; after Vclass is obtained by the data selection module based on the active learning, the MSE loss function of the following formula (6) is designed to improve the performance of the selection module:
Lpre = (Lseg − Vclass)²  (6)
wherein Lseg is the loss obtained during the training of the image semantic segmentation module, and Vclass is the value obtained by the selection module; the gap between the two is reduced by iterative optimization with an optimizer, so that the selection module is trained and optimized; the overall loss function is expressed by formula (7):
Ltotal=Lseg+λLpre  (7)
wherein λ is a hyperparameter used to control the proportion of Lpre in the whole loss, and the value of λ ranges from 0 to 1; after the training, prediction is performed on unlabeled data with fixed parameters, and each image obtains a corresponding Lpre; the Lpre values are sorted to select the first N images with the maximum values as the data subset to be labeled in the next round;
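the losses of formulas (5) through (7) and the ranking step might be sketched as follows; here predicted_value corresponds to the selection-module output (Vclass in formula (6)), and detaching Lseg inside Lpre (so the gap trains only the selection branch) and the default λ = 0.5 are assumptions not stated in the claim:

    import torch
    import torch.nn.functional as F

    def total_loss(logits, labels, predicted_value, lam=0.5):
        # Lseg: cross entropy between segmentation logits and labels, formula (5)
        l_seg = F.cross_entropy(logits, labels)
        # Lpre: squared gap between Lseg and the value predicted by the selection module, formula (6)
        l_pre = (l_seg.detach() - predicted_value) ** 2
        # Ltotal = Lseg + lambda * Lpre, formula (7)
        return l_seg + lam * l_pre.mean()

    def select_images(l_pre_per_image, n):
        # sort predicted losses of unlabeled images and keep the N largest as the next subset to label
        return torch.argsort(l_pre_per_image, descending=True)[:n]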
(2.2) Pixel-Level Data Selection Module:
after passing through the image-level data selection module, a data subset to be labeled is selected; the selected data subset is fed into the network to obtain the distribution of information entropy on each image; the information entropy is calculated by vote entropy, which is improved on the basis of formula (5) and expressed as follows:

E = −Σ_{c=1}^{M} (V(c)/D) log(V(c)/D)  (8)
wherein V(c) represents the number of votes received by category c, and D represents the total number of votes and is set to 20; then, a pixel window of 16×16 size is slid over the image to calculate the information in each pixel window; and finally, the pixel windows with the most information are selected by sorting;
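a sketch of the pixel-level selection under these definitions is given below (assuming PyTorch); obtaining the D = 20 votes from stochastic forward passes, and sliding the 16×16 window with a stride equal to the window size, are assumptions, since the claim does not fix either detail:

    import torch
    import torch.nn.functional as F

    def vote_entropy(prob_stack, eps=1e-12):
        # prob_stack: (D, C, H, W) class probabilities from D votes (e.g. D = 20 stochastic passes)
        D, C = prob_stack.shape[0], prob_stack.shape[1]
        votes = prob_stack.argmax(dim=1)                                    # (D, H, W) per-pass class votes
        counts = torch.stack([(votes == c).sum(dim=0) for c in range(C)])   # (C, H, W) vote counts V(c)
        p = counts.float() / D
        return -(p * torch.log(p + eps)).sum(dim=0)                         # per-pixel vote entropy, formula (8)

    def top_windows(entropy_map, n, win=16):
        # sum the entropy inside each 16x16 window, then keep the n windows with the most information
        pooled = F.avg_pool2d(entropy_map[None, None], kernel_size=win, stride=win) * win * win
        cols = pooled.shape[-1]                                             # windows per image row
        idx = torch.argsort(pooled.flatten(), descending=True)[:n]
        return [(int(i) // cols * win, int(i) % cols * win) for i in idx]   # top-left window corners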
(3) Label Propagation Module
the data selection module based on the active learning selects a frame t and obtains a moving distance (δx, δy) of each pixel between the frame t and a frame t+1 through optical flow estimation, described as follows:
p(δx,δy)=OF(t,t+1)  (9)
wherein p(δx,δy) is the moving distance of the pixel; in the method, the existing FlowNetS is used as a propagation module to estimate the moving distance of the pixel; after the moving distance p(δx,δy) of the pixel is obtained, the semantic segmentation label of the frame t is input and mapped pixel by pixel to obtain the semantic segmentation result of the frame t+1; the whole process is described as follows:
Gt+1=warp(Gt,p(δx,δy))  (10)
wherein warp is a pixel warping function, that is, the pixels corresponding to Gt on the RGB images are shifted by p(δx,δy) in the x and y directions.
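A minimal sketch of the warp in formula (10) is given below, assuming PyTorch; the flow tensor layout (N, 2, H, W), nearest-neighbour sampling so that class indices are preserved, and the sign convention of the offsets depend on the flow estimator and are assumptions here:

    import torch
    import torch.nn.functional as F

    def propagate_label(label_t, flow):
        # label_t: (N, H, W) class indices of frame t (Gt); flow: (N, 2, H, W) per-pixel offsets p(dx, dy)
        h, w = label_t.shape[-2], label_t.shape[-1]
        ys, xs = torch.meshgrid(torch.arange(h, device=flow.device, dtype=flow.dtype),
                                torch.arange(w, device=flow.device, dtype=flow.dtype),
                                indexing="ij")
        # shift every pixel coordinate by (dx, dy) and normalize to [-1, 1] for grid_sample
        grid_x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1
        grid_y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1
        grid = torch.stack([grid_x, grid_y], dim=-1)                 # (N, H, W, 2) sampling grid
        warped = F.grid_sample(label_t[:, None].float(), grid,
                               mode="nearest", align_corners=True)
        return warped[:, 0].long()                                   # G_{t+1}, formula (10)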