US 12,073,321 B2
Method and apparatus for training image caption model, and storage medium
Yang Feng, Shenzhen (CN); Lin Ma, Shenzhen (CN); Wei Liu, Shenzhen (CN); and Jiebo Luo, Shenzhen (CN)
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, Shenzhen (CN)
Filed by Tencent Technology (Shenzhen) Company Limited, Shenzhen (CN)
Filed on Oct. 20, 2020, as Appl. No. 17/075,618.
Application 17/075,618 is a continuation of application No. PCT/CN2019/094891, filed on Jul. 5, 2019.
Claims priority of application No. 201811167476.9 (CN), filed on Oct. 8, 2018.
Prior Publication US 2021/0034981 A1, Feb. 4, 2021
Int. Cl. G06N 3/08 (2023.01); G06N 3/045 (2023.01); G06N 3/047 (2023.01)
CPC G06N 3/08 (2013.01) [G06N 3/045 (2023.01); G06N 3/047 (2023.01)]
18 Claims
 
1. A method for training an image caption model, performed by an electronic device, the image caption model comprising an encoding convolutional neural network (CNN), a decoding recurrent neural network (RNN), and a discriminative RNN, and the method comprising:
obtaining an image eigenvector of an image sample by using the encoding CNN;
decoding the image eigenvector by using the decoding RNN, to obtain a sentence used for describing the image sample;
determining a matching degree between the sentence obtained through decoding and the image sample, further including:
identifying objects in the image sample from an object detection model;
comparing the identified objects with object-representing words in the sentence obtained through decoding; and
determining the matching degree based on the comparison result and weights corresponding to the objects;
determining a smoothness degree of the sentence obtained through decoding, further including:
inputting the sentence obtained through decoding into the discriminative RNN, and obtaining a first output of the discriminative RNN at each time point corresponding to the sentence obtained through decoding; and
determining the smoothness degree of the sentence obtained through decoding according to the first output of the discriminative RNN at each time point;
adjusting weighting parameters of the decoding RNN to improve the matching degree and the smoothness degree; and
adjusting the discriminative RNN according to the first output of the discriminative RNN at each time point, further including:
inputting a smooth sentence sample into the discriminative RNN, and obtaining a second output of the discriminative RNN at each time point corresponding to the smooth sentence sample, wherein the smooth sentence sample is irrelevant to the sentence obtained through decoding;
determining a discrimination loss of the discriminative RNN according to the first output and the second output of the discriminative RNN at each time point; and
adjusting weighting parameters of the discriminative RNN to reduce the discrimination loss of the discriminative RNN.
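
The three quantities named in claim 1 (matching degree, smoothness degree, and discrimination loss) can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not fix any formula, so every formula below is an assumption: object weights are treated as detector confidences, the discriminative RNN's output at each time point is treated as a probability that the partial sentence is smooth, the discrimination loss is a standard binary cross-entropy, and the combined reward with its `lam` coefficient is hypothetical. All function names are hypothetical as well.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def matching_degree(detected, sentence):
    """Weighted share of detected objects that the caption mentions.

    `detected` maps an object label to its weight (assumed here to be a
    detector confidence). The claim only says the result depends on the
    comparison and on the weights, so this ratio is an assumption.
    """
    words = set(sentence.lower().split())
    total = sum(detected.values())
    if total == 0:
        return 0.0
    matched = sum(w for label, w in detected.items() if label.lower() in words)
    return matched / total


def smoothness_degree(first_outputs):
    """Mean log-probability of 'smooth' over the time points of the decoded
    sentence (assumed form; higher means smoother)."""
    return sum(math.log(max(q, EPS)) for q in first_outputs) / len(first_outputs)


def discrimination_loss(first_outputs, second_outputs):
    """Binary cross-entropy over time points (assumed form).

    first_outputs:  discriminator outputs for the decoded sentence (target 0)
    second_outputs: discriminator outputs for the smooth sentence sample (target 1)
    """
    loss_decoded = -sum(math.log(max(1.0 - q, EPS)) for q in first_outputs) / len(first_outputs)
    loss_smooth = -sum(math.log(max(q, EPS)) for q in second_outputs) / len(second_outputs)
    return loss_decoded + loss_smooth


def decoder_reward(matching, smoothness, lam=1.0):
    """Hypothetical scalar reward that the decoding RNN would be tuned to raise."""
    return smoothness + lam * matching


if __name__ == "__main__":
    detected = {"dog": 0.9, "frisbee": 0.6, "tree": 0.5}
    m = matching_degree(detected, "a dog catches a frisbee")
    s = smoothness_degree([0.8, 0.7, 0.9, 0.6, 0.8])
    d = discrimination_loss([0.8, 0.7, 0.9, 0.6, 0.8], [0.9, 0.95, 0.9])
    print(m, s, d, decoder_reward(m, s))
```

In this sketch the decoding RNN would be adjusted to raise `decoder_reward`, while the discriminative RNN would be adjusted to lower `discrimination_loss`, mirroring the two adjusting steps of the claim. How the adjustment itself is carried out (for example, by a policy-gradient update for the decoder) is not specified in the claim and is left out here.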