CPC G06N 3/08 (2013.01) [G06N 3/045 (2023.01); G06N 3/047 (2023.01)] | 18 Claims |
1. A method for training an image caption model, performed by an electronic device, the image caption model comprising an encoding convolutional neural network (CNN), a decoding recurrent neural network (RNN), and a discriminative RNN, and the method comprising:
obtaining an image eigenvector of an image sample by using the encoding CNN;
decoding the image eigenvector by using the decoding RNN, to obtain a sentence used for describing the image sample;
determining a matching degree between the sentence obtained through decoding and the image sample, further including:
identifying objects in the image sample by using an object detection model;
comparing the identified objects with object-representing words in the sentence obtained through decoding; and
determining the matching degree based on the comparison result and weights corresponding to the objects;
determining a smoothness degree of the sentence obtained through decoding, further including:
inputting the sentence obtained through decoding into the discriminative RNN, and obtaining a first output of the discriminative RNN at each time point corresponding to the sentence obtained through decoding; and
determining the smoothness degree of the sentence obtained through decoding according to the first output of the discriminative RNN at each time point;
adjusting weighting parameters of the decoding RNN to improve the matching degree and the smoothness degree; and
adjusting the discriminative RNN according to the first output of the discriminative RNN at each time point, further including:
inputting a smooth sentence sample into the discriminative RNN, and obtaining a second output of the discriminative RNN at each time point corresponding to the smooth sentence sample, wherein the smooth sentence sample is irrelevant to the sentence obtained through decoding;
determining a discrimination loss of the discriminative RNN according to the first output and the second output of the discriminative RNN at each time point; and
adjusting weighting parameters of the discriminative RNN to reduce the discrimination loss of the discriminative RNN.
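
The matching-degree steps of claim 1 (identifying objects, comparing them with the object-representing words in the decoded sentence, and applying per-object weights) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `matching_degree`, the normalized weighted-overlap formula, the default weight, and the sample data are all assumptions, and the object detection model is stood in for by a plain list of labels.

```python
import re


def matching_degree(detected_objects, sentence, object_weights, default_weight=1.0):
    """Weighted share of detected objects that the decoded sentence mentions.

    Assumption: the score is (sum of weights of mentioned objects) divided by
    (sum of weights of all detected objects). The claim only requires that the
    score depend on the comparison result and on the object weights.
    """
    # Lowercase word set of the sentence, with punctuation removed.
    words = set(re.findall(r"[a-z]+", sentence.lower()))

    total = 0.0
    matched = 0.0
    # sorted(set(...)) drops duplicate detections and keeps the sum deterministic.
    for obj in sorted(set(detected_objects)):
        weight = object_weights.get(obj, default_weight)
        total += weight
        if obj.lower() in words:
            matched += weight

    # No detected objects: nothing to match against, so return 0.
    return matched / total if total > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical detector output and weights (for example, detection confidences).
    objects = ["dog", "frisbee", "tree"]
    weights = {"dog": 0.9, "frisbee": 0.6, "tree": 0.5}
    print(matching_degree(objects, "A dog catches a frisbee.", weights))
```

In this sketch a higher score means the sentence names more of the heavily weighted objects, so it could serve as one of the two quantities that the decoding RNN is adjusted to improve. Exact single-word matching is a simplification; synonyms and multi-word object names are not handled.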
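
The smoothness degree and the discrimination loss recited in claim 1 both derive from the per-time-point outputs of the discriminative RNN. The sketch below assumes that each output is a probability in (0, 1) that the sentence read so far is smooth, that the smoothness degree is the mean log-probability over time points, and that the discrimination loss is a standard binary cross-entropy. None of these formulas is stated in the claim; the function names and sample values are hypothetical.

```python
import math

EPS = 1e-7  # keeps log() away from 0 and 1


def _clip(q):
    """Clamp a probability into [EPS, 1 - EPS]."""
    return min(max(q, EPS), 1.0 - EPS)


def smoothness_degree(first_outputs):
    """Mean log-probability of 'smooth' over the time points of the decoded sentence.

    first_outputs: discriminator outputs for the sentence obtained through decoding.
    Values closer to 0 (less negative) indicate a smoother sentence.
    """
    return sum(math.log(_clip(q)) for q in first_outputs) / len(first_outputs)


def discrimination_loss(first_outputs, second_outputs):
    """Binary cross-entropy over both sets of discriminator outputs (an assumption).

    first_outputs:  outputs for the decoded sentence (target: not smooth).
    second_outputs: outputs for the unrelated smooth sentence sample (target: smooth).
    """
    decoded_term = sum(math.log(1.0 - _clip(q)) for q in first_outputs) / len(first_outputs)
    smooth_term = sum(math.log(_clip(q)) for q in second_outputs) / len(second_outputs)
    return -(smooth_term + decoded_term)


if __name__ == "__main__":
    first = [0.4, 0.3, 0.2]    # hypothetical outputs for a decoded sentence
    second = [0.9, 0.8, 0.95]  # hypothetical outputs for a smooth sentence sample
    print(smoothness_degree(first))
    print(discrimination_loss(first, second))
```

Under these assumptions the two adjustments pull in opposite directions: the decoding RNN is updated so that `smoothness_degree` rises, while the discriminative RNN is updated so that `discrimination_loss` falls, which requires low outputs on decoded sentences and high outputs on smooth sentence samples.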