| CPC G06V 10/82 (2022.01) [G06V 20/635 (2022.01)] | 22 Claims |

|
1. A method performed by one or more computers, the method comprising:
obtaining a set of one or more training images;
for each training image in the set:
processing the training image using a speaker neural network to generate a text caption for the training image;
generating a plurality of listener inputs, each listener input comprising (i) the text caption for the training image and (ii) a respective set of images, wherein the respective set of images includes a corresponding version of the training image and a corresponding set of one or more distractor images that are each different from the training image;
for each listener input, processing the listener input using a respective listener neural network from a set of one or more listener neural networks to generate a respective match score for each image in the respective set of images in the listener input; and
generating a reward for the training image based at least in part on the respective match scores for the training image generated by processing each of the listener inputs; and
training the speaker neural network using the rewards for the training images.
|
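The per-image steps of claim 1 (caption generation, construction of a plurality of listener inputs, match scoring, and reward generation) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in and not part of the claim: images are plain feature vectors, `toy_speaker` and `toy_listener` replace the speaker and listener neural networks, the "corresponding version" of the training image is taken to be the unmodified image, and the reward is assumed to be the mean softmax probability that the listeners assign to the training image. The claim itself only requires that the reward be based at least in part on the match scores for the training image.

```python
import math
from typing import Callable, List, Sequence

Image = Sequence[float]  # toy stand-in: an image is a small feature vector
Listener = Callable[[str, List[Image]], List[float]]  # one match score per image


def toy_speaker(image: Image) -> str:
    """Hypothetical speaker: the caption names the image's largest feature."""
    return f"feature-{max(range(len(image)), key=lambda i: image[i])}"


def toy_listener(caption: str, images: List[Image]) -> List[float]:
    """Hypothetical listener: score each image by its value at the captioned feature."""
    idx = int(caption.split("-")[1])
    return [img[idx] for img in images]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of match scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reward_for_image(
    image: Image,
    distractor_sets: List[List[Image]],
    listeners: List[Listener],
    speaker: Callable[[Image], str] = toy_speaker,
) -> float:
    """Reward for one training image, averaged over all listener inputs.

    Assumption: reward = mean probability mass placed on the training image.
    """
    caption = speaker(image)
    target_probs = []
    for listener, distractors in zip(listeners, distractor_sets):
        # Listener input: the caption plus the training image and its distractors.
        candidates = [image] + list(distractors)  # training image at index 0
        scores = listener(caption, candidates)
        target_probs.append(softmax(scores)[0])
    return sum(target_probs) / len(target_probs)


# Usage: one training image, two listener inputs with different distractor sets.
image = [0.1, 0.9, 0.0]
distractor_sets = [
    [[0.8, 0.1, 0.1]],
    [[0.2, 0.2, 0.6], [0.5, 0.3, 0.2]],
]
reward = reward_for_image(image, distractor_sets, [toy_listener, toy_listener])
print(f"caption={toy_speaker(image)!r} reward={reward:.4f}")
```

The final step of the claim, training the speaker neural network using the rewards, is left out of the sketch because the claim does not specify an update rule.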