US 12,437,528 B2
Training a speaker neural network using one or more listener neural networks
Aaditya K. Singh, McLean, VA (US); Fengning Ding, London (GB); Felix George Hill, London (GB); and Andrew Kyle Lampinen, Palo Alto, CA (US)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on May 19, 2023, as Appl. No. 18/199,896.
Claims priority of provisional application 63/343,960, filed on May 19, 2022.
Prior Publication US 2023/0401835 A1, Dec. 14, 2023
Int. Cl. G06V 10/82 (2022.01); G06V 20/62 (2022.01)

CPC G06V 10/82 (2022.01) [G06V 20/635 (2022.01)]

22 Claims

1. A method performed by one or more computers, the method comprising:

obtaining a set of one or more training images;

for each training image in the set:

processing the training image using a speaker neural network to generate a text caption for the training image;

generating a plurality of listener inputs, each listener input comprising (i) the text caption for the training image and (ii) a respective set of images, wherein the respective set of images includes a corresponding version of the training image and a corresponding set of one or more distractor images that are each different from the training image;

for each listener input, processing the listener input using a respective listener neural network from a set of one or more listener neural networks to generate a respective match score for each image in the respective set of images in the listener input; and

generating a reward for the training image based at least in part on the respective match scores for the training image generated by processing each of the listener inputs; and

training the speaker neural network using the rewards for the training images.
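
The reward-generation steps of claim 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: images are stood in for by short descriptive strings, `overlap_listener` is a hypothetical toy listener that scores word overlap, the "corresponding version" of the training image is the unmodified image, and the reward is taken to be the mean log-probability that the listeners assign to the training image after a softmax over their match scores. The claim only requires that the reward be based at least in part on the match scores for the training image.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical listener signature: (caption, images) -> one match score per image.
Listener = Callable[[str, Sequence[str]], List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of match scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def overlap_listener(caption: str, images: Sequence[str]) -> List[float]:
    """Toy listener (assumption): score = number of caption words found in the image string."""
    words = set(caption.lower().split())
    return [float(len(words & set(img.lower().split()))) for img in images]


def reward_for_image(
    caption: str,
    training_image: str,
    distractor_sets: Sequence[Sequence[str]],
    listeners: Sequence[Listener],
) -> float:
    """Build one listener input per listener and average the log-probability
    each listener assigns to the training image (assumed reward definition)."""
    log_probs = []
    for listener, distractors in zip(listeners, distractor_sets):
        # Listener input: the caption plus the training image and its distractors.
        images = [training_image] + list(distractors)  # training image at index 0
        scores = listener(caption, images)
        log_probs.append(math.log(softmax(scores)[0]))
    return sum(log_probs) / len(log_probs)


# Two listener inputs, each with its own distractor set.
listeners = [overlap_listener, overlap_listener]
distractor_sets = [["blue cube"], ["green ball"]]
good = reward_for_image("red ball", "red ball on grass", distractor_sets, listeners)
bad = reward_for_image("blue cube", "red ball on grass", distractor_sets, listeners)
```

In this sketch a caption that lets the listeners pick out the training image (`good`) receives a higher reward than one that points to a distractor (`bad`). The resulting per-image rewards would then drive the speaker update, for example through a policy-gradient step; that choice of update rule is likewise an assumption.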