US 12,406,487 B2
Systems and methods for training machine-learned visual attention models
Xuhui Jia, Seattle, WA (US); Raviteja Vemulapalli, Seattle, WA (US); Bradley Ray Green, Bellevue, WA (US); Bardia Doosti, Bloomington, IN (US); Ching-Hui Chen, Shoreline, WA (US); and Yukon Zhu, Shoreline, WA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Appl. No. 18/006,078
Filed by Google LLC, Mountain View, CA (US)
PCT Filed Aug. 3, 2020, PCT No. PCT/US2020/044717 § 371(c)(1), (2) Date Jan. 19, 2023, PCT Pub. No. WO2022/031261, PCT Pub. Date Feb. 10, 2022.
Prior Publication US 2023/0281979 A1, Sep. 7, 2023
Int. Cl. G06V 10/82 (2022.01); G06V 10/776 (2022.01)

CPC G06V 10/82 (2022.01) [G06V 10/776 (2022.01)]

20 Claims

1. A computer-implemented method for training a machine-learned visual attention model, the method comprising:

obtaining, by a computing system comprising one or more computing devices, image data and an associated ground truth visual attention label, wherein the image data depicts at least a head of a person and an additional entity;

processing, by the computing system, the image data with an encoder portion of the machine-learned visual attention model to obtain a latent head encoding and a latent entity encoding;

processing, by the computing system, the latent head encoding and the latent entity encoding with the machine-learned visual attention model to obtain a visual attention value indicative of whether a visual attention of the person is focused on the additional entity;

processing, by the computing system, the latent head encoding and the latent entity encoding with a machine-learned three-dimensional visual location model to obtain a three-dimensional visual location estimation, wherein the three-dimensional visual location estimation comprises an estimated three-dimensional spatial location of the visual attention of the person;

evaluating, by the computing system, a loss function that evaluates a difference between the three-dimensional visual location estimation and a pseudo visual location label derived from the image data and a difference between the visual attention value and the ground truth visual attention label; and

respectively adjusting, by the computing system, one or more parameters of the machine-learned visual attention model and the machine-learned three-dimensional visual location model based at least in part on the loss function.
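
The two-term loss recited in the evaluating step can be illustrated with a minimal sketch. The claim does not specify the form of either difference term or how the terms are combined, so the following choices are assumptions made only for illustration: binary cross-entropy for the visual attention term, squared Euclidean distance for the three-dimensional location term, and a hypothetical `location_weight` factor that balances the two. The function names are likewise hypothetical and do not come from the patent.

```python
import math
from typing import Sequence


def attention_loss(attention_value: float, attention_label: float,
                   eps: float = 1e-7) -> float:
    """Difference between the visual attention value and the ground truth label.

    Assumption: binary cross-entropy, with the prediction clamped away
    from 0 and 1 so the logarithms stay finite.
    """
    p = min(max(attention_value, eps), 1.0 - eps)
    return -(attention_label * math.log(p)
             + (1.0 - attention_label) * math.log(1.0 - p))


def location_loss(location_estimate: Sequence[float],
                  pseudo_location_label: Sequence[float]) -> float:
    """Difference between the 3D location estimate and the pseudo label.

    Assumption: squared Euclidean distance over (x, y, z).
    """
    if len(location_estimate) != 3 or len(pseudo_location_label) != 3:
        raise ValueError("both locations must have exactly three coordinates")
    return sum((e - t) ** 2
               for e, t in zip(location_estimate, pseudo_location_label))


def combined_loss(attention_value: float, attention_label: float,
                  location_estimate: Sequence[float],
                  pseudo_location_label: Sequence[float],
                  location_weight: float = 1.0) -> float:
    """Sum of both difference terms; location_weight is a hypothetical knob."""
    return (attention_loss(attention_value, attention_label)
            + location_weight * location_loss(location_estimate,
                                              pseudo_location_label))


# Usage: an uncertain attention prediction (0.5) for a positive label, and a
# location estimate that is off by 2 units along z from the pseudo label.
loss = combined_loss(0.5, 1.0, (1.0, 2.0, 3.0), (1.0, 2.0, 5.0),
                     location_weight=0.5)
```

In a training loop, the gradient of a scalar such as `loss` would drive the parameter adjustment of both models recited in the final step. Because both heads consume the same latent head and entity encodings, the location term can also influence the shared encoder even though its target is a pseudo label derived from the image data and not a ground truth annotation.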