| CPC G06V 10/82 (2022.01) [G06V 10/776 (2022.01)] | 20 Claims |

|
1. A computer-implemented method for training a machine-learned visual attention model, the method comprising:
obtaining, by a computing system comprising one or more computing devices, image data and an associated ground truth visual attention label, wherein the image data depicts at least a head of a person and an additional entity;
processing, by the computing system, the image data with an encoder portion of the machine-learned visual attention model to obtain a latent head encoding and a latent entity encoding;
processing, by the computing system, the latent head encoding and the latent entity encoding with the machine-learned visual attention model to obtain a visual attention value indicative of whether a visual attention of the person is focused on the additional entity;
processing, by the computing system, the latent head encoding and the latent entity encoding with a machine-learned three-dimensional visual location model to obtain a three-dimensional visual location estimation, wherein the three-dimensional visual location estimation comprises an estimated three-dimensional spatial location of the visual attention of the person;
evaluating, by the computing system, a loss function that evaluates a difference between the three-dimensional visual location estimation and a pseudo visual location label derived from the image data and a difference between the visual attention value and the ground truth visual attention label; and
respectively adjusting, by the computing system, one or more parameters of the machine-learned visual attention model and the machine-learned three-dimensional visual location model based at least in part on the loss function.
|
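As an illustration of the loss-evaluation step recited in claim 1, the following is a minimal sketch of a loss that combines the two differences the claim names. The claim does not specify the form of either term, so the choices here are assumptions made only for illustration: binary cross-entropy for the visual attention value, squared Euclidean distance for the three-dimensional visual location estimation, and a scalar `location_weight` balancing the two. All function and parameter names are hypothetical.

```python
import math


def attention_loss(attention_value, ground_truth_label):
    """Binary cross-entropy between the predicted visual attention value
    (a probability in [0, 1]) and the ground truth label (0 or 1).
    Assumption: the claim does not name a specific classification loss."""
    eps = 1e-7  # clamp to avoid log(0)
    p = min(max(attention_value, eps), 1.0 - eps)
    return -(ground_truth_label * math.log(p)
             + (1 - ground_truth_label) * math.log(1.0 - p))


def location_loss(location_estimate, pseudo_label):
    """Squared Euclidean distance between the estimated 3D visual location
    and the pseudo visual location label derived from the image data.
    Assumption: the claim does not name a specific distance measure."""
    return sum((a - b) ** 2 for a, b in zip(location_estimate, pseudo_label))


def combined_loss(attention_value, ground_truth_label,
                  location_estimate, pseudo_label, location_weight=1.0):
    """Sum of both terms; location_weight is a hypothetical balancing factor."""
    return (attention_loss(attention_value, ground_truth_label)
            + location_weight * location_loss(location_estimate, pseudo_label))


if __name__ == "__main__":
    # Person is attending to the entity (label 1); the model predicts 0.8.
    loss = combined_loss(0.8, 1, (0.1, 0.2, 1.5), (0.0, 0.2, 1.4))
    print(f"combined loss: {loss:.4f}")
```

In a training loop, the gradient of such a combined value with respect to the parameters of both models would drive the parameter adjustment recited in the final step; the sketch stops at the scalar loss because the claim leaves the optimizer unspecified.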