| CPC G06V 20/41 (2022.01) [G06V 10/62 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 20/10 (2022.01); G06V 20/46 (2022.01)] | 20 Claims |

|
1. A method, implemented by a computing system, comprising:
receiving a video comprising a plurality of image frames;
generating, for the plurality of image frames and using a spatial-attention encoder, one or more image-frame features corresponding to one or more image frames of the plurality of image frames;
for the one or more image-frame features, generating, using a temporal-attention decoder, a predicted future feature based on the one or more image-frame features corresponding to the one or more image frames that precede a time associated with the predicted future feature; and
generating a video representation of a future action anticipation based on the predicted future feature, wherein the future action anticipation corresponds to an anticipation of a future action occurring after a sequence of actions observed in the plurality of image frames in the video, and wherein the video representation is configured for display on a user interface.
|
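The data flow recited in claim 1 (per-frame spatial attention, then temporal attention restricted to preceding frames, then an anticipation derived from the predicted future feature) can be illustrated with a minimal sketch. It is not the claimed or patented implementation. Every name in it (`attend`, `encode_frame`, `predict_future_features`, `anticipate_action`, `class_prototypes`) is hypothetical. The attention uses identity projections rather than learned weights, mean pooling stands in for whatever pooling an actual encoder would use, and the display step of the claim is omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Assumption: identity projections (no learned Q/K/V weights).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def encode_frame(patches):
    """Spatial attention: patches of ONE frame attend to each other,
    then are mean-pooled into a single image-frame feature."""
    attended = [attend(p, patches, patches) for p in patches]
    dim = len(patches[0])
    return [sum(a[d] for a in attended) / len(attended) for d in range(dim)]


def predict_future_features(frame_features):
    """Causal temporal attention: the output at index t attends only to
    frame features 0..t and is read as the predicted feature for time t+1."""
    return [
        attend(frame_features[t], frame_features[: t + 1], frame_features[: t + 1])
        for t in range(len(frame_features))
    ]


def anticipate_action(future_feature, class_prototypes):
    """Score the predicted feature against hypothetical per-action prototypes.

    Returns (index of the highest-scoring action, probability list).
    """
    scores = softmax([dot(future_feature, p) for p in class_prototypes])
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy video: 5 frames, 3 patches per frame, 4-dim patch vectors.
    video = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
             for _ in range(5)]
    frame_features = [encode_frame(frame) for frame in video]
    predicted = predict_future_features(frame_features)
    prototypes = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
    action, probs = anticipate_action(predicted[-1], prototypes)
    print("anticipated action index:", action)
```

Slicing with `frame_features[: t + 1]` plays the role a causal attention mask would play in a batched implementation: each predicted feature depends only on frames at or before its own position, so it never sees frames at or after the time it predicts.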