US 12,406,496 B2
Anticipative video transformer model for future action anticipation
Rohit Girdhar, Jersey City, NJ (US); and Kristen Lorraine Grauman, Austin, TX (US)
Filed by Meta Platforms, Inc., Menlo Park, CA (US)
Filed on May 25, 2022, as Appl. No. 17/824,402.
Prior Publication US 2023/0386203 A1, Nov. 30, 2023
Int. Cl. G06V 20/40 (2022.01); G06V 10/62 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 20/10 (2022.01)

CPC G06V 20/41 (2022.01) [G06V 10/62 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 20/10 (2022.01); G06V 20/46 (2022.01)]

20 Claims

1. A method, implemented by a computing system, comprising:

receiving a video comprising a plurality of image frames;

generating, for the plurality of image frames and using a spatial-attention encoder, one or more image-frame features corresponding to one or more image frames of the plurality of image frames;

for the one or more image-frame features, generating, using a temporal-attention decoder, a predicted future feature based on the one or more image-frame features corresponding to the one or more image frames that precede a time associated with the predicted future feature; and

generating a video representation of a future action anticipation based on the predicted future feature, wherein the future action anticipation corresponds to an anticipation of a future action occurring after a sequence of actions observed in the plurality of image frames in the video, and wherein the video representation is configured for display on a user interface.
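
The encoder and decoder steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes each image frame is already split into patch vectors, uses plain dot-product attention with no learned weights, and pools patches by averaging. The helper names (`attend`, `encode_frame`, `predict_future_features`) are hypothetical. The only property it is meant to show is that the predicted future feature at time t depends solely on image-frame features at or before t.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def encode_frame(patches):
    """Assumed spatial-attention encoder: self-attention across the
    patches of one frame, then mean pooling into one frame feature."""
    attended = [attend(p, patches, patches) for p in patches]
    dim = len(patches[0])
    return [sum(a[d] for a in attended) / len(attended) for d in range(dim)]


def predict_future_features(frame_features):
    """Assumed temporal-attention decoder: position t attends only to
    frame features 0..t (a causal mask), giving one predicted future
    feature per observed frame."""
    predictions = []
    for t, query in enumerate(frame_features):
        past = frame_features[: t + 1]  # later frames are excluded
        predictions.append(attend(query, past, past))
    return predictions


# Toy video: 3 frames, each with 2 patches of dimension 2 (made-up values).
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
frame_features = [encode_frame(frame) for frame in video]
future_features = predict_future_features(frame_features)
```

In a full system, the last entry of `future_features` would be passed to a classifier or other head to produce the future action anticipation for display. That step is omitted here because the claim text shown does not specify its form.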