US 12,112,538 B2
Systems and methods for improved video understanding
Anurag Arnab, Grenoble (FR); Mostafa Dehghani, Amsterdam (NL); Georg Heigold, Aachen (DE); Chen Sun, San Francisco, CA (US); Mario Lucic, Adliswil (CH); and Cordelia Luise Schmid, Saint-Ismier (FR)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jul. 8, 2021, as Appl. No. 17/370,522.
Prior Publication US 2023/0017072 A1, Jan. 19, 2023
Int. Cl. G06V 20/40 (2022.01); G06N 20/00 (2019.01)
CPC G06V 20/41 (2022.01) [G06N 20/00 (2019.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01)]
16 Claims
 
1. A computer-implemented method for classifying video data with improved accuracy, the method comprising:
obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames;
extracting, by the computing system, a plurality of video tokens from the video data, the plurality of video tokens comprising a representation of spatiotemporal information in the video data,
wherein extracting the plurality of video tokens from the video data comprises:
extracting, by the computing system, a plurality of video tubelets from the video data, the plurality of video tubelets respectively comprising a length and a width and spanning two or more video frames of the plurality of video frames;
projecting, by the computing system, the plurality of video tubelets to a plurality of tensor representations of the plurality of video tubelets; and
merging, by the computing system, the plurality of tensor representations along at least one dimension to produce the plurality of video tokens;
providing, by the computing system, the plurality of video tokens as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and
receiving, by the computing system, a classification output from the video understanding model.
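
The token-extraction steps recited above (extracting tubelets, projecting them, and merging the results into a token sequence) can be illustrated with a minimal NumPy sketch. It is one possible reading of the claim language, not the patented implementation. The function names `extract_tubelets` and `tubelets_to_tokens`, the non-overlapping tubelet grid, the single linear projection matrix, and the choice to merge by flattening the temporal and spatial grid dimensions are all assumptions made for illustration. The transformer encoder and the classification output are omitted.

```python
import numpy as np


def extract_tubelets(video, t, h, w):
    """Split a video of shape (T, H, W, C) into non-overlapping tubelets.

    Each tubelet spans t frames, h rows and w columns. The result has shape
    (T//t, H//h, W//w, t*h*w*C): one flattened tubelet per grid cell.
    Non-overlapping tubelets are an assumption of this sketch.
    """
    T, H, W, C = video.shape
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by the tubelet size")
    nt, nh, nw = T // t, H // h, W // w
    # Separate each axis into (grid index, offset inside the tubelet).
    x = video.reshape(nt, t, nh, h, nw, w, C)
    # Bring the grid indices to the front: (nt, nh, nw, t, h, w, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(nt, nh, nw, t * h * w * C)


def tubelets_to_tokens(tubelets, projection):
    """Project each tubelet, then merge the grid axes into one token axis.

    tubelets:   (nt, nh, nw, d_in)
    projection: (d_in, d_model), a hypothetical learned weight matrix
    returns:    (nt * nh * nw, d_model)
    """
    nt, nh, nw, _ = tubelets.shape
    embedded = tubelets @ projection          # (nt, nh, nw, d_model)
    return embedded.reshape(nt * nh * nw, -1)  # merge grid dims into a sequence


# Example: 8 frames of 32x32 RGB video, tubelets of 2 frames x 16 x 16 pixels.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32, 32, 3))
projection = rng.standard_normal((2 * 16 * 16 * 3, 64))  # d_model = 64 (arbitrary)

tokens = tubelets_to_tokens(extract_tubelets(video, 2, 16, 16), projection)
print(tokens.shape)  # (16, 64)
```

In this sketch the 4 x 2 x 2 grid of tubelets yields 16 tokens, each of which mixes information from two consecutive frames, so the resulting sequence carries both spatial and temporal content before it reaches any encoder.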