US 12,309,404 B2
Contextual video compression framework with spatial-temporal cross-covariance transformers
Zhenghao Chen, Sydney (AU); Roberto Gerson De Albuquerque Azevedo, Zürich (CH); Christopher Richard Schroers, Uster (CH); Yang Zhang, Dübendorf (CH); and Lucas Relic, Zürich (CH)
Assigned to Disney Enterprises, Inc., Burbank, CA (US); and ETH Zürich (Eidgenössische Technische Hochschule Zürich), Zürich (CH)
Filed by Disney Enterprises, Inc., Burbank, CA (US); and ETH Zürich (Eidgenössische Technische Hochschule Zürich), Zürich (CH)
Filed on Jul. 7, 2023, as Appl. No. 18/349,076.
Claims priority of provisional application 63/488,944, filed on Mar. 7, 2023.
Prior Publication US 2024/0305801 A1, Sep. 12, 2024
Int. Cl. H04N 7/12 (2006.01); H04N 19/172 (2014.01); H04N 19/42 (2014.01); H04N 19/91 (2014.01)
CPC H04N 19/42 (2014.11) [H04N 19/172 (2014.11); H04N 19/91 (2014.11)] 20 Claims
OG exemplary drawing
 
1. A system comprising:
a first component to extract temporal features from a current frame being coded and a previous frame of a video, wherein three-dimensional based joint features are determined using the temporal features and spatial features from the current frame;
a second component that uses a first transformer to receive the three-dimensional based joint features as input and fuse the spatial features from the current frame with the temporal features to generate spatio-temporal features as a first output;
a third component that uses a second transformer to perform entropy coding using the first output and at least a portion of the temporal features to generate a second output, wherein the second transformer is used to fuse the spatio-temporal features with the at least a portion of the temporal features to output fused spatio-temporal features that are entropy encoded to generate the second output; and
a fourth component that uses a third transformer to reconstruct the current frame, wherein the first output is processed using the second output to generate a third output, and wherein the third transformer fuses the temporal features with the third output.
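The four claimed components form a pipeline: temporal features are extracted from the current and previous frames, fused with spatial features by a first transformer, conditioned into an entropy model by a second transformer, and combined with the decoded output by a third transformer for reconstruction. The following is a minimal NumPy sketch of that data flow, not the patented implementation: the feature extractor, the learned projection `W_joint`, the toy sizes, and the unit-Gaussian entropy model are all illustrative stand-ins, and the transformers are reduced to a single cross-covariance ("channel") attention step of the kind the title refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 16, 8                          # toy sizes: tokens per frame, channels

def xca(x_q, x_kv):
    """Simplified cross-covariance attention: the attention map is a
    C x C matrix over channels (from the covariance of L2-normalized
    queries and keys), not an N x N matrix over tokens."""
    q = x_q / (np.linalg.norm(x_q, axis=0, keepdims=True) + 1e-6)
    k = x_kv / (np.linalg.norm(x_kv, axis=0, keepdims=True) + 1e-6)
    a = q.T @ k                                    # (C, C) channel covariance
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)             # softmax over channels
    return x_kv @ a.T                              # re-mixed values, (N, C)

# first component: temporal features from current + previous frame
cur, prev = rng.normal(size=(N, C)), rng.normal(size=(N, C))
temporal = cur - prev                  # stand-in for a learned extractor
spatial = cur                          # spatial features of the current frame
W_joint = rng.normal(size=(2 * C, C)) / np.sqrt(2 * C)  # hypothetical learned projection
joint3d = np.concatenate([spatial, temporal], axis=-1) @ W_joint  # "3D joint features"

# second component: first transformer fuses spatial with temporal -> first output
spatio_temporal = xca(joint3d, temporal)

# third component: second transformer fuses the first output with (part of)
# the temporal features; the fused result is quantized and entropy coded.
fused = xca(spatio_temporal, temporal)
latent_q = np.round(fused)             # quantization before entropy coding
# estimated bit cost under a unit-Gaussian prior (stand-in for a learned model)
p = np.clip(np.exp(-0.5 * latent_q ** 2) / np.sqrt(2 * np.pi), 1e-9, None)
bits = float(np.sum(-np.log2(p)))

# fourth component: "first output processed using the second output" -> third
# output, which the third transformer fuses with the temporal features.
third_out = spatio_temporal + latent_q
recon = xca(temporal, third_out)
print(recon.shape, bits > 0)
```

The key design point the title highlights is that cross-covariance attention scales with the channel count rather than the token count, which keeps attention affordable on dense per-pixel video features.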