US 12,278,969 B2
Codec rate distortion compensating downsampler
Christopher Richard Schroers, Uster (CH); Roberto Gerson de Albuquerque Azevedo, Zurich (CH); Nicholas David Gregory, Zurich (CH); Yuanyi Xue, Alameda, CA (US); Scott Labrozzi, Cary, NC (US); and Abdelaziz Djelouah, Zurich (CH)
Assigned to Disney Enterprises, Inc., Burbank, CA (US); and ETH Zurich (Eidgenössische Technische Hochschule Zürich), Zürich (CH)
Filed by Disney Enterprises, Inc., Burbank, CA (US); and ETH Zurich (EIDGENÖSSISCHE TECHNISCHE HOCHSCHULE ZURICH), Zürich (CH)
Filed on Aug. 4, 2023, as Appl. No. 18/230,409.
Application 18/230,409 is a continuation of application No. 17/500,373, filed on Oct. 13, 2021, granted, now Pat. No. 11,765,360.
Prior Publication US 2023/0379475 A1, Nov. 23, 2023
Int. Cl. H04N 19/147 (2014.01); G06N 3/08 (2023.01); G06T 3/4046 (2024.01); G06T 9/00 (2006.01); H04N 19/132 (2014.01); H04N 19/184 (2014.01)
CPC H04N 19/147 (2014.11) [G06N 3/08 (2013.01); G06T 3/4046 (2013.01); G06T 9/002 (2013.01); H04N 19/132 (2014.11); H04N 19/184 (2014.11)] 18 Claims
OG exemplary drawing
 
1. A video processing system comprising:
an upsampler;
a video codec;
a trained machine learning (ML) model-based video downsampler trained using a neural network-based (NN-based) proxy video codec; and
a processing hardware configured to:
receive an input video sequence having a first display resolution;
extract a content sample of the input video sequence;
map, using the trained ML model-based video downsampler, the content sample to a lower resolution sample;
transform, using one of the video codec or the NN-based proxy video codec, the lower resolution sample into a decoded sample bitstream;
predict, using the upsampler and the decoded sample bitstream, an output sample corresponding to the content sample; and
modify, based on the predicted output sample, one or more parameters of the trained ML model-based video downsampler;
wherein the ML model-based video downsampler is trained using the input video sequence, the output sample, and an objective function based on an estimated rate of the lower resolution sample and a plurality of perceptual loss functions.
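Claim 1 recites a closed training loop: the ML model-based downsampler maps a content sample to a lower resolution, a codec (or a differentiable NN-based proxy codec) produces a decoded sample and an estimated rate, the upsampler predicts an output sample at the original resolution, and the downsampler's parameters are modified using an objective that combines the estimated rate with a plurality of perceptual losses, informally L = λ·R(x_low) + Σ_i w_i·L_perc,i(ŷ, x). The following PyTorch-style sketch illustrates one plausible shape of that loop; the module names, the residual downsampler architecture, the (decoded, rate) interface assumed for the proxy codec, and the weights lambda_rate and w_perc are illustrative assumptions, not details taken from the patent.

```python
# Minimal sketch of the claimed training loop, under the assumptions noted above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedDownsampler(nn.Module):
    """ML model-based video downsampler: maps a content sample to a lower-resolution sample."""
    def __init__(self, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.refine = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x):
        # Residual around bicubic downscaling keeps the learned mapping close to a plain downsampler.
        base = F.interpolate(x, scale_factor=1 / self.scale, mode="bicubic", align_corners=False)
        return base + self.refine(base)

def train_step(downsampler, proxy_codec, upsampler, optimizer, frames,
               lambda_rate=0.01, w_perc=(1.0, 0.1)):
    """One optimization step. `proxy_codec` is assumed to be a differentiable module
    returning (decoded_sample, estimated_rate); `upsampler` maps the decoded sample
    back to the input resolution. Both interfaces are assumptions for illustration."""
    low_res = downsampler(frames)              # map the content sample to a lower resolution
    decoded, est_rate = proxy_codec(low_res)   # NN-based proxy codec: reconstruction + rate estimate
    predicted = upsampler(decoded)             # predict the output sample corresponding to the input

    # Objective: estimated rate of the lower-resolution sample plus several
    # perceptual-style losses between the predicted output and the original content.
    # L1/MSE stand in here for perceptual metrics such as LPIPS or a VGG feature loss.
    perc_losses = [F.l1_loss(predicted, frames), F.mse_loss(predicted, frames)]
    loss = lambda_rate * est_rate + sum(w * l for w, l in zip(w_perc, perc_losses))

    optimizer.zero_grad()
    loss.backward()      # gradients flow through the differentiable proxy codec
    optimizer.step()     # modify the parameters of the ML model-based downsampler
    return loss.item()
```

In use, the optimizer would be constructed over the downsampler's parameters only (e.g. torch.optim.Adam(downsampler.parameters())), reflecting that the claim modifies the downsampler based on the predicted output sample rather than retraining the codec itself.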