US 12,137,222 B2
Method and apparatus for video coding for machine vision
Wen Gao, West Windsor, NJ (US); Xiaozhong Xu, State College, PA (US); and Shan Liu, San Jose, CA (US)
Assigned to TENCENT AMERICA LLC, Palo Alto, CA (US)
Filed by Tencent America LLC, Palo Alto, CA (US)
Filed on Sep. 22, 2022, as Appl. No. 17/950,564.
Claims priority of provisional application 63/277,517, filed on Nov. 9, 2021.
Prior Publication US 2023/0144455 A1, May 11, 2023
Int. Cl. H04N 19/124 (2014.01); G06V 10/774 (2022.01); H04N 19/132 (2014.01); H04N 19/91 (2014.01)
CPC H04N 19/124 (2014.11) [G06V 10/774 (2022.01); H04N 19/132 (2014.11); H04N 19/91 (2014.11)]
16 Claims
 
1. A method for encoding video for machine vision and human/machine hybrid vision, the method being executed by one or more processors, the method comprising:
receiving, at a hybrid codec, an input including at least one of video or image data, the hybrid codec including a first codec and a second codec, wherein the first codec is a traditional codec designed for human consumption and the second codec is a learning-based codec designed for machine vision;
compressing the input using the first codec, wherein the compressing includes down-sampling the input using a down-sampling module and up-sampling the compressed input using an up-sampling module producing a residual signal;
quantizing the residual signal to obtain a quantized representation of the input;
entropy encoding the quantized representation of the input using one or more convolutional filter modules; and
training one or more networks using the entropy encoded quantized representation,
wherein the up-sampled compressed input is subtracted from the input to generate a second residual signal,
wherein the second residual signal is provided to the learning-based codec,
wherein the output of the second codec is added on top of the up-sampled compressed input to form the reconstructed video for machine vision tasks, and
wherein training the one or more networks using the entropy encoded quantized representation comprises determining a value of an index specifying which of the machine vision tasks is targeted by the training.
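
The two-layer signal flow recited in claim 1 (down-sample, compress with the first codec, up-sample, subtract to form the second residual, code that residual with the second codec, add the result back) can be illustrated with a minimal sketch on a 1-D signal. This is not the patented implementation. All function names, the averaging and sample-repetition resamplers, and the step sizes are assumptions made for illustration. Both codecs are replaced by plain uniform quantizers, so the learning-based codec, the entropy coding with convolutional filter modules, and the task-indexed training are omitted.

```python
def down_sample(x, factor=2):
    # Hypothetical down-sampling module: average each group of `factor` samples.
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def up_sample(x, factor=2):
    # Hypothetical up-sampling module: repeat each sample `factor` times.
    return [v for v in x for _ in range(factor)]


def quantize(x, step):
    # Uniform quantizer: map each value to an integer symbol.
    return [round(v / step) for v in x]


def dequantize(symbols, step):
    # Inverse of the uniform quantizer, up to rounding error.
    return [s * step for s in symbols]


def hybrid_encode(x, base_step=8.0, residual_step=1.0, factor=2):
    # First layer: down-sample, then "compress" with a coarse quantizer
    # (an assumed stand-in for the traditional codec).
    base_symbols = quantize(down_sample(x, factor), base_step)
    # Up-sample the decoded first-layer output back to the input resolution.
    base_recon = up_sample(dequantize(base_symbols, base_step), factor)
    # Second residual: input minus the up-sampled compressed input.
    residual = [a - b for a, b in zip(x, base_recon)]
    # Second layer: a fine quantizer standing in for the learning-based codec.
    residual_symbols = quantize(residual, residual_step)
    return base_symbols, residual_symbols


def hybrid_decode(base_symbols, residual_symbols,
                  base_step=8.0, residual_step=1.0, factor=2):
    base_recon = up_sample(dequantize(base_symbols, base_step), factor)
    residual_recon = dequantize(residual_symbols, residual_step)
    # Second-codec output is added on top of the up-sampled compressed input.
    return [b + r for b, r in zip(base_recon, residual_recon)]


signal = [10.0, 14.0, 30.0, 34.0, 50.0, 46.0, 20.0, 24.0]
base, res = hybrid_encode(signal)
reconstructed = hybrid_decode(base, res)
```

In this sketch the first layer alone yields a coarse reconstruction, and adding the decoded residual bounds the per-sample error by half of `residual_step`. The input length is assumed to be divisible by `factor`.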