US 12,309,526 B2
Video image transmission method, device, interactive intelligent tablet and storage medium
Ming Yang, Guangdong (CN)
Assigned to Guangzhou Shiyuan Electronics Co., LTD., Guangzhou (CN); and Guangzhou Shizhen Information Technology Co., LTD., Guangzhou (CN)
Appl. No. 17/417,550
Filed by GUANGZHOU SHIYUAN ELECTRONICS CO., LTD., Guangdong (CN); and GUANGZHOU SHIZHEN INFORMATION TECHNOLOGY CO., LTD., Guangdong (CN)
PCT Filed Dec. 24, 2019, PCT No. PCT/CN2019/127770 § 371(c)(1), (2) Date Jun. 23, 2021, PCT Pub. No. WO2020/151443, PCT Pub. Date Jul. 30, 2020.
Claims priority of application No. 201910063004.7 (CN), filed on Jan. 23, 2019.
Prior Publication US 2022/0051024 A1, Feb. 17, 2022
Int. Cl. H04N 7/14 (2006.01); G06F 18/25 (2023.01); G06N 3/08 (2023.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G06V 40/10 (2022.01); G06V 40/16 (2022.01); H04N 7/15 (2006.01)

CPC H04N 7/147 (2013.01) [G06F 18/251 (2023.01); G06F 18/253 (2023.01); G06N 3/08 (2013.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 40/107 (2022.01); G06V 40/16 (2022.01); G06V 40/174 (2022.01); H04N 7/155 (2013.01)]

23 Claims

1. A video image transmission method, comprising:

acquiring a video image captured by a first video communication end;

determining an encoding mode, wherein the encoding mode comprises one of a preset object mode;

recognizing the preset object in the video image to obtain a sub-image of the preset object;

providing the sub-image of the preset object to a trained neural network, wherein the trained neural network comprises an encoder comprising a series of one or more convolution layers and a middle layer to sequentially process the sub-image, and wherein the one or more convolution layers comprise a lower convolution layer whose output is fed to the middle layer;

executing the trained neural network to output a part of feature vectors extracted from the lower convolution layer and a low-dimensional vector from the middle layer, the low-dimensional vector representing semantic information of the preset object in the video image; and

sending, through a communication network, the part of the feature vectors extracted from the lower convolution layer and the low-dimensional vector representing the semantic information to a second video communication end, wherein the semantic information is used by a decoder to reconstruct a reconstruction image of the video image at the second video communication end.
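
The sending-side steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a single lower convolution layer with four 3x3 kernels, a middle layer modeled as global average pooling followed by a linear projection, and random placeholder weights standing in for a trained network. All function names (`conv2d_valid`, `middle_layer`, `encode`, `pack_payload`, `unpack_payload`), the `keep` parameter, and the byte layout of the payload are hypothetical and chosen only for illustration; the claim does not specify them.

```python
import random
import struct


def conv2d_valid(image, kernel):
    """Valid 2-D cross-correlation followed by ReLU, on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            max(0.0, sum(image[r + i][c + j] * kernel[i][j]
                         for i in range(kh) for j in range(kw)))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def middle_layer(feature_maps, weights):
    """Assumed middle layer: average-pool each map, then project linearly."""
    pooled = [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]
    return [sum(w * p for w, p in zip(row, pooled)) for row in weights]


def encode(sub_image, kernels, mid_weights, keep=2):
    """Return part of the lower-layer feature maps and the low-dimensional vector."""
    feature_maps = [conv2d_valid(sub_image, k) for k in kernels]  # lower conv layer
    low_dim = middle_layer(feature_maps, mid_weights)             # fed to middle layer
    partial = feature_maps[:keep]                                 # only a part is kept
    return partial, low_dim


def pack_payload(partial, low_dim):
    """Serialize both outputs as little-endian float32 behind two uint32 counts."""
    flat = [v for fm in partial for row in fm for v in row]
    header = struct.pack("<II", len(low_dim), len(flat))
    body = struct.pack(f"<{len(low_dim) + len(flat)}f", *low_dim, *flat)
    return header + body


def unpack_payload(payload):
    """Receiver side: split the payload back into the two vectors."""
    n_low, n_feat = struct.unpack_from("<II", payload, 0)
    values = struct.unpack_from(f"<{n_low + n_feat}f", payload, 8)
    return list(values[:n_low]), list(values[n_low:])


# Placeholder data: an 8x8 grayscale sub-image and untrained random weights.
rng = random.Random(0)
sub_image = [[rng.random() for _ in range(8)] for _ in range(8)]
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(4)]
mid_weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]

partial, low_dim = encode(sub_image, kernels, mid_weights, keep=2)
payload = pack_payload(partial, low_dim)  # bytes handed to the network layer
```

In this sketch the payload carries 3 semantic values and 72 lower-layer feature values instead of the 64 pixels of the toy sub-image, so it demonstrates only the data flow, not any bandwidth saving; a real encoder would operate on far larger images where the low-dimensional vector is much smaller than the input. The decoder at the second video communication end is out of scope here; `unpack_payload` only shows how the two vectors could be recovered before reconstruction.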