US 12,283,087 B2
Model training method, media information synthesis method, and related apparatuses
Haozhi Huang, Shenzhen (CN); Jiawei Li, Shenzhen (CN); Li Shen, Shenzhen (CN); Yonggen Ling, Shenzhen (CN); Wei Liu, Shenzhen (CN); and Dong Yu, Bothell, WA (US)
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, Shenzhen (CN)
Filed by TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, Shenzhen (CN)
Filed on Dec. 1, 2020, as Appl. No. 17/109,072.
Application 17/109,072 is a continuation of application No. PCT/CN2020/113118, filed on Sep. 3, 2020.
Claims priority of application No. 201911140015.7 (CN), filed on Nov. 19, 2019.
Prior Publication US 2021/0152751 A1, May 20, 2021
Int. Cl. G06N 20/00 (2019.01); G06F 18/21 (2023.01); G06F 18/214 (2023.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 40/10 (2022.01); G06V 40/16 (2022.01); H04N 5/265 (2006.01)
CPC G06V 10/764 (2022.01) [G06F 18/2148 (2023.01); G06F 18/217 (2023.01); G06N 20/00 (2019.01); G06V 10/7747 (2022.01); G06V 40/10 (2022.01); G06V 40/171 (2022.01); H04N 5/265 (2013.01)]
17 Claims
1. A model training method, the method comprising:
obtaining an image sample set and brief-prompt information, the image sample set comprising at least one image sample, the brief-prompt information representing key-point information of a to-be-trained object in the at least one image sample, wherein the at least one image sample includes a plurality of consecutive image samples, and the plurality of consecutive image samples are used for forming a video sample;
generating a content mask set according to the image sample set and the brief-prompt information, the content mask set comprising at least one content mask, the at least one content mask being obtained by extending outward a region identified according to the brief-prompt information in the at least one image sample;
generating a to-be-trained image set according to the content mask set, the to-be-trained image set comprising at least one to-be-trained image, the at least one to-be-trained image being in correspondence to the at least one image sample;
obtaining, based on the image sample set and the to-be-trained image set, a predicted image set through a to-be-trained information synthesis model, the predicted image set comprising at least one predicted image, the at least one predicted image being in correspondence to the at least one image sample; and
training, based on the predicted image set and the image sample set, the to-be-trained information synthesis model by using a target loss function, to obtain an information synthesis model, comprising:
determining a first loss function according to N frames of predicted images in the predicted image set, N frames of to-be-trained images in the to-be-trained image set, and N frames of image samples in the image sample set, N being an integer greater than 1, wherein the first loss function is determined based on an output of a generator of the to-be-trained information synthesis model when inputting a superposition of (N-1) frames of to-be-trained images and an Nth frame of to-be-trained image to the generator;
determining a second loss function according to N frames of predicted images in the predicted image set and N frames of image samples in the image sample set;
determining the target loss function according to the first loss function and the second loss function;
iteratively updating a model parameter of the to-be-trained information synthesis model according to the target loss function; and
generating, in a case that an iteration end condition is satisfied, the information synthesis model according to the model parameter of the to-be-trained information synthesis model.
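
The data-preparation and loss-combination steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation; it rests on several assumptions that the claim does not state: the region identified by the key points is taken to be their bounding box, "extending outward" is modeled as a fixed pixel margin, the to-be-trained image is formed by blanking the masked region of the image sample, the "superposition" of frames is modeled as channel-wise concatenation, and the target loss is a weighted sum of the two losses. All function names and parameters (`content_mask`, `margin`, `to_be_trained_image`, `superpose`, `target_loss`, `weight`) are hypothetical.

```python
import numpy as np


def content_mask(shape, keypoints, margin):
    """Binary mask covering the key-point bounding box, extended outward.

    Assumption: the region is the bounding box of the (x, y) key points and
    the outward extension is a fixed `margin` in pixels.
    """
    h, w = shape
    pts = np.asarray(keypoints)
    x0 = max(int(pts[:, 0].min()) - margin, 0)
    y0 = max(int(pts[:, 1].min()) - margin, 0)
    x1 = min(int(pts[:, 0].max()) + margin, w - 1)
    y1 = min(int(pts[:, 1].max()) + margin, h - 1)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[y0:y1 + 1, x0:x1 + 1] = 1.0
    return mask


def to_be_trained_image(image, mask):
    """Blank out the masked region of an (H, W, C) image sample.

    Assumption: covering means zeroing the pixels inside the content mask.
    """
    return image * (1.0 - mask[..., None])


def superpose(frames):
    """Stack frames along the channel axis as a stand-in for superposition.

    Assumption: the generator input for the Nth frame is the channel-wise
    concatenation of the N to-be-trained frames.
    """
    return np.concatenate(frames, axis=-1)


def target_loss(first_loss, second_loss, weight=1.0):
    """Combine the two losses; a weighted sum is assumed, not claimed."""
    return first_loss + weight * second_loss
```

For example, with two key points at (3, 3) and (4, 4) on an 8 x 8 image and a margin of 1, the mask covers the 4 x 4 block spanning pixels 2 through 5 in each direction, and `to_be_trained_image` zeroes that block while leaving the rest of the image sample unchanged.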