US 11,743,551 B2
Video caption generating method and apparatus, device, and storage medium
Wenjie Pei, Shenzhen (CN); Jiyuan Zhang, Shenzhen (CN); Lei Ke, Shenzhen (CN); Yuwing Tai, Shenzhen (CN); Xiaoyong Shen, Shenzhen (CN); Jiaya Jia, Shenzhen (CN); and Xiangrong Wang, Shenzhen (CN)
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, Shenzhen (CN)
Filed by TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, Shenzhen (CN)
Filed on May 24, 2021, as Appl. No. 17/328,970.
Application 17/328,970 is a continuation of application No. PCT/CN2020/081721, filed on Mar. 27, 2020.
Claims priority of application No. 201910325193.0 (CN), filed on Apr. 22, 2019.
Prior Publication US 2021/0281774 A1, Sep. 9, 2021
Int. Cl. G06K 9/36 (2006.01); G06K 9/46 (2006.01); H04N 21/488 (2011.01); H04N 5/278 (2006.01); G06V 20/40 (2022.01); G06F 18/22 (2023.01); G06F 18/28 (2023.01); G06F 18/25 (2023.01); G06V 10/75 (2022.01); G06V 10/772 (2022.01); G06V 20/62 (2022.01); H04N 21/234 (2011.01); H04N 21/235 (2011.01); H04N 21/435 (2011.01); H04N 21/8549 (2011.01)
CPC H04N 21/4884 (2013.01) [G06F 18/22 (2023.01); G06F 18/253 (2023.01); G06F 18/28 (2023.01); G06V 10/75 (2022.01); G06V 10/772 (2022.01); G06V 20/41 (2022.01); G06V 20/47 (2022.01); G06V 20/635 (2022.01); H04N 5/278 (2013.01); H04N 21/235 (2013.01); H04N 21/23418 (2013.01); H04N 21/435 (2013.01); H04N 21/488 (2013.01); H04N 21/8549 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A video caption generating method, performed by a computer device, the method comprising:
encoding a target video by using an encoder of a video caption generating model, to obtain a target visual feature of the target video;
decoding the target visual feature by using a basic decoder of the video caption generating model, to obtain a first selection probability corresponding to each candidate word of a plurality of candidate words;
decoding the target visual feature of the target video by using an auxiliary decoder of the video caption generating model, to obtain a second selection probability corresponding to the each candidate word, wherein a memory of the auxiliary decoder stores reference visual context information corresponding to the each candidate word, and the reference visual context information has been generated according to at least one related video corresponding to the each candidate word;
determining a decoded word from the plurality of candidate words according to the first selection probability and the second selection probability of the each candidate word; and
generating a video caption corresponding to the target video according to the decoded word.
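To ground the steps recited in claim 1, the following is a minimal NumPy sketch of one way the two-decoder arrangement could be wired together. It is not the patented implementation: the names (MemoryAuxiliaryDecoder, generate_caption, fusion_weight), the dot-product similarity used to score the stored reference visual-context vectors, and the fixed-weight averaging of the two probability distributions are all illustrative assumptions. Claim 1 only requires that a first and a second selection probability are obtained and that the decoded word is determined from both.

```python
# Hypothetical sketch of the claimed decoding flow (assumptions noted above):
# a basic decoder and a memory-based auxiliary decoder each assign a selection
# probability to every candidate word, the two distributions are fused, and
# the caption is assembled from the fused argmax at each step.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MemoryAuxiliaryDecoder:
    """Holds one reference visual-context vector per candidate word,
    prepared in advance from related videos containing that word."""
    def __init__(self, reference_contexts):          # shape: (vocab_size, dim)
        self.memory = reference_contexts

    def second_probabilities(self, visual_feature):  # shape: (dim,)
        # Similarity between the target video's visual feature and each
        # word's stored reference context serves as that word's score
        # (an assumed scoring rule, not specified by claim 1).
        scores = self.memory @ visual_feature
        return softmax(scores)

def generate_caption(visual_feature, basic_decoder_logits_fn,
                     aux_decoder, vocab, fusion_weight=0.5, max_len=10):
    caption, prev_word = [], "<bos>"
    for _ in range(max_len):
        p1 = softmax(basic_decoder_logits_fn(visual_feature, prev_word))  # first selection probability
        p2 = aux_decoder.second_probabilities(visual_feature)             # second selection probability
        fused = fusion_weight * p1 + (1.0 - fusion_weight) * p2
        word = vocab[int(np.argmax(fused))]                               # decoded word
        if word == "<eos>":
            break
        caption.append(word)
        prev_word = word
    return " ".join(caption)

# Toy usage with random placeholders standing in for trained components.
vocab = ["<bos>", "a", "dog", "runs", "<eos>"]
dim = 8
rng = np.random.default_rng(0)
aux = MemoryAuxiliaryDecoder(rng.normal(size=(len(vocab), dim)))
logits_fn = lambda feat, prev: rng.normal(size=len(vocab))  # stand-in basic decoder
print(generate_caption(rng.normal(size=dim), logits_fn, aux, vocab))
```

In this sketch the auxiliary decoder's memory is read-only at decode time, reflecting the claim's statement that the reference visual context information "has been generated" from related videos before decoding; only the fusion of the two probabilities happens online.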