US 12,094,208 B2
Video classification method, electronic device and storage medium
Hu Yang, Beijing (CN); Feng He, Beijing (CN); Qi Wang, Beijing (CN); Zhifan Feng, Beijing (CN); Chunguang Chai, Beijing (CN); and Yong Zhu, Beijing (CN)
Assigned to BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed by BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed on Oct. 15, 2021, as Appl. No. 17/502,173.
Claims priority of application No. 202110244368.2 (CN), filed on Mar. 5, 2021.
Prior Publication US 2022/0284218 A1, Sep. 8, 2022
Int. Cl. G06K 9/62 (2022.01); G06F 18/214 (2023.01); G06F 18/241 (2023.01); G06F 18/25 (2023.01); G06N 20/00 (2019.01); G06V 10/22 (2022.01); G06V 10/40 (2022.01); G06V 10/70 (2022.01); G06V 10/764 (2022.01); G06V 10/80 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G06V 20/62 (2022.01); G06V 20/70 (2022.01); G10L 15/08 (2006.01); G06N 3/08 (2023.01); G06V 30/10 (2022.01)
CPC G06V 20/46 (2022.01) [G06F 18/214 (2023.01); G06F 18/241 (2023.01); G06F 18/253 (2023.01); G06N 20/00 (2019.01); G06V 10/22 (2022.01); G06V 10/40 (2022.01); G06V 10/764 (2022.01); G06V 10/768 (2022.01); G06V 10/806 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/635 (2022.01); G06V 20/70 (2022.01); G10L 15/08 (2013.01); G06N 3/08 (2013.01); G06V 30/10 (2022.01)]
16 Claims
 
1. A video classification method, comprising:
extracting a keyword in a video according to multi-modal information of the video;
acquiring background knowledge corresponding to the keyword, and determining a text to be recognized according to the keyword and the background knowledge; and
classifying the text to be recognized to obtain a class of the video,
wherein the extracting a keyword in a video according to multi-modal information of the video comprises:
performing feature extraction on each piece of modal information in the multi-modal information, so as to obtain features corresponding to each piece of modal information;
fusing the features corresponding to each piece of modal information to obtain a fused feature; and
performing a word labeling according to the fused feature in the video to determine the keyword in the video,
wherein the multi-modal information comprises text content and visual information, the visual information comprises first visual information and second visual information, the first visual information is visual information corresponding to a text in a video frame in the video, the second visual information is a key frame in the video, and the performing feature extraction on each piece of modal information in the multi-modal information, so as to obtain features corresponding to each piece of modal information comprises:
performing a first text encoding operation on the text content to obtain a text feature;
performing a second text encoding operation on the first visual information to obtain a first visual feature; and
performing an image encoding operation on the second visual information to obtain a second visual feature.
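
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not specify any encoder, fusion operator, labeling model, knowledge source, or classifier, so everything below is a hypothetical stand-in chosen for illustration. The assumptions are toy hand-written features in place of learned encoders, concatenation as the fusion step, a thresholded linear score as the word labeling, a small in-memory dictionary (`KNOWLEDGE`) as the background knowledge, and term overlap (`CLASS_TERMS`) as the text classifier. All function names, weights, and sample data are invented here.

```python
# Hypothetical background-knowledge store; the claim does not say where it comes from.
KNOWLEDGE = {
    "dumpling": "dumpling is a food made of dough wrapped around a filling",
}

# Hypothetical class vocabulary for the toy text classifier.
CLASS_TERMS = {
    "food": {"food", "dough", "filling", "recipe"},
    "sports": {"match", "team", "goal"},
}


def encode_text(tokens):
    """First text encoding: one toy feature vector per token of the text content."""
    return [[len(t) / 10.0, 1.0 if t.isalpha() else 0.0] for t in tokens]


def encode_frame_text(frame_tokens, tokens):
    """Second text encoding: per-token feature from text shown in video frames."""
    seen = set(frame_tokens)
    return [[1.0 if t in seen else 0.0] for t in tokens]


def encode_key_frame(pixels):
    """Image encoding: toy global feature, the normalized mean intensity."""
    flat = [p for row in pixels for p in row]
    return [sum(flat) / (255.0 * len(flat))]


def fuse(text_feats, frame_text_feats, key_frame_feat):
    """Fusion by concatenation (an assumption): text + frame-text + key-frame."""
    return [t + f + key_frame_feat for t, f in zip(text_feats, frame_text_feats)]


def label_words(tokens, fused, weights, threshold=1.0):
    """Word labeling: keep tokens whose linear score reaches the threshold."""
    keywords = []
    for token, feat in zip(tokens, fused):
        score = sum(w * x for w, x in zip(weights, feat))
        if score >= threshold:
            keywords.append(token)
    return keywords


def build_text(keywords):
    """Text to be recognized: each keyword followed by its background knowledge."""
    return " ".join(f"{k}. {KNOWLEDGE.get(k, '')}".strip() for k in keywords)


def classify(text):
    """Toy classifier: pick the class whose terms overlap the text the most."""
    words = set(text.replace(".", " ").split())
    return max(CLASS_TERMS, key=lambda c: len(words & CLASS_TERMS[c]))


def classify_video(tokens, frame_tokens, pixels, weights):
    """Run the claimed steps in order and return (keywords, class)."""
    fused = fuse(
        encode_text(tokens),
        encode_frame_text(frame_tokens, tokens),
        encode_key_frame(pixels),
    )
    keywords = label_words(tokens, fused, weights)
    return keywords, classify(build_text(keywords))


# Invented sample input: text content, text seen in a frame, one 2x2 key frame.
keywords, video_class = classify_video(
    tokens=["how", "to", "make", "dumpling"],
    frame_tokens=["dumpling"],
    pixels=[[0, 255], [255, 0]],
    weights=[0.5, 0.0, 1.0, 0.2],  # hypothetical labeling weights
)
print(keywords, video_class)
```

In this sketch the frame-text feature carries most of the labeling weight, so a token is tagged as a keyword mainly when it also appears as on-screen text. That is only one way the fused feature could influence the labeling. Appending background knowledge before classification gives the classifier terms that the short keyword alone does not contain.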