US 12,067,759 B2
Method of constructing transformer model for answering questions about video story and computing apparatus for performing the same
Byoung-Tak Zhang, Seoul (KR); and Seongho Choi, Seoul (KR)
Assigned to SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION, Seoul (KR)
Appl. No. 17/615,662
Filed by SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION, Seoul (KR)
PCT Filed Sep. 28, 2021, PCT No. PCT/KR2021/013257
§ 371(c)(1), (2) Date Dec. 1, 2021,
PCT Pub. No. WO2023/286914, PCT Pub. Date Jan. 19, 2023.
Claims priority of application No. 10-2021-0093486 (KR), filed on Jul. 16, 2021.
Prior Publication US 2024/0037896 A1, Feb. 1, 2024
Int. Cl. G06V 10/44 (2022.01); G06V 10/25 (2022.01); G06V 10/764 (2022.01); G06V 40/20 (2022.01); H04N 21/488 (2011.01)
CPC G06V 10/44 (2022.01) [G06V 10/25 (2022.01); G06V 10/764 (2022.01); G06V 40/20 (2022.01); H04N 21/4884 (2013.01)] 11 Claims
OG exemplary drawing
 
1. A method of constructing a transformer model for answering questions about a video story, the method comprising:
extracting feature vectors related to each character of a video from video data, which includes vision data and subtitle data, and from question data for video question answering, and generating an input embedding using the feature vectors related to the character; and
training a transformer model using the input embedding,
wherein generating the input embedding comprises:
classifying the vision data, the subtitle data, and the question data into a plurality of categories;
extracting feature vectors for each of the plurality of categories;
generating a feature embedding, a segment embedding, and a position embedding using the extracted feature vectors; and
generating the input embedding by summing the feature embedding, the segment embedding, and the position embedding, and
wherein the plurality of categories includes one or more categories related to features of the character.
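 
To illustrate the embedding-generation step recited in claim 1, the following is a minimal PyTorch sketch of summing a feature embedding, a segment embedding, and a position embedding into a single transformer input embedding. The class name StoryInputEmbedding, the dimensions, and the category/segment index scheme are illustrative assumptions, not the patent's disclosed implementation.

```python
import torch
import torch.nn as nn

class StoryInputEmbedding(nn.Module):
    """Sketch of claim 1's input-embedding step: a feature embedding,
    a segment embedding, and a position embedding are generated and summed."""

    def __init__(self, feat_dim: int, hidden_dim: int,
                 num_segments: int, max_len: int):
        super().__init__()
        # Projects per-category feature vectors (vision, subtitle, question,
        # character-related features) into the model dimension.
        self.feature_proj = nn.Linear(feat_dim, hidden_dim)
        # One learned segment vector per category (segment embedding).
        self.segment_emb = nn.Embedding(num_segments, hidden_dim)
        # Learned position embedding over sequence positions.
        self.position_emb = nn.Embedding(max_len, hidden_dim)

    def forward(self, features: torch.Tensor,
                segment_ids: torch.Tensor) -> torch.Tensor:
        # features:    (batch, seq_len, feat_dim) extracted feature vectors
        # segment_ids: (batch, seq_len) category index of each token
        seq_len = features.size(1)
        positions = torch.arange(seq_len, device=features.device).unsqueeze(0)
        # Input embedding = feature emb + segment emb + position emb.
        return (self.feature_proj(features)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))

# Hypothetical usage: 4 categories (e.g., vision, subtitle, question,
# character features), 512-d features, 768-d model, batch of 2.
emb = StoryInputEmbedding(feat_dim=512, hidden_dim=768,
                          num_segments=4, max_len=256)
feats = torch.randn(2, 100, 512)
segs = torch.randint(0, 4, (2, 100))
x = emb(feats, segs)   # (2, 100, 768), ready for a transformer encoder
```

The summed result feeds the transformer model trained in the second step of the claimed method; the segment embedding is what lets the model distinguish which category, including the character-related categories, each feature vector came from.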