CPC G06V 10/44 (2022.01) [G06V 10/25 (2022.01); G06V 10/764 (2022.01); G06V 40/20 (2022.01); H04N 21/4884 (2013.01)] | 11 Claims |
1. A method of constructing a transformer model for answering questions about a video story, the method comprising:
extracting feature vectors related to each character of a video from video data including vision data and subtitle data and question data for video questions and answers, and generating an input embedding using the feature vectors related to the character; and
training a transformer model using the input embedding,
wherein generating the input embedding comprises:
classifying the vision data, the subtitle data, and the question data into a plurality of categories;
extracting feature vectors for the plurality of respective categories;
generating a feature embedding, a segment embedding, and a position embedding using the extracted feature vectors; and
generating the input embedding by summing the feature embedding, the segment embedding, and the position embedding, and
wherein the plurality of categories includes one or more categories related to features of the character.
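
The embedding step recited above can be illustrated with a minimal sketch. It is not the claimed implementation. The category names, the per-category linear projection used to form the feature embedding, and the lookup tables for the segment and position embeddings are all assumptions made for illustration. The claim itself requires only that the three embeddings be generated from the extracted feature vectors and summed, and that at least one category relate to features of the character.

```python
# Illustrative sketch only: category names, sizes, and tables are assumptions,
# not details taken from the claim.
CATEGORIES = ["question", "subtitle_text", "character_face", "character_action"]


def project(matrix, vector):
    """Map a raw feature vector to the model width (plain matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def build_input_embedding(tokens, projections, segment_table, position_table):
    """Return one summed embedding per token.

    tokens: list of (category, raw_feature_vector) pairs in sequence order.
    projections: dict mapping category -> projection matrix (one row per model dim).
    segment_table: one vector per category, indexed in CATEGORIES order.
    position_table: one vector per sequence position.
    """
    embeddings = []
    for position, (category, raw) in enumerate(tokens):
        feature = project(projections[category], raw)        # feature embedding
        segment = segment_table[CATEGORIES.index(category)]  # segment embedding
        pos_vec = position_table[position]                   # position embedding
        # Element-wise sum of the three embeddings, as recited in the claim.
        embeddings.append([f + s + p for f, s, p in zip(feature, segment, pos_vec)])
    return embeddings


if __name__ == "__main__":
    # Toy data: 2-dimensional raw features projected to a model width of 3.
    projections = {c: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] for c in CATEGORIES}
    segment_table = [[0.1 * i] * 3 for i in range(len(CATEGORIES))]
    position_table = [[float(p)] * 3 for p in range(4)]
    tokens = [("question", [1.0, 0.0]), ("character_face", [0.0, 2.0])]
    for row in build_input_embedding(tokens, projections, segment_table, position_table):
        print(row)
```

In this sketch the segment embedding tells the model which category a token came from, so character-related categories such as the hypothetical `character_face` get their own segment vector. The summed vectors would then serve as the input sequence for training the transformer model.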