CPC G06V 10/806 (2022.01) [G06V 10/62 (2022.01); G06V 10/764 (2022.01); G06V 30/18 (2022.01)] | 22 Claims |
1. An electronic device for multimodal temporal-axis fusion artificial intelligence models, the electronic device comprising:
a storage unit; and
a processor,
wherein the processor obtains a plurality of first visual features respectively corresponding to a plurality of different time points or time periods from a video, obtains a plurality of text features respectively corresponding to the plurality of time points or time periods from text, obtains a plurality of first local fusion features, respectively corresponding to the plurality of time points or time periods, from the plurality of first visual features and the plurality of text features by fusing the first visual features and the text features, which correspond to a same time point or time period, and obtains at least one global fusion feature from the plurality of first local fusion features, and
in order to obtain the plurality of text features, the processor obtains a plurality of first intermediate text features from the text, obtains a plurality of second intermediate text features respectively corresponding to the plurality of time points or time periods from the plurality of first intermediate text features by using a plurality of mapping layers respectively corresponding to the plurality of time points or time periods, and obtains the plurality of text features from the plurality of second intermediate text features by changing dimensions of the second intermediate text features.
|
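The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the claim does not specify the mapping layers, the fusion operation, or the global aggregation. The sketch assumes plain linear maps without bias for the per-time mapping layers, mean pooling of the first intermediate text features, a shared projection for the dimension change, element-wise addition for local fusion, and averaging over time for the global fusion feature. All function names, sizes, and random weights are hypothetical.

```python
import random

def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector; no bias (assumption)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed pooling choice)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def per_time_text_features(first_intermediate, mapping_layers, projection):
    """Text branch: first intermediate -> second intermediate -> text features."""
    pooled = mean_pool(first_intermediate)
    # One mapping layer per time point yields one second intermediate
    # text feature per time point.
    second_intermediate = [linear(layer, pooled) for layer in mapping_layers]
    # Change dimensions so each text feature matches the visual feature size.
    return [linear(projection, feat) for feat in second_intermediate]

def fuse_local(visual_features, text_features):
    """Fuse the visual and text features that share a time point (assumed: sum)."""
    return [[v + t for v, t in zip(vis, txt)]
            for vis, txt in zip(visual_features, text_features)]

def fuse_global(local_features):
    """Aggregate the local fusion features over time (assumed: mean)."""
    return mean_pool(local_features)

rng = random.Random(0)
T, N_TOKENS, D_TEXT, D_VIS = 4, 6, 8, 5  # hypothetical sizes

visual = random_matrix(T, D_VIS, rng)                # first visual features
first_intermediate = random_matrix(N_TOKENS, D_TEXT, rng)
mapping_layers = [random_matrix(D_TEXT, D_TEXT, rng) for _ in range(T)]
projection = random_matrix(D_VIS, D_TEXT, rng)

text = per_time_text_features(first_intermediate, mapping_layers, projection)
local = fuse_local(visual, text)
global_feature = fuse_global(local)
print(len(text), len(local), len(global_feature))
```

Because each time point has its own mapping layer, the same text produces a different text feature at each time point, so every visual feature is fused with a time-specific text feature before the global aggregation.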