US 12,299,967 B2
Electronic device and operation method thereof for multimodal temporal-axis fusion artificial intelligence models
Dong Chan Park, Seoul (KR); and Mobeen Ahmad, Seoul (KR)
Assigned to Pyler Co., Ltd., Seoul (KR)
Filed by PYLER CO., LTD., Seoul (KR)
Filed on Aug. 20, 2024, as Appl. No. 18/809,395.
Claims priority of application No. 10-2024-0029703 (KR), filed on Feb. 29, 2024.
Prior Publication US 2025/0078486 A1, Mar. 6, 2025
Int. Cl. G06K 9/00 (2022.01); G06V 10/62 (2022.01); G06V 10/764 (2022.01); G06V 10/80 (2022.01); G06V 30/18 (2022.01)

CPC G06V 10/806 (2022.01) [G06V 10/62 (2022.01); G06V 10/764 (2022.01); G06V 30/18 (2022.01)]

22 Claims

1. An electronic device for multimodal temporal-axis fusion artificial intelligence models, the electronic device comprising:

a storage unit; and

a processor,

wherein the processor obtains a plurality of first visual features respectively corresponding to a plurality of different time points or time periods from a video, obtains a plurality of text features respectively corresponding to the plurality of time points or time periods from text, obtains a plurality of first local fusion features, respectively corresponding to the plurality of time points or time periods, from the plurality of first visual features and the plurality of text features by fusing the first visual features and the text features, which correspond to a same time point or time period, and obtains at least one global fusion feature from the plurality of first local fusion features, and

in order to obtain the plurality of text features, the processor obtains a plurality of first intermediate text features from the text, obtains a plurality of second intermediate text features respectively corresponding to the plurality of time points or time periods from the plurality of first intermediate text features by using a plurality of mapping layers respectively corresponding to the plurality of time points or time periods, and obtains the plurality of text features from the plurality of second intermediate text features by changing dimensions of the second intermediate text features.
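
The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and the claim does not fix any of the concrete operations used here. The following are assumptions made only for illustration: the first intermediate text features are mean-pooled before mapping, each per-time-step mapping layer is a plain linear layer, "changing dimensions" is modeled as a shared linear projection to the visual feature size, local fusion is concatenation, and global fusion is a mean over the time axis. All function names are hypothetical.

```python
import random


def make_linear(rng, in_dim, out_dim):
    """Return (weights, bias) for a randomly initialised linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer, x):
    """Apply y = W x + b, where W is stored as a list of rows."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def obtain_text_features(first_intermediate, mapping_layers, projection):
    """Text branch: first intermediate -> second intermediate -> text features."""
    # Assumption: pool the first intermediate text features into one vector.
    pooled = mean_vectors(first_intermediate)
    # One mapping layer per time point yields one second intermediate
    # text feature per time point.
    second_intermediate = [linear(layer, pooled) for layer in mapping_layers]
    # Assumption: the dimension change is a shared linear projection.
    return [linear(projection, feat) for feat in second_intermediate]


def local_fusion(visual_features, text_features):
    """Fuse the visual and text features that share a time index.

    Assumption: fusion is concatenation of the two vectors.
    """
    return [v + t for v, t in zip(visual_features, text_features)]


def global_fusion(local_features):
    """Assumption: the global fusion feature is the mean over time."""
    return mean_vectors(local_features)


if __name__ == "__main__":
    rng = random.Random(0)
    num_steps, visual_dim, token_dim, hidden_dim = 3, 4, 6, 5

    # Stand-ins for encoder outputs; a real system would compute these
    # from video frames and text tokens.
    visual = [[rng.random() for _ in range(visual_dim)]
              for _ in range(num_steps)]
    tokens = [[rng.random() for _ in range(token_dim)] for _ in range(7)]

    mapping_layers = [make_linear(rng, token_dim, hidden_dim)
                      for _ in range(num_steps)]
    projection = make_linear(rng, hidden_dim, visual_dim)

    text = obtain_text_features(tokens, mapping_layers, projection)
    local = local_fusion(visual, text)
    fused = global_fusion(local)
    print(len(local), len(local[0]), len(fused))  # prints: 3 8 8
```

The sketch keeps one mapping layer per time point, so a single body of text yields a distinct text feature for each time point before it is paired with the visual feature at the same index.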