US 12,130,891 B2
Method of live video event detection based on natural language queries, and an apparatus for the same
Ning Ye, Toronto (CA); Zhiming Hu, Toronto (CA); Caleb Ryan Phillips, Toronto (CA); and Iqbal Ismail Mohomed, Toronto (CA)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed on Aug. 16, 2021, as Appl. No. 17/402,877.
Claims priority of provisional application 63/110,019, filed on Nov. 5, 2020.
Prior Publication US 2022/0138489 A1, May 5, 2022
Int. Cl. G06F 18/22 (2023.01); G06F 16/732 (2019.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01); G06V 20/40 (2022.01)

CPC G06F 18/22 (2023.01) [G06F 16/7343 (2019.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01); G06V 20/46 (2022.01); G06V 20/44 (2022.01)]

17 Claims

1. A method of real-time video event detection comprising:

obtaining, based on a natural language query, a query vector;

performing multimodal feature extraction on a video stream to obtain a video vector;

obtaining a similarity score by comparing the query vector to the video vector;

comparing the similarity score to a predetermined threshold; and

activating, based on the similarity score being above the predetermined threshold, an action trigger,

wherein the performing of the multimodal feature extraction is performed using a plurality of overlapping windows that include sequential frames of the video stream,

wherein the performing of the multimodal feature extraction comprises:

obtaining a latency constraint for the performing of the multimodal feature extraction; and

selecting a plurality of final feature extractors, among a plurality of predetermined feature extractors corresponding to a plurality of modalities, based on the latency constraint, predetermined performances of the plurality of predetermined feature extractors, and predetermined latencies of the plurality of predetermined feature extractors.
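
For illustration only, the steps recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claim does not specify the similarity measure, the window parameters, or the selection procedure. Cosine similarity, the `size` and `stride` parameters, the `Extractor` record, additive latencies, summed performance as the objective, and the limit of one extractor per modality are all hypothetical choices made here, as are all function names.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import Iterator, List, Sequence


def overlapping_windows(frames: Sequence, size: int, stride: int) -> Iterator[Sequence]:
    """Yield windows of `size` sequential frames; a stride below `size` makes them overlap."""
    if not 0 < stride <= size:
        raise ValueError("stride must satisfy 0 < stride <= size")
    for start in range(0, len(frames) - size + 1, stride):
        yield frames[start:start + size]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity score; the claim does not name a specific measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect(query_vector: Sequence[float],
           window_vectors: Sequence[Sequence[float]],
           threshold: float) -> List[int]:
    """Return indices of windows whose score is above the threshold (where a trigger would fire)."""
    return [i for i, v in enumerate(window_vectors)
            if cosine_similarity(query_vector, v) > threshold]


@dataclass(frozen=True)
class Extractor:
    """Hypothetical record of one predetermined feature extractor."""
    name: str
    modality: str
    performance: float  # predetermined performance, higher is better
    latency_ms: float   # predetermined latency


def select_extractors(candidates: Sequence[Extractor],
                      latency_budget_ms: float) -> List[Extractor]:
    """Brute-force selection of at least two extractors within the latency constraint.

    Assumptions (not from the claim): latencies add up, the objective is the
    sum of performances, and at most one extractor is used per modality.
    """
    best: List[Extractor] = []
    best_score = 0.0
    # Start at 2 because the claim recites a plurality of final extractors.
    for r in range(2, len(candidates) + 1):
        for subset in combinations(candidates, r):
            modalities = [e.modality for e in subset]
            if len(set(modalities)) != len(modalities):
                continue  # skip subsets with two extractors of the same modality
            if sum(e.latency_ms for e in subset) > latency_budget_ms:
                continue  # violates the latency constraint
            score = sum(e.performance for e in subset)
            if score > best_score:
                best, best_score = list(subset), score
    return best
```

In this sketch, each window from `overlapping_windows` would be passed through the extractors returned by `select_extractors` to form a video vector, and `detect` would report the windows for which an action trigger is activated. The exhaustive search is only practical for a small number of candidate extractors.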