CPC G06F 18/22 (2023.01) [G06F 16/7343 (2019.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01); G06V 20/46 (2022.01); G06V 20/44 (2022.01)] | 17 Claims |
1. A method of real-time video event detection comprising:
obtaining, based on a natural language query, a query vector;
performing multimodal feature extraction on a video stream to obtain a video vector;
obtaining a similarity score by comparing the query vector to the video vector;
comparing the similarity score to a predetermined threshold; and
activating, based on the similarity score being above the predetermined threshold, an action trigger,
wherein the performing of the multimodal feature extraction is performed using a plurality of overlapping windows that include sequential frames of the video stream,
wherein the performing of the multimodal feature extraction comprises:
obtaining a latency constraint for the performing of the multimodal feature extraction; and
selecting a plurality of final feature extractors, among a plurality of predetermined feature extractors corresponding to a plurality of modalities, based on the latency constraint, predetermined performances of the plurality of predetermined feature extractors, and predetermined latencies of the plurality of predetermined feature extractors.
|
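
For illustration only, the steps recited in claim 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation. The claim does not name a similarity measure, so cosine similarity is assumed here. It also does not say how the latency constraint, the predetermined performances, and the predetermined latencies are combined, so the sketch assumes that extractors run one after another (latencies add), that at most one extractor is chosen per modality, and that the best selection is the one with the highest summed performance that fits the budget. All names (`Extractor`, `select_extractors`, `overlapping_windows`, `should_trigger`) are hypothetical.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Extractor:
    """Hypothetical record for one predetermined feature extractor."""
    name: str
    modality: str
    performance: float  # predetermined performance, higher is better
    latency_ms: float   # predetermined latency


def select_extractors(candidates, latency_budget_ms):
    """Pick at least two extractors (a plurality), at most one per modality.

    Assumptions: latencies add up, and the best choice maximizes the
    summed performance while staying within the latency budget.
    """
    by_modality = {}
    for c in candidates:
        by_modality.setdefault(c.modality, []).append(c)
    # None means "skip this modality".
    options = [[None] + group for group in by_modality.values()]
    best, best_score = [], -1.0
    for combo in product(*options):
        chosen = [c for c in combo if c is not None]
        if len(chosen) < 2:
            continue
        if sum(c.latency_ms for c in chosen) > latency_budget_ms:
            continue
        score = sum(c.performance for c in chosen)
        if score > best_score:
            best, best_score = chosen, score
    return best


def overlapping_windows(frames, size, stride):
    """Yield windows of sequential frames; stride < size makes them overlap."""
    if not 0 < stride < size:
        raise ValueError("stride must be positive and smaller than size")
    for start in range(0, len(frames) - size + 1, stride):
        yield frames[start:start + size]


def cosine_similarity(a, b):
    """Assumed similarity measure; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_trigger(query_vec, video_vec, threshold):
    """True when the similarity score is above the predetermined threshold."""
    return cosine_similarity(query_vec, video_vec) > threshold
```

In use, `select_extractors` would be called once per latency constraint, each window from `overlapping_windows` would be passed through the selected extractors to produce a video vector, and `should_trigger` would decide whether to activate the action trigger. The exhaustive search is only reasonable for a small number of predetermined extractors.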