CPC G06V 20/44 (2022.01) [G06V 10/764 (2022.01); G06V 20/41 (2022.01)] (19 Claims)

1. A computer-implemented method comprising:
training a machine-learning (ML) model with training data comprising text-image pairs to create a semantic model, wherein the training data comprises embeddings for video frames, embeddings for object crops in the video frames, embeddings for text describing the video frames, embeddings for sensor readings associated with the video frames, and embeddings for audio associated with the video frames;
receiving a request to search for an event, the request comprising plain text with a description of the event in plain language;
generating, by the semantic model, a text embedding based on the description of the event in plain language;
accessing video data for an area being monitored;
determining metadata based on an image frame from the video data, the metadata configured to include attributes defining bounding boxes for objects detected in the image frame, people in the image frame, object identity, object color, object position, object movement, and vehicle information when a vehicle is detected in the image frame;
generating, by the semantic model, an image embedding of the image frame from the video data, the image embedding comprising information about pixels in the image frame and the metadata determined from the image frame;
calculating a similarity value between the text embedding and the image embedding; and
determining an occurrence of the event described in the request based on the similarity value being greater than a predetermined threshold.
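The matching steps of claim 1 (embed the query text, embed the frame with its metadata, score their similarity, and test the score against a threshold) can be sketched as follows. This is a minimal illustration, not the claimed implementation: `embed_text` and `embed_image` are hypothetical stand-ins for the trained semantic model's encoders, returning fixed vectors so the similarity and threshold logic runs self-contained, and the cosine similarity metric and the `0.9` threshold are assumptions (the claim requires only a similarity value and a predetermined threshold).

```python
import math

def cosine_similarity(a, b):
    # One common choice of similarity value between two embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def embed_text(description):
    # Placeholder: a real system would run the semantic model's text
    # encoder on the plain-language event description.
    return [0.9, 0.1, 0.4]

def embed_image(frame, metadata):
    # Placeholder: a real system would fuse pixel information from the
    # frame with the determined metadata (bounding boxes, object
    # attributes, vehicle information) into one embedding.
    return [0.8, 0.2, 0.5]

THRESHOLD = 0.9  # assumed value for the claim's "predetermined threshold"

def event_occurred(description, frame, metadata):
    # Occurrence is determined when similarity exceeds the threshold.
    sim = cosine_similarity(embed_text(description),
                            embed_image(frame, metadata))
    return sim > THRESHOLD
```

With the placeholder vectors above, `event_occurred("a red car leaving the lot", None, {})` evaluates the similarity against the assumed threshold and returns a boolean occurrence decision.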