| CPC G06V 20/49 (2022.01) [G06F 16/735 (2019.01); G06F 16/784 (2019.01); G06T 7/11 (2017.01); H04N 21/47205 (2013.01); G06T 2207/10021 (2013.01)] | 20 Claims |

|
1. A computer-implemented method, comprising:
receiving a query video sequence and memory data, the memory data including a memory video frame from the query video sequence and an annotated memory video frame corresponding to the memory video frame, the annotated memory video frame including an object mask for an object in the memory video frame;
segmenting the query video sequence into a plurality of query video clips;
generating, by an encoder, features from a first set of query video frames of a first query video clip of the plurality of query video clips and the memory data;
generating intra-clip value features for the first set of query video frames based on the features generated from the first set of query video frames and the memory data;
predicting a modified set of query video frames using the intra-clip value features, the modified set of query video frames including predictions of object masks for the object in each query video frame of the first set of query video frames; and
updating the memory data to include one or more frames of the first set of query video frames and one or more frames of the modified set of query video frames corresponding to the one or more frames of the first set of query video frames.
|
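The control flow recited in claim 1 can be illustrated with a minimal sketch. This is not the claimed implementation. It assumes toy stand-ins throughout: frames are flattened lists of grayscale pixels, the "encoder" returns raw pixel intensity as the feature, the memory read is a nearest-neighbour lookup, and the "intra-clip value features" are a blend of each frame's memory readout with the clip-wide mean. All names (`Memory`, `split_into_clips`, `encode`, `read_memory`, `intra_clip_values`, `predict_masks`, `segment_video`) and the parameters `clip_len` and `alpha` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

Frame = List[float]  # toy frame: flattened grayscale pixels (assumption)
Mask = List[int]     # toy object mask: 1 = object, 0 = background


@dataclass
class Memory:
    """Memory data: frames paired with their annotated (masked) versions."""
    frames: List[Frame] = field(default_factory=list)
    masks: List[Mask] = field(default_factory=list)


def split_into_clips(frames: List[Frame], clip_len: int) -> List[List[Frame]]:
    """Segment the query video sequence into consecutive clips."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def encode(frame: Frame) -> List[float]:
    """Toy encoder (assumption): pixel intensity is the per-pixel feature."""
    return list(frame)


def read_memory(query_feat: List[float], memory: Memory) -> List[float]:
    """For each query pixel, copy the mask label of the closest memory feature.

    Assumes the memory holds at least one annotated frame.
    """
    mem = [(f, float(m))
           for frame, mask in zip(memory.frames, memory.masks)
           for f, m in zip(encode(frame), mask)]
    return [min(mem, key=lambda fm: abs(fm[0] - q))[1] for q in query_feat]


def intra_clip_values(clip: List[Frame], memory: Memory,
                      alpha: float = 0.75) -> List[List[float]]:
    """Toy intra-clip value features: per-frame readout blended with clip mean."""
    readouts = [read_memory(encode(f), memory) for f in clip]
    clip_mean = [sum(col) / len(readouts) for col in zip(*readouts)]
    return [[alpha * v + (1 - alpha) * m for v, m in zip(r, clip_mean)]
            for r in readouts]


def predict_masks(values: List[List[float]]) -> List[Mask]:
    """Threshold the value features into one object mask per query frame."""
    return [[1 if v > 0.5 else 0 for v in frame_vals] for frame_vals in values]


def segment_video(frames: List[Frame], memory: Memory,
                  clip_len: int = 2) -> List[Mask]:
    """Process the video clip by clip, updating the memory after each clip."""
    all_masks: List[Mask] = []
    for clip in split_into_clips(frames, clip_len):
        masks = predict_masks(intra_clip_values(clip, memory))
        all_masks.extend(masks)
        # Memory update: store the clip's last frame and its predicted mask.
        memory.frames.append(clip[-1])
        memory.masks.append(masks[-1])
    return all_masks


# Usage: a bright object moves across a 2x2 frame; frame 0 is annotated.
video = [[0.9, 0.1, 0.1, 0.1], [0.1, 0.8, 0.1, 0.2],
         [0.2, 0.1, 0.9, 0.1], [0.1, 0.1, 0.2, 0.8]]
memory = Memory(frames=[video[0]], masks=[[1, 0, 0, 0]])
print(segment_video(video, memory))
# prints [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Storing only the last frame of each clip is one possible reading of "one or more frames" in the updating step; the claim does not fix which frames are kept, so any subset of the clip could be written back instead.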