| CPC G06V 20/41 (2022.01) [G06V 10/7715 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01)] | 29 Claims |

|
1. A processor-implemented method for processing a video stream using a transformer neural network, the method comprising:
obtaining, at a transformer neural network, a first group of tokens from a first frame of the video stream and a second group of tokens from a second frame of the video stream;
based on a comparison of tokens from the first group of tokens to corresponding tokens in the second group of tokens:
identifying a first set of tokens associated with first features to be reused from the first frame; and
identifying a second set of tokens associated with second features to be computed from the second frame;
computing, at the transformer neural network, the second features, wherein the computing the second features comprises:
converting, using a plurality of linear projection layers of a self-attention module of the transformer neural network, only the second set of tokens to queries, keys, and values;
generating an attention map based on a combination of the queries and the keys;
generating a first output set of tokens based on a combination of the attention map and the values; and
processing the first output set of tokens at an output projection layer of the self-attention module to generate a second output set of tokens; and
combining the first features associated with the first set of tokens with the second features associated with the second group of tokens into a representation of the second frame of the video stream.
|
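
The steps recited in claim 1 can be illustrated with a minimal sketch in plain Python. It is not the claimed implementation. Several details are assumptions that the claim does not specify: the comparison is a per-token maximum absolute difference against a hypothetical `threshold`; the "combination" of queries and keys is a scaled dot product followed by a softmax; and all names (`self_attention`, `process_frame`, `rand_matrix`, the weight matrices) are illustrative. Features of unchanged tokens are copied from the first frame, and only changed tokens pass through the linear projections, the attention map, and the output projection.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(tokens, w_q, w_k, w_v, w_o):
    """Single-head self-attention over the given tokens only."""
    # Linear projection layers: tokens -> queries, keys, values.
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    # Attention map from queries and keys (scaled dot product is an assumption).
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(q[0]))
    attn = softmax_rows([[s / scale for s in row] for row in matmul(q, k_t)])
    # First output set = attention map combined with values;
    # second output set = result of the output projection layer.
    return matmul(matmul(attn, v), w_o)


def process_frame(prev_tokens, cur_tokens, prev_features, weights, threshold):
    """Reuse features of unchanged tokens; recompute features of changed ones."""
    # Compare each token with its counterpart in the previous frame
    # (max absolute difference against a threshold is an assumed criterion).
    changed = [
        i for i, (p, c) in enumerate(zip(prev_tokens, cur_tokens))
        if max(abs(a - b) for a, b in zip(p, c)) > threshold
    ]
    # Start from the first frame's features (the reused set).
    features = [list(row) for row in prev_features]
    if changed:
        # Only the changed tokens are converted to queries, keys, and values.
        fresh = self_attention([cur_tokens[i] for i in changed], *weights)
        for i, row in zip(changed, fresh):
            features[i] = row
    return features, changed


def rand_matrix(rows, cols):
    """Hypothetical helper producing random weights or tokens."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim = 4
weights = tuple(rand_matrix(dim, dim) for _ in range(4))  # w_q, w_k, w_v, w_o

frame1 = rand_matrix(6, dim)               # six tokens from the first frame
frame2 = [list(row) for row in frame1]     # second frame: mostly identical
frame2[2] = [v + 1.0 for v in frame2[2]]   # tokens 2 and 5 change
frame2[5] = [v - 1.0 for v in frame2[5]]

features1 = self_attention(frame1, *weights)
features2, changed = process_frame(frame1, frame2, features1, weights, threshold=0.1)
print("recomputed token indices:", changed)
```

In this sketch the attention cost for the second frame scales with the number of changed tokens rather than with the full token count, which is the saving the claim is directed at. A practical system would operate on batched tensors and multiple attention heads; those aspects are omitted here.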