US 12,469,281 B2
Processing video content using gated transformer neural networks
Yawei Li, Erlenbach (CH); Bert Moons, Antwerp (BE); Tijmen Pieter Frederik Blankevoort, Amsterdam (NL); Amirhossein Habibian, Amsterdam (NL); and Babak Ehteshami Bejnordi, Amsterdam (NL)
Assigned to QUALCOMM INCORPORATED, San Diego, CA (US)
Filed by QUALCOMM Incorporated, San Diego, CA (US)
Filed on Sep. 20, 2022, as Appl. No. 17/933,840.
Claims priority of provisional application 63/246,643, filed on Sep. 21, 2021.
Prior Publication US 2023/0090941 A1, Mar. 23, 2023
Int. Cl. G06V 20/40 (2022.01); G06V 10/77 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01)
CPC G06V 20/41 (2022.01) [G06V 10/7715 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01)] 29 Claims
OG exemplary drawing
 
1. A processor-implemented method for processing a video stream using a transformer neural network, the method comprising:
obtaining, at a transformer neural network, a first group of tokens from a first frame of the video stream and a second group of tokens from a second frame of the video stream;
based on a comparison of tokens from the first group of tokens to corresponding tokens in the second group of tokens:
identifying a first set of tokens associated with first features to be reused from the first frame; and
identifying a second set of tokens associated with second features to be computed from the second frame;
computing, at the transformer neural network, the second features, wherein the computing the second features comprises:
converting, using a plurality of linear projection layers of a self-attention module of the transformer neural network, only the second set of tokens to queries, keys, and values;
generating an attention map based on a combination of the queries and the keys;
generating a first output set of tokens based on a combination of the attention map and the values; and
processing the first output set of tokens at an output projection layer of the self-attention module to generate a second output set of tokens; and
combining the first features associated with the first set of tokens with the second features associated with the second set of tokens into a representation of the second frame of the video stream.
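The claimed flow — gate tokens by an inter-frame comparison, run the self-attention projections only on the changed tokens, then merge reused and freshly computed features — can be sketched as follows. This is an illustrative NumPy sketch only: the L2-distance gate, the threshold `tau`, single-head attention, and all array sizes are assumptions for demonstration, not details recited in the claim.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 16  # tokens per frame, embedding dim (illustrative sizes)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(prev_tokens, curr_tokens, prev_features,
                         Wq, Wk, Wv, Wo, tau=0.5):
    """Sketch of the claimed method: reuse first-frame features for tokens
    that changed little between frames; compute self-attention only for
    the tokens that changed."""
    # Token-wise comparison between frames. The claim only requires
    # "a comparison"; an L2 distance with threshold tau is an assumption.
    diff = np.linalg.norm(curr_tokens - prev_tokens, axis=-1)
    compute = diff > tau          # second set: features to be computed
    reuse = ~compute              # first set: features to be reused

    # Only the second set of tokens is converted to queries, keys, values.
    x = curr_tokens[compute]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # attention map
    first_out = attn @ v          # first output set of tokens
    second_out = first_out @ Wo   # output projection -> second output set

    # Combine reused first-frame features with the computed features
    # into a representation of the second frame.
    out = np.empty_like(prev_features)
    out[reuse] = prev_features[reuse]
    out[compute] = second_out
    return out, compute

prev = rng.standard_normal((N, d))
curr = prev.copy()
curr[[1, 4]] += 2.0               # only two tokens change appreciably
feats_prev = rng.standard_normal((N, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
out, computed = gated_self_attention(prev, curr, feats_prev, Wq, Wk, Wv, Wo)
print(computed.sum())  # number of tokens actually recomputed
```

Because the projection layers and the attention map operate only on the changed tokens, the per-frame cost of the self-attention module scales with the number of changed tokens rather than with the full token count, which is the efficiency benefit the gated transformer targets.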