US 12,482,251 B2
Systems, methods and techniques for learning and using sparse instance-dependent attention for efficient vision transformers
Cong Wei, Toronto (CA); Brendan Duke, Toronto (CA); Ruowei Jiang, Toronto (CA); and Parham Aarabi, Richmond Hill (CA)
Assigned to L'OREAL, Paris (FR)
Filed by L'OREAL, Paris (FR)
Filed on Apr. 27, 2023, as Appl. No. 18/140,055.
Prior Publication US 2024/0362902 A1, Oct. 31, 2024
Int. Cl. G09G 5/00 (2006.01); G06T 7/246 (2017.01); G06T 11/60 (2006.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/20 (2022.01); G06V 40/16 (2022.01)
CPC G06V 10/82 (2022.01) [G06T 7/246 (2017.01); G06T 11/60 (2013.01); G06V 10/764 (2022.01); G06V 20/20 (2022.01); G06V 40/161 (2022.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30201 (2013.01)] 21 Claims
OG exemplary drawing
 
1. A computing device comprising a processor and a non-transitory storage device storing instructions that, when executed by the processor, cause the computing device to perform steps for image processing of an image or a series of images, the steps comprising:
storing a deep neural network model defining a Vision Transformer (ViT); and
processing the image or the series of images with the ViT to provide the image processing for the image or the series of images;
wherein the ViT comprises a plurality of multi-head self-attention (MHSA) modules arranged in a plurality of successive layers and each MHSA module is configured to, in respect of a layer l of the plurality of successive layers: use instance-dependent and meaningful sparse full-rank attention patterns that limit the number of connections as determined by a trained lightweight connectivity predictor module configured to estimate a connectivity score for each pair of input tokens for the layer l, including determining a low-rank approximation matrix A_down of a full-rank attention matrix A; thresholding elements of matrix A_down to provide a sparse matrix Ã_down; and using matrix Ã_down for performing sparse matrix computation techniques to provide the sparse full-rank attention patterns as a sparse full-rank attention matrix Ã to accelerate the ViT.
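The sketch below is a minimal, hypothetical illustration of the mechanism the claim describes: a lightweight connectivity predictor forms a low-rank approximation A_down of the attention matrix, thresholds it into an instance-dependent sparse pattern, and that pattern then restricts the full-rank MHSA computation. It is not the patented implementation; the module and parameter names (ConnectivityPredictor, SparseMHSA, down_dim, keep_threshold) are assumptions, and a dense masked softmax stands in for the sparse matrix computation techniques that would deliver the actual acceleration.

```python
# Illustrative sketch only; not the claimed implementation.
import torch
import torch.nn as nn


class ConnectivityPredictor(nn.Module):
    """Estimates pairwise connectivity scores cheaply by projecting tokens
    to a small dimension, yielding a low-rank approximation A_down."""

    def __init__(self, dim: int, down_dim: int = 16):
        super().__init__()
        self.q_down = nn.Linear(dim, down_dim, bias=False)
        self.k_down = nn.Linear(dim, down_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> A_down: (batch, tokens, tokens)
        q, k = self.q_down(x), self.k_down(x)
        return torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)


class SparseMHSA(nn.Module):
    """MHSA whose attention is restricted to the token pairs that survive
    thresholding of the predicted connectivity scores (sparse A_down)."""

    def __init__(self, dim: int, num_heads: int = 8, keep_threshold: float = 1e-2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.predictor = ConnectivityPredictor(dim)
        self.keep_threshold = keep_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape

        # 1) Predict low-rank connectivity and threshold it into an
        #    instance-dependent boolean sparsity pattern (sparse A_down).
        a_down = self.predictor(x)                        # (b, n, n)
        mask = a_down >= self.keep_threshold
        # Always keep the diagonal so no token is left without connections.
        mask = mask | torch.eye(n, dtype=torch.bool, device=x.device)

        # 2) Compute full-rank attention logits, zeroing out disallowed pairs.
        #    A real implementation would use sparse kernels here for speed.
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each (b, heads, n, head_dim)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        logits = logits.masked_fill(~mask.unsqueeze(1), float("-inf"))
        attn = torch.softmax(logits, dim=-1)              # sparse full-rank attention Ã

        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 256)                     # e.g. 14x14 patch tokens
    layer = SparseMHSA(dim=256)
    print(layer(tokens).shape)                            # torch.Size([2, 196, 256])
```

In this sketch the speedup is only latent: the boolean mask identifies which attention entries need to be computed, and a production version would hand that pattern to sparse matrix routines rather than materializing the dense logits.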