CPC: G06V 10/82 (2022.01) [G06T 7/246 (2017.01); G06T 11/60 (2013.01); G06V 10/764 (2022.01); G06V 20/20 (2022.01); G06V 40/161 (2022.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30201 (2013.01)]
21 Claims

1. A computing device comprising a processor and a non-transitory storage device storing instructions that, when executed by the processor, cause the computing device to perform steps for image processing of an image or a series of images, the steps comprising:
storing a deep neural network model defining a Vision Transformer (ViT); and
processing the image or the series of images with the ViT to provide the image processing for the image or the series of images;
wherein the ViT comprises a plurality of multi-head self-attention (MHSA) modules arranged in successive layers, and each MHSA module is configured to, in respect of a layer l of the plurality of layers, use instance-dependent and meaningful sparse full-rank attention patterns that limit a number of connections as determined by a trained lightweight connectivity predictor module that estimates a connectivity score for each pair of input tokens of the layer l, including:
determining a low-rank approximation matrix A_down of a full-rank attention matrix A;
thresholding elements of the matrix A_down to provide a sparse matrix Ã_down; and
using the matrix Ã_down to perform sparse matrix computation techniques that provide the sparse full-rank attention patterns as a sparse full-rank attention matrix Ã, thereby accelerating the ViT.
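By way of illustration only, and not as a description of the claimed implementation, the following Python (PyTorch) sketch shows one plausible realization of the claimed steps for a single attention head: a downsampling-based connectivity predictor producing the low-rank matrix A_down, thresholding to obtain Ã_down, and masked attention standing in for dedicated sparse-matrix kernels. The names sparse_full_rank_attention, tau, and down, and the average-pooling form of the predictor, are assumptions introduced for this sketch.

```python
# Illustrative sketch only; names, the pooling-based connectivity predictor,
# and the dense masking (in place of true sparse kernels) are assumptions.
import torch
import torch.nn.functional as F

def sparse_full_rank_attention(Q, K, V, tau=0.05, down=4):
    """Q, K, V: (n_tokens, d) tensors for one head of layer l.
    Assumes n_tokens is divisible by `down`."""
    d = Q.shape[-1]

    # Lightweight connectivity predictor (assumed form): score token pairs
    # on downsampled queries/keys, yielding a low-rank approximation A_down
    # of the full-rank attention matrix A.
    Q_down = F.avg_pool1d(Q.t().unsqueeze(0), down).squeeze(0).t()  # (n/down, d)
    K_down = F.avg_pool1d(K.t().unsqueeze(0), down).squeeze(0).t()  # (n/down, d)
    A_down = torch.softmax(Q_down @ K_down.t() / d**0.5, dim=-1)

    # Threshold elements of A_down to obtain the sparse matrix Ã_down:
    # keep only token-pair blocks whose connectivity score exceeds tau.
    mask_down = A_down > tau                        # (n/down, n/down)

    # Upsample the block mask to full resolution and attend only where the
    # mask is set, giving the sparse full-rank attention matrix Ã. A real
    # implementation would dispatch sparse-matrix kernels here; the dense
    # masked softmax is used for clarity.
    mask = mask_down.repeat_interleave(down, 0).repeat_interleave(down, 1)
    mask.fill_diagonal_(True)                       # keep self-connections
    scores = Q @ K.t() / d**0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    A_tilde = torch.softmax(scores, dim=-1)         # sparse full-rank Ã
    return A_tilde @ V
```

In this sketch the predictor cost scales with (n/down)^2 rather than n^2, and the retained attention entries keep their full-rank values, which is what distinguishes the scheme from purely low-rank approximations.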