US 12,236,668 B2
Efficientformer vision transformer
Jian Ren, Hermosa Beach, CA (US); Yang Wen, San Jose, CA (US); Ju Hu, Los Angeles, CA (US); Georgios Evangelidis, Vienna (AT); Sergey Tulyakov, Santa Monica, CA (US); Yanyu Li, Malden, MA (US); and Geng Yuan, Medford, MA (US)
Assigned to Snap Inc., Santa Monica, CA (US)
Filed by Jian Ren, Hermosa Beach, CA (US); Yang Wen, San Jose, CA (US); Ju Hu, Los Angeles, CA (US); Georgios Evangelidis, Vienna (AT); Sergey Tulyakov, Santa Monica, CA (US); Yanyu Li, Malden, MA (US); and Geng Yuan, Medford, MA (US)
Filed on Jul. 14, 2022, as Appl. No. 17/865,178.
Prior Publication US 2024/0020948 A1, Jan. 18, 2024
Int. Cl. G06K 9/00 (2022.01); G06V 10/44 (2022.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01)
CPC G06V 10/764 (2022.01) [G06V 10/454 (2022.01); G06V 10/82 (2022.01)] 15 Claims
OG exemplary drawing
 
7. A method of using a vision transformer, comprising:
receiving and convoluting an image using a convolution stem to patch embed the image;
receiving and processing the patch embedded image using a first stack of stages including at least two stages of 4-Dimension metablocks (MBs) (MB4D); and
receiving the processed image from the first stack of stages and further processing the processed image using a second stack of stages including 3-Dimension MBs (MB3D), wherein each of the MB4D stages and each of the MB3D stages include different layer configurations, wherein each of the MB4D stages and each of the MB3D stages include a token mixer, and wherein each of the MB4D stages and each of the MB3D stages include pooling and multi-head self-attention, arranged in a dimension-consistent manner.
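The claimed flow lends itself to a short illustration. Below is a minimal, hypothetical PyTorch sketch of the claimed pipeline: a convolution stem patch-embeds the image, a first stack of MB4D metablocks processes it in 4-D (B, C, H, W) form with a pooling token mixer, and a second stack of MB3D metablocks processes the flattened 3-D (B, N, C) tokens with multi-head self-attention, keeping each operator in its native layout (the dimension-consistent arrangement). All module names, channel widths, and stage depths here are illustrative assumptions, not the configuration disclosed in the specification.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Convolution stem that patch-embeds the input image (hypothetical sizes)."""
    def __init__(self, in_ch=3, embed_dim=48):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2), nn.ReLU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim), nn.ReLU(),
        )

    def forward(self, x):            # (B, 3, H, W) -> (B, C, H/4, W/4)
        return self.stem(x)

class MB4D(nn.Module):
    """4-D metablock: pooling token mixer plus a convolutional MLP on (B, C, H, W)."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.token_mixer = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = x + self.token_mixer(x)  # pooling token mixer with residual connection
        x = x + self.mlp(x)          # channel MLP with residual connection
        return x

class MB3D(nn.Module):
    """3-D metablock: multi-head self-attention token mixer plus a linear MLP on (B, N, C)."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]  # self-attention token mixer
        x = x + self.mlp(self.norm2(x))                    # token-wise MLP
        return x

class TinyEfficientFormerLike(nn.Module):
    """Dimension-consistent stack: conv stem -> MB4D stages (4-D) -> reshape -> MB3D stages (3-D)."""
    def __init__(self, embed_dim=48, num_classes=1000):
        super().__init__()
        self.stem = ConvStem(embed_dim=embed_dim)
        self.stage4d = nn.Sequential(MB4D(embed_dim), MB4D(embed_dim))  # at least two MB4D stages
        self.stage3d = nn.Sequential(MB3D(embed_dim), MB3D(embed_dim))  # MB3D stages
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, img):
        x = self.stem(img)                # patch embedding via convolution
        x = self.stage4d(x)               # 4-D metablocks on (B, C, H, W)
        x = x.flatten(2).transpose(1, 2)  # single reshape to 3-D tokens (B, N, C)
        x = self.stage3d(x)               # 3-D metablocks with self-attention
        return self.head(x.mean(dim=1))   # global average pool + classifier

if __name__ == "__main__":
    model = TinyEfficientFormerLike()
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```

In this sketch the single reshape between the two stacks is the only layout change, which is the point of the dimension-consistent arrangement: the pooling token mixer stays in convolutional 4-D form, while multi-head self-attention operates on the 3-D token sequence.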