US 12,236,668 B2
Efficientformer vision transformer
Jian Ren, Hermosa Beach, CA (US); Yang Wen, San Jose, CA (US); Ju Hu, Los Angeles, CA (US); Georgios Evangelidis, Vienna (AT); Sergey Tulyakov, Santa Monica, CA (US); Yanyu Li, Malden, MA (US); and Geng Yuan, Medford, MA (US)
Assigned to Snap Inc., Santa Monica, CA (US)
Filed by Jian Ren, Hermosa Beach, CA (US); Yang Wen, San Jose, CA (US); Ju Hu, Los Angeles, CA (US); Georgios Evangelidis, Vienna (AT); Sergey Tulyakov, Santa Monica, CA (US); Yanyu Li, Malden, MA (US); and Geng Yuan, Medford, MA (US)
Filed on Jul. 14, 2022, as Appl. No. 17/865,178.
Prior Publication US 2024/0020948 A1, Jan. 18, 2024
Int. Cl. G06K 9/00 (2022.01); G06V 10/44 (2022.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01)

CPC G06V 10/764 (2022.01) [G06V 10/454 (2022.01); G06V 10/82 (2022.01)]

15 Claims

7. A method of using a vision transformer, comprising:

receiving and convoluting an image using a convolution stem to patch embed the image;

receiving and processing the patch embedded image using a first stack of stages including at least two stages of 4-Dimension metablocks (MBs) (MB^4D); and

receiving the processed image from the first stack of stages and further processing the processed image using a second stack of stages including 3-Dimension MBs (MB^3D), wherein each of the MB^4D stages and each of the MB^3D stages include different layer configurations, wherein each of the MB^4D stages and each of the MB^3D stages include a token mixer, and wherein each of the MB^4D stages and each of the MB^3D stages include pooling and multi-head self-attention, arranged in a dimension-consistent manner.
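
The data flow recited in claim 7 can be illustrated with a shape-only sketch. It is not the patented implementation: it traces tensor shapes through a convolution stem, a stack of MB^4D stages operating on (batch, channels, height, width) tensors, and a stack of MB^3D stages operating on (batch, tokens, channels) tensors. The function names, the stem stride of 4, the channel widths (48, 96, 224), the use of three MB^4D stages, and the halving of resolution between MB^4D stages are all assumptions chosen for illustration; the claim does not state them.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def conv_stem(shape: Shape, out_channels: int, stride: int = 4) -> Shape:
    """Patch-embed a (B, C, H, W) image with a strided convolution (shape only).

    The stride of 4 is an assumption, not taken from the claim.
    """
    b, _, h, w = shape
    return (b, out_channels, h // stride, w // stride)


def mb4d_stage(shape: Shape, out_channels: int, downsample: bool) -> Shape:
    """One MB^4D stage: keeps the 4-D (B, C, H, W) layout throughout.

    Halving the resolution between stages is an assumption.
    """
    b, _, h, w = shape
    if downsample:
        h, w = h // 2, w // 2
    return (b, out_channels, h, w)


def to_tokens(shape: Shape) -> Shape:
    """Flatten the spatial grid once: (B, C, H, W) -> (B, H*W, C)."""
    b, c, h, w = shape
    return (b, h * w, c)


def mb3d_stage(shape: Shape) -> Shape:
    """One MB^3D stage: token mixing preserves the 3-D (B, N, C) layout."""
    return shape


def trace_shapes(
    image_shape: Shape, widths: Tuple[int, ...] = (48, 96, 224)
) -> List[Tuple[str, Shape]]:
    """Return the (name, shape) pairs seen along the claimed pipeline."""
    trace = [("input", image_shape)]
    x = conv_stem(image_shape, widths[0])
    trace.append(("stem", x))
    # First stack: 4-D metablock stages (the claim requires at least two).
    for i, channels in enumerate(widths):
        x = mb4d_stage(x, channels, downsample=i > 0)
        trace.append((f"mb4d_{i + 1}", x))
    # Single layout change, then the second stack of 3-D metablock stages.
    x = to_tokens(x)
    trace.append(("flatten", x))
    x = mb3d_stage(x)
    trace.append(("mb3d_1", x))
    return trace


if __name__ == "__main__":
    for name, shape in trace_shapes((1, 3, 224, 224)):
        print(f"{name:8s} {shape}")
```

In this sketch the layout changes exactly once, between the two stacks, which is one way to read "arranged in a dimension-consistent manner": every 4-D block precedes every 3-D block, so no stage has to reshape back and forth.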