US 11,887,270 B2
Multi-scale transformer for image analysis
Junjie Ke, East Palo Alto, CA (US); Feng Yang, Sunnyvale, CA (US); Qifei Wang, Mountain View, CA (US); Yilin Wang, Sunnyvale, CA (US); and Peyman Milanfar, Menlo Park, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Appl. No. 17/787,699
Filed by Google LLC, Mountain View, CA (US)
PCT Filed Jul. 1, 2021, PCT No. PCT/US2021/040111 § 371(c)(1), (2) Date Jun. 21, 2022, PCT Pub. No. WO2023/277919, PCT Pub. Date Jan. 5, 2023.
Prior Publication US 2023/0222623 A1, Jul. 13, 2023
Int. Cl. G06K 9/00 (2022.01); G06T 3/00 (2006.01); G06T 3/40 (2006.01); G06T 7/00 (2017.01)

CPC G06T 3/0012 (2013.01) [G06T 3/40 (2013.01); G06T 7/0002 (2013.01); G06T 2207/20016 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/30168 (2013.01)]

22 Claims

1. A method for processing imagery, the method comprising:

constructing, by one or more processors, a multi-scale representation of a native resolution image, the multi-scale representation including the native resolution image and a set of aspect ratio preserving resized variants;

encoding, by the one or more processors, a corresponding spatial embedding for each patch associated with a respective region of either the native resolution image or one of the set of aspect ratio preserving resized variants, thereby forming a set of spatially encoded patches;

applying, by the one or more processors, a set of scale embeddings to the set of spatially encoded patches to capture scale information associated with the native resolution image and the set of aspect ratio resized variants, thereby forming a set of input tokens; and

performing, by the one or more processors according to a transformer encoder module, self-attention on the set of input tokens to create a final image representation.
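The four steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation, and everything specific in it is an assumption made for illustration: the helper names (`resize_keep_aspect`, `extract_patches`, `build_tokens`, `self_attention`), nearest-neighbour resizing, zero-padded non-overlapping patches, a fixed `grid` x `grid` table of spatial embeddings indexed by each patch's relative position, additive embeddings, randomly initialised (untrained) weights, a single attention head, and mean pooling to obtain the final representation.

```python
import numpy as np

def resize_keep_aspect(image, longer_side):
    """Nearest-neighbour resize so the longer side equals `longer_side`."""
    h, w = image.shape[:2]
    scale = longer_side / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return image[rows][:, cols]

def extract_patches(image, patch):
    """Split into non-overlapping patch x patch tiles, zero-padding the edges."""
    h, w, c = image.shape
    gh, gw = -(-h // patch), -(-w // patch)  # ceiling division
    padded = np.zeros((gh * patch, gw * patch, c), dtype=image.dtype)
    padded[:h, :w] = image
    tiles = padded.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(gh * gw, patch * patch * c), (gh, gw)

def build_tokens(image, longer_sides, patch, dim, grid, rng):
    """Multi-scale representation -> spatially encoded patches -> input tokens."""
    proj = rng.normal(0, 0.02, (patch * patch * image.shape[2], dim))
    spatial = rng.normal(0, 0.02, (grid, grid, dim))             # assumed lookup table
    scale_emb = rng.normal(0, 0.02, (len(longer_sides) + 1, dim))
    # Native resolution image plus aspect ratio preserving resized variants.
    variants = [image] + [resize_keep_aspect(image, s) for s in longer_sides]
    tokens = []
    for k, variant in enumerate(variants):
        flat, (gh, gw) = extract_patches(variant, patch)
        # Map each patch's grid position onto the shared spatial table.
        i = np.arange(gh) * grid // gh
        j = np.arange(gw) * grid // gw
        pos = spatial[i[:, None], j[None, :]].reshape(gh * gw, dim)
        # Spatial embedding per patch, then one scale embedding per variant.
        tokens.append(flat @ proj + pos + scale_emb[k])
    return np.concatenate(tokens, axis=0)

def self_attention(x, rng):
    """Single-head scaled dot-product self-attention with random weights."""
    d = x.shape[1]
    wq, wk, wv = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
image = rng.random((48, 64, 3))  # stand-in for a native resolution image
tokens = build_tokens(image, longer_sides=(32, 16), patch=8, dim=16, grid=10, rng=rng)
representation = self_attention(tokens, rng).mean(axis=0)  # pooled final representation
```

With these illustrative settings the native image contributes 48 patches and the two resized variants contribute 12 and 4, so self-attention operates over 64 tokens drawn from all scales at once. Because the number of tokens depends on the input size, the sketch accepts images of any resolution and aspect ratio without cropping or resizing to a fixed shape.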