US 11,915,474 B2
Regional-to-local attention for vision transformers
Richard Chen, Baldwin Place, NY (US); Rameswar Panda, Medford, MA (US); and Quanfu Fan, Lexington, MA (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on May 31, 2022, as Appl. No. 17/804,724.
Prior Publication US 2023/0386197 A1, Nov. 30, 2023
Int. Cl. G06V 10/96 (2022.01); G06V 10/25 (2022.01); G06V 10/77 (2022.01)
CPC G06V 10/96 (2022.01) [G06V 10/25 (2022.01); G06V 10/7715 (2022.01)]
20 Claims
 
1. A computer-implemented method comprising:
generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item;
generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item;
generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and
performing at least one vision task based on the at least one feature map.
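
The following is an illustrative sketch only, not the patented implementation. It shows one possible reading of the first three steps of claim 1: one regional token per region, several local tokens per region, and attention applied to the two sets in separate passes. The function names (`mean_pool`, `tokenize`, `self_attention`, `regional_to_local`), the use of mean pooling in place of a learned patch embedding, the identity query/key/value projections, and the two-pass ordering are all assumptions made for illustration; the claim does not specify any of them.

```python
import math

def mean_pool(image, top, left, size):
    """Average a size x size window per channel (stand-in for a learned embedding)."""
    channels = len(image[0][0])
    sums = [0.0] * channels
    for r in range(top, top + size):
        for c in range(left, left + size):
            for ch in range(channels):
                sums[ch] += image[r][c][ch]
    return [s / (size * size) for s in sums]

def tokenize(image, region_size, patch_size):
    """Return (regional, local): one regional token per region, and
    local[i] holding the local tokens that fall inside region i."""
    height, width = len(image), len(image[0])
    regional, local = [], []
    for top in range(0, height, region_size):
        for left in range(0, width, region_size):
            regional.append(mean_pool(image, top, left, region_size))
            local.append([
                mean_pool(image, top + dr, left + dc, patch_size)
                for dr in range(0, region_size, patch_size)
                for dc in range(0, region_size, patch_size)
            ])
    return regional, local

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V (an assumption)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[ch] for w, v in zip(weights, tokens)) / total
            for ch in range(dim)
        ])
    return out

def regional_to_local(regional, local):
    """Pass 1: attention among regional tokens only.
    Pass 2: per region, attention over that region's regional token
    together with its own local tokens."""
    regional = self_attention(regional)
    new_regional, new_local = [], []
    for reg_tok, loc_toks in zip(regional, local):
        mixed = self_attention([reg_tok] + loc_toks)
        new_regional.append(mixed[0])
        new_local.append(mixed[1:])
    return new_regional, new_local

# Toy 8x8 image with 2 channels: channel 0 is the row index, channel 1 the column index.
image = [[[float(r), float(c)] for c in range(8)] for r in range(8)]
regional, local = tokenize(image, region_size=4, patch_size=2)
regional_out, local_out = regional_to_local(regional, local)
```

With the toy 8x8 input, 4x4 regions, and 2x2 local patches, this yields four regional tokens and four local tokens per region. Under the same assumptions, arranging the updated local tokens back on their spatial grid would give one candidate feature map for a downstream vision task.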