CPC G06V 10/96 (2022.01) [G06V 10/25 (2022.01); G06V 10/7715 (2022.01)] | 20 Claims |
1. A computer-implemented method comprising:
generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item;
generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item;
generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and
performing at least one vision task based on the at least one feature map.
|
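The steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation, and every name and parameter in it (`make_tokens`, `analyze_separately`, `build_feature_map`, `classify`, the region and patch sizes) is hypothetical. The sketch assumes that a regional token is the mean of the patch features in its region, that a local token is a flattened patch, that "separately" means one self-attention pass over the regional tokens and an independent pass over the local tokens within each region, and that the vision task is image classification with an untrained random linear head. Only a single stage is shown; a hierarchical model would presumably stack several such stages with downsampling between them.

```python
import numpy as np


def make_tokens(image, region=8, patch=4):
    """Split an H x W x C image into regional and local tokens.

    Returns (regional, local, grid):
      regional: (R, D)     one token per region (assumed: mean of its patches)
      local:    (R, P, D)  P flattened-patch tokens inside each region
      grid:     (rh, rw)   number of regions along height and width
    """
    image = np.asarray(image, dtype=float)
    h, w, c = image.shape
    if h % region or w % region or region % patch:
        raise ValueError("image must tile into regions, and regions into patches")
    rh, rw, pp = h // region, w // region, region // patch
    # (H, W, C) -> (R, region, region, C), regions in row-major order
    regions = (image.reshape(rh, region, rw, region, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(rh * rw, region, region, c))
    # (R, region, region, C) -> (R, P, patch*patch*C)
    local = (regions.reshape(rh * rw, pp, patch, pp, patch, c)
                    .transpose(0, 1, 3, 2, 4, 5)
                    .reshape(rh * rw, pp * pp, patch * patch * c))
    regional = local.mean(axis=1)
    return regional, local, (rh, rw)


def self_attention(x):
    """Single-head self-attention with identity projections; x is (..., N, D)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def analyze_separately(regional, local):
    """Attend over regional tokens globally and over local tokens per region."""
    return self_attention(regional), self_attention(local)


def build_feature_map(regional, local, grid):
    """Add each regional token to its local tokens and restore the 2-D layout."""
    rh, rw = grid
    pp = int(round(np.sqrt(local.shape[1])))
    fused = local + regional[:, None, :]
    return (fused.reshape(rh, rw, pp, pp, -1)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(rh * pp, rw * pp, -1))


def classify(feature_map, num_classes=3, seed=0):
    """Toy vision task: global average pooling plus a random linear head."""
    rng = np.random.default_rng(seed)
    head = rng.standard_normal((feature_map.shape[-1], num_classes))
    return int(np.argmax(feature_map.mean(axis=(0, 1)) @ head))
```

With a 16 x 16 x 3 input and the default sizes, `make_tokens` yields 4 regional tokens and 4 local tokens per region, each of dimension 48, and `build_feature_map` returns a 4 x 4 x 48 feature map that `classify` reduces to a class index.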