US 12,354,340 B2
Fully attentional computer vision
Jonathon Shlens, San Francisco, CA (US); Ashish Teku Vaswani, San Francisco, CA (US); Niki J. Parmar, Sunnyvale, CA (US); Prajit Ramachandran, Santa Clara, CA (US); Anselm Caelifer Levskaya, Oakland, CA (US); and Irwan Bello, San Francisco, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Appl. No. 17/606,976
Filed by Google LLC, Mountain View, CA (US)
PCT Filed May 22, 2020, PCT No. PCT/US2020/034324 § 371(c)(1), (2) Date Oct. 27, 2021, PCT Pub. No. WO2020/237188, PCT Pub. Date Nov. 26, 2020.
Claims priority of provisional application 62/852,277, filed on May 23, 2019.
Prior Publication US 2022/0215654 A1, Jul. 7, 2022
Int. Cl. G06V 10/82 (2022.01); G06N 3/04 (2023.01); G06T 9/00 (2006.01)

CPC G06V 10/82 (2022.01) [G06N 3/04 (2013.01); G06T 9/002 (2013.01)]

17 Claims

1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement:

a computer vision neural network, the computer vision neural network comprising a positional local self-attention layer configured to receive an input feature map and to generate an output feature map that characterizes features of the input feature map using both local content and positional information of the input feature map, wherein the positional local self-attention layer is configured to:

for each of a plurality of input elements in the input feature map, generate a respective output element for the output feature map, the generating comprising:

determining, for the input element, a plurality of neighboring input elements around the input element of the input feature map,

generating a query vector using the input element and a query weight matrix,

for each neighboring element, performing the following positional local self-attention operations:

generating a key vector using the neighboring element and a key weight matrix,

generating a positional value vector using the neighboring element and one or more positional value weight matrices, wherein the one or more positional value weight matrices represent the spatial distance between the input element and each of its neighboring input elements, and

generating a temporary output element using the query vector, the key vector, and the positional value vector, comprising:

generating a query-key product by taking a dot product of the query vector and the key vector,

generating a positional query-key product based on the query-key product,

generating an intermediate output by applying a softmax operation on the positional query-key product, and

generating the temporary output element by computing a product of the intermediate output and the positional value vector, and

generating the respective output element by summing temporary output elements of the neighboring elements.
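
The steps recited in claim 1 can be illustrated with a minimal sketch in plain Python. It is not the patented implementation, and several details are assumptions that the claim does not fix: the neighborhood is a square window of a hypothetical `radius` that includes the input element itself; out-of-bounds neighbors are skipped rather than padded; the "positional value weight matrices" are modeled as one value matrix per relative offset (`W_v_by_offset`); and the "positional query-key product" is modeled as the query-key product plus a dot product of the query with a relative-position embedding (`rel_emb`). All function and parameter names are illustrative.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def positional_local_self_attention(fmap, W_q, W_k, W_v_by_offset, rel_emb, radius=1):
    """fmap[i][j] is an input element (a list of d_in floats).

    W_q, W_k: d_out x d_in matrices.
    W_v_by_offset[(di, dj)]: d_out x d_in value matrix for that offset (assumption).
    rel_emb[(di, dj)]: length-d_out relative-position embedding (assumption).
    """
    height, width = len(fmap), len(fmap[0])
    d_out = len(W_q)
    out = [[None] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            q = matvec(W_q, fmap[i][j])              # query vector
            logits, values = [], []
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    a, b = i + di, j + dj
                    if not (0 <= a < height and 0 <= b < width):
                        continue                      # assumption: skip, no padding
                    x_ab = fmap[a][b]
                    k = matvec(W_k, x_ab)             # key vector
                    v = matvec(W_v_by_offset[(di, dj)], x_ab)  # positional value vector
                    qk = dot(q, k)                    # query-key product
                    # assumed form of the positional query-key product
                    logits.append(qk + dot(q, rel_emb[(di, dj)]))
                    values.append(v)
            weights = softmax(logits)                 # intermediate outputs
            # sum of the temporary output elements (weight * positional value vector)
            out[i][j] = [sum(w * v[c] for w, v in zip(weights, values))
                         for c in range(d_out)]
    return out


# Usage with small random weights (all shapes and values are illustrative).
rng = random.Random(0)
d_in, d_out = 3, 2
offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def rand_mat(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


demo_fmap = [[[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(4)]
             for _ in range(3)]
demo_out = positional_local_self_attention(
    demo_fmap,
    rand_mat(d_out, d_in),
    rand_mat(d_out, d_in),
    {o: rand_mat(d_out, d_in) for o in offsets},
    {o: [rng.uniform(-1, 1) for _ in range(d_out)] for o in offsets},
)
```

Because the softmax weights over a neighborhood sum to one, each output element is a convex combination of the positional value vectors of its neighbors; the output feature map here has the same spatial size as the input, with `d_out` channels per element.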