| CPC G06V 10/82 (2022.01) [G06N 3/04 (2013.01); G06T 9/002 (2013.01)] | 17 Claims |

|
1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement:
a computer vision neural network, the computer vision neural network comprising a positional local self-attention layer configured to receive an input feature map and to generate an output feature map that characterizes features of the input feature map using both local content and positional information of the input feature map, wherein the positional local self-attention layer is configured to:
for each of a plurality of input elements in the input feature map, generate a respective output element for the output feature map, the generating comprising:
determining, for the input element, a plurality of neighboring input elements around the input element of the input feature map,
generating a query vector using the input element and a query weight matrix,
for each neighboring element, performing the following positional local self-attention operations:
generating a key vector using the neighboring element and a key weight matrix,
generating a positional value vector using the neighboring element and one or more positional value weight matrices, wherein the one or more positional value weight matrices represent spatial distance between the input element and each of its neighboring input elements, and
generating a temporary output element using the query vector, the key vector, and the positional value vector, comprising:
generating a query-key product by taking a dot product of the query vector and the key vector,
generating a positional query-key product based on the query-key product,
generating an intermediate output by applying a softmax operation on the positional query-key product, and
generating the temporary output element by computing a product of the intermediate output and the positional value vector, and
generating the respective output element by summing temporary output elements of the neighboring elements.
|
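
The per-element operations recited in claim 1 can be sketched in NumPy as follows. This is an illustrative reading under stated assumptions, not the claimed implementation: the function name `positional_local_self_attention`, the square `k` x `k` neighborhood, the zero padding at the borders, the use of one value weight matrix per relative offset (`w_v_pos`), and the additive relative-position term `rel_emb @ q` used to form the positional query-key product are all hypothetical choices that the claim does not specify. The softmax is taken over the neighborhood of each input element.

```python
import numpy as np

def positional_local_self_attention(x, w_q, w_k, w_v_pos, rel_emb, k=3):
    """Sketch of one positional local self-attention layer.

    x:        input feature map, shape (H, W, C_in)
    w_q, w_k: query / key weight matrices, shape (C_in, C_out)
    w_v_pos:  assumed: one value weight matrix per relative offset,
              shape (k*k, C_in, C_out)
    rel_emb:  assumed: one relative-position embedding per offset,
              shape (k*k, C_out)
    """
    h, w, _ = x.shape
    pad = k // 2
    # Assumption: zero-pad so that border elements also have k*k neighbors.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, w_q.shape[1]))
    for i in range(h):
        for j in range(w):
            # Query vector from the input element and the query weight matrix.
            q = x[i, j] @ w_q
            # Neighboring input elements around (i, j), flattened to (k*k, C_in).
            neigh = xp[i:i + k, j:j + k].reshape(k * k, -1)
            # Key vector for each neighbor.
            keys = neigh @ w_k
            # Positional value vector: each neighbor uses the matrix for its offset.
            values = np.einsum('nc,ncd->nd', neigh, w_v_pos)
            # Query-key product, then the (assumed additive) positional variant.
            logits = keys @ q + rel_emb @ q
            # Softmax over the neighborhood, shifted for numerical stability.
            logits = logits - logits.max()
            attn = np.exp(logits)
            attn = attn / attn.sum()
            # Temporary output per neighbor, summed into the output element.
            temp = attn[:, None] * values
            out[i, j] = temp.sum(axis=0)
    return out
```

The explicit double loop mirrors the "for each input element" and "for each neighboring element" structure of the claim; a practical implementation would likely vectorize these steps.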