CPC G06V 10/761 (2022.01) [G06N 3/0455 (2023.01); G06N 3/08 (2013.01); G06T 7/174 (2017.01); G06V 10/26 (2022.01); G06V 10/28 (2022.01); G06V 10/34 (2022.01); G06V 10/443 (2022.01); G06V 10/751 (2022.01); G06V 10/7715 (2022.01); G06V 10/806 (2022.01); G06V 10/82 (2022.01); G06T 2207/20036 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/30176 (2013.01)]

20 Claims

1. A method of matching different depictions of objects in image data, the method comprising:
receiving first image data comprising a first representation of an object;
receiving second image data comprising a second representation of the object, wherein the second representation of the object includes a different view of the object in the second image data relative to the first image data;
processing, by an encoder network, the first image data and the second image data by interleaving signals from the first image data and the second image data to generate a first feature map representing the first image data and a second feature map representing the second image data;
concatenating the first feature map and the second feature map to generate a combined feature map, wherein the combined feature map spatially overlaps common features from the first feature map and the second feature map;
computing a set of correlation scores for the combined feature map;
determining, using the set of correlation scores, a co-salient region of the combined feature map;
generating, by inputting the combined feature map into a first segmentation head, a first segmentation mask representing foreground image data for the co-salient region detected in the first image data;
generating, by inputting the combined feature map into a second segmentation head, a second segmentation mask representing foreground image data for the co-salient region detected in the second image data;
filtering the first feature map using the first segmentation mask to generate first data comprising the first representation of the object as represented in the first feature map;
filtering the second feature map using the second segmentation mask to generate second data comprising the second representation of the object as represented in the second feature map;
comparing the first data and the second data using a cosine similarity metric to generate a similarity matrix; and
determining, using a first convolutional neural network and the similarity matrix, that the object represented in the first image data matches the object represented in the second image data.
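The pipeline recited in claim 1 (shared encoding of both images, correlation-based detection of a co-salient region, mask filtering of each feature map, and a cosine-similarity comparison feeding a match decision) can be sketched in simplified form. Everything below is illustrative and not the claimed implementation: thresholded max-correlation stands in for the two segmentation heads, a pooled-descriptor cosine score stands in for the per-location similarity matrix, and a fixed threshold stands in for the first convolutional neural network; the array shapes, names, and toy feature maps are assumptions.

```python
import numpy as np

H, W, C = 8, 8, 64  # illustrative feature-map height, width, channel depth
rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def correlation_scores(feat_a, feat_b):
    # Cosine correlation between every spatial location of the first
    # feature map and every spatial location of the second.
    a = l2_normalize(feat_a.reshape(-1, C))   # (H*W, C)
    b = l2_normalize(feat_b.reshape(-1, C))   # (H*W, C)
    return a @ b.T                            # (H*W, H*W) correlation scores

def co_salient_masks(feat_a, feat_b, threshold=0.5):
    # Stand-in for the two segmentation heads: a location is foreground
    # (co-salient) if it correlates strongly with *some* location in the
    # other image's feature map.
    corr = correlation_scores(feat_a, feat_b)
    mask_a = (corr.max(axis=1) > threshold).reshape(H, W)
    mask_b = (corr.max(axis=0) > threshold).reshape(H, W)
    return mask_a, mask_b

def masked_descriptor(feat, mask):
    # Filter the feature map with its segmentation mask, then pool the
    # surviving foreground features into a single unit vector.
    fg = feat[mask]                           # (n_foreground, C)
    return l2_normalize(fg.mean(axis=0)) if fg.size else np.zeros(C)

# Toy stand-in for the encoder outputs: a shared 3x3 "object" pattern
# embedded at different positions (different views) over faint noise.
obj = rng.normal(size=(3, 3, C))
feat_a = rng.normal(scale=0.1, size=(H, W, C))
feat_a[1:4, 1:4] = obj
feat_b = rng.normal(scale=0.1, size=(H, W, C))
feat_b[4:7, 4:7] = obj

mask_a, mask_b = co_salient_masks(feat_a, feat_b)
desc_a = masked_descriptor(feat_a, mask_a)
desc_b = masked_descriptor(feat_b, mask_b)
similarity = float(desc_a @ desc_b)  # cosine similarity of the filtered features
is_match = similarity > 0.9          # fixed threshold in place of the learned CNN
print(mask_a.sum(), mask_b.sum(), round(similarity, 3), is_match)
```

Because the same object pattern appears verbatim in both toy maps, its nine positions correlate at 1.0 across images and survive the masks in both views, so the filtered descriptors agree and the sketch reports a match; in the claimed method those final decisions are instead produced by trained segmentation heads and a convolutional neural network operating on the full similarity matrix.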