US 12,354,402 B2
Landmark detection using deep neural network with multi-frequency self-attention
Pankaj Wasnik, Bangalore (IN); Aman Shenoy, Bangalore (IN); Naoyuki Onoe, Bangalore (IN); and Janani Ramaswamy, Bangalore (IN)
Assigned to SONY GROUP CORPORATION, Tokyo (JP)
Filed by SONY GROUP CORPORATION, Tokyo (JP)
Filed on Jan. 6, 2022, as Appl. No. 17/569,778.
Claims priority of provisional application 63/211,127, filed on Jun. 16, 2021.
Prior Publication US 2022/0406091 A1, Dec. 22, 2022
Int. Cl. G06V 40/16 (2022.01); G06N 3/045 (2023.01); G06V 10/42 (2022.01); G06V 10/44 (2022.01); G06V 10/82 (2022.01)

CPC G06V 40/169 (2022.01) [G06N 3/045 (2023.01); G06V 10/42 (2022.01); G06V 10/44 (2022.01); G06V 10/82 (2022.01); G06V 40/171 (2022.01)]

20 Claims

14. A method, comprising:

receiving, by an encoder network, a first input comprising an image of an object of interest;

generating, by the encoder network, multi-frequency feature maps as an output of a final layer of the encoder network based on the received first input;

receiving, by an attention layer coupled to the final layer of the encoder network, the multi-frequency feature maps from the final layer of the encoder network;

obtaining, by the attention layer, a plurality of embeddings from the multi-frequency feature maps;

changing, by the attention layer, a size of each embedding of the plurality of embeddings to a particular size;

calculating, by the attention layer, based on each embedding of the plurality of embeddings having the particular size, a similarity matrix that captures dependencies between the multi-frequency feature maps;

refining, by the attention layer, the multi-frequency feature maps based on the calculated similarity matrix;

receiving, by a decoder network coupled to the encoder network via the attention layer, the refined multi-frequency feature maps as a second input from the attention layer; and

generating, by the decoder network, a landmark detection result comprising a heatmap image of the object of interest based on the second input,

wherein the heatmap image indicates locations of landmark points on the object of interest in the image.
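
The attention-layer steps of claim 14 (obtain embeddings, bring them to a common size, compute a similarity matrix, refine the feature maps) can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the patented implementation. The claim does not say how many frequency bands there are, how the embeddings are obtained, or how the similarity matrix is applied. The sketch assumes two bands (a high-frequency map at full resolution and a low-frequency map at half resolution, with equal channel counts), 1x1 projections with random weights as the embeddings, average pooling as the resizing step, and a residual update as the refinement. All function names, `w_high`, and `w_low` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool2x(x):
    """Halve the spatial size of a (C, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour upsampling of a (C, H, W) array by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def refine_multi_frequency(high, low, w_high, w_low):
    """Hypothetical attention step over two frequency bands.

    high: (C, H, W) high-frequency map; low: (C, H/2, W/2) low-frequency map.
    w_high, w_low: (C, C) projection weights (assumed, standing in for
    learned 1x1 convolutions).
    """
    c = high.shape[0]

    # Obtain one embedding per frequency band (assumed 1x1 projections).
    emb_high = np.einsum("oc,chw->ohw", w_high, high)
    emb_low = np.einsum("oc,chw->ohw", w_low, low)

    # Change each embedding to one particular size: (C, N), N = (H/2)*(W/2).
    emb_high = avg_pool2x(emb_high).reshape(c, -1)
    emb_low = emb_low.reshape(c, -1)

    # Similarity matrix between high- and low-frequency channels.
    scores = emb_high @ emb_low.T / np.sqrt(emb_low.shape[1])
    sim = softmax(scores, axis=1)        # high channels attend over low
    sim_t = softmax(scores.T, axis=1)    # low channels attend over high

    # Refine each band with information from the other (residual update).
    low_flat = low.reshape(c, -1)
    high_flat = avg_pool2x(high).reshape(c, -1)
    high_ref = high + upsample2x((sim @ low_flat).reshape(low.shape))
    low_ref = low + (sim_t @ high_flat).reshape(low.shape)
    return high_ref, low_ref, sim

# Example with random stand-ins for the encoder's final-layer output.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
high = rng.standard_normal((C, H, W))
low = rng.standard_normal((C, H // 2, W // 2))
w_high = rng.standard_normal((C, C)) / np.sqrt(C)
w_low = rng.standard_normal((C, C)) / np.sqrt(C)
high_ref, low_ref, sim = refine_multi_frequency(high, low, w_high, w_low)
```

In this sketch the refined maps keep the shapes of the inputs, so a decoder that expected the encoder's multi-frequency output could take them as the second input without any change to its interface.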