US 12,354,402 B2
Landmark detection using deep neural network with multi-frequency self-attention
Pankaj Wasnik, Bangalore (IN); Aman Shenoy, Bangalore (IN); Naoyuki Onoe, Bangalore (IN); and Janani Ramaswamy, Bangalore (IN)
Assigned to SONY GROUP CORPORATION, Tokyo (JP)
Filed by SONY GROUP CORPORATION, Tokyo (JP)
Filed on Jan. 6, 2022, as Appl. No. 17/569,778.
Claims priority of provisional application 63/211,127, filed on Jun. 16, 2021.
Prior Publication US 2022/0406091 A1, Dec. 22, 2022
Int. Cl. G06V 40/16 (2022.01); G06N 3/045 (2023.01); G06V 10/42 (2022.01); G06V 10/44 (2022.01); G06V 10/82 (2022.01)
CPC G06V 40/169 (2022.01) [G06N 3/045 (2023.01); G06V 10/42 (2022.01); G06V 10/44 (2022.01); G06V 10/82 (2022.01); G06V 40/171 (2022.01)]
20 Claims
 
14. A method, comprising:
receiving, by an encoder network, a first input comprising an image of an object of interest;
generating, by the encoder network, multi-frequency feature maps as an output of a final layer of the encoder network based on the received first input;
receiving, by an attention layer coupled to the final layer of the encoder network, the multi-frequency feature maps from the final layer of the encoder network;
obtaining, by the attention layer, a plurality of embeddings from the multi-frequency feature maps;
changing, by the attention layer, a size of each embedding of the plurality of embeddings to a particular size;
calculating, by the attention layer, based on each embedding of the plurality of embeddings having the particular size, a similarity matrix that captures dependencies between the multi-frequency feature maps;
refining, by the attention layer, the multi-frequency feature maps based on the calculated similarity matrix;
receiving, by a decoder network coupled to the encoder network via the attention layer, the refined multi-frequency feature maps as a second input from the attention layer; and
generating, by the decoder network, a landmark detection result comprising a heatmap image of the object of interest based on the second input,
wherein the heatmap image indicates locations of landmark points on the object of interest in the image.
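
The attention-layer steps recited in claim 14 (obtaining embeddings, changing each to a particular size, calculating a similarity matrix, and refining) can be illustrated with a minimal sketch. This is not the patented implementation. The claim does not specify how embeddings are resized, how similarity is computed, or how refinement is applied, so the average-pooling resize, the scaled dot-product softmax similarity, and the residual refinement below are all assumptions, as are the function names and the toy high- and low-frequency inputs. For simplicity the sketch refines the resized embeddings as stand-ins for the feature maps themselves.

```python
import math


def resize_embedding(values, size):
    """Bring a flat embedding to `size` entries by average pooling
    (assumed scheme; shorter inputs are upsampled by repetition)."""
    n = len(values)
    out = []
    for i in range(size):
        start = (i * n) // size
        end = max(((i + 1) * n) // size, start + 1)
        chunk = values[start:end]
        out.append(sum(chunk) / len(chunk))
    return out


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_matrix(embeddings):
    """Scaled dot-product similarity between every pair of embeddings
    (assumed measure), normalised so that each row sums to 1."""
    scale = math.sqrt(len(embeddings[0]))
    scores = [
        [sum(a * b for a, b in zip(q, k)) / scale for k in embeddings]
        for q in embeddings
    ]
    return [softmax(row) for row in scores]


def refine(embeddings, sim):
    """Mix the embeddings with the similarity weights and add the result
    back to each original embedding (assumed residual refinement)."""
    count = len(embeddings)
    refined = []
    for i, emb in enumerate(embeddings):
        mixed = [
            sum(sim[i][j] * embeddings[j][t] for j in range(count))
            for t in range(len(emb))
        ]
        refined.append([x + y for x, y in zip(emb, mixed)])
    return refined


# Hypothetical encoder output: a flattened 4x4 high-frequency map and a
# flattened 2x2 low-frequency map, so the two embeddings differ in size.
high_freq = [0.1 * i for i in range(16)]
low_freq = [0.5, -0.2, 0.3, 0.9]

particular_size = 4  # the common size, chosen arbitrarily for this sketch
embeddings = [resize_embedding(m, particular_size) for m in (high_freq, low_freq)]
sim = similarity_matrix(embeddings)
refined = refine(embeddings, sim)
```

Here `sim` is a 2 x 2 matrix whose entry (i, j) weights how strongly feature map j contributes to the refinement of feature map i. In the claimed method the refined maps would then be passed to the decoder network to produce the landmark heatmap, a step this sketch omits.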