| CPC G06V 40/169 (2022.01) [G06N 3/045 (2023.01); G06V 10/42 (2022.01); G06V 10/44 (2022.01); G06V 10/82 (2022.01); G06V 40/171 (2022.01)] | 20 Claims |

|
14. A method, comprising:
receiving, by an encoder network, a first input comprising an image of an object of interest;
generating, by the encoder network, multi-frequency feature maps as an output of a final layer of the encoder network based on the received first input;
receiving, by an attention layer coupled to the final layer of the encoder network, the multi-frequency feature maps from the final layer of the encoder network;
obtaining, by the attention layer, a plurality of embeddings from the multi-frequency feature maps;
changing, by the attention layer, a size of each embedding of the plurality of embeddings to a particular size;
calculating, by the attention layer, based on each embedding of the plurality of embeddings having the particular size, a similarity matrix that captures dependencies between the multi-frequency feature maps;
refining, by the attention layer, the multi-frequency feature maps based on the calculated similarity matrix;
receiving, by a decoder network coupled to the encoder network via the attention layer, the refined multi-frequency feature maps as a second input from the attention layer; and
generating, by the decoder network, a landmark detection result comprising a heatmap image of the object of interest based on the second input,
wherein the heatmap image indicates locations of landmark points on the object of interest in the image.
|
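
The attention-layer steps recited in claim 14 (obtaining embeddings, changing each to a particular size, calculating a similarity matrix, and refining the feature maps) can be sketched as below. This is a minimal pure-Python illustration, not the claimed implementation. The claim does not specify any of the following, so each is an assumption: all feature maps share one spatial size, an embedding is a flattened map, resizing is done by average pooling, the similarity matrix is a row-wise softmax of scaled dot products, and refinement is a similarity-weighted mix of the maps added back to the input. The names `resize_embedding`, `softmax`, `refine_feature_maps`, and `embed_size` are hypothetical.

```python
import math


def resize_embedding(vec, size):
    """Resample a 1-D embedding to `size` entries by average pooling.

    Assumption: the claim only says the size is changed to "a particular
    size"; average pooling over equal bins is one possible choice.
    """
    n = len(vec)
    out = []
    for i in range(size):
        lo = i * n // size
        hi = max(lo + 1, (i + 1) * n // size)
        chunk = vec[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def refine_feature_maps(feature_maps, embed_size=4):
    """Refine C feature maps (each H x W, as nested lists) with a C x C similarity matrix.

    Returns (similarity_matrix, refined_maps).
    """
    # Obtain one embedding per feature map by flattening it (assumed).
    flat = [[v for row in fmap for v in row] for fmap in feature_maps]

    # Change every embedding to the same particular size.
    emb = [resize_embedding(f, embed_size) for f in flat]

    # Similarity matrix: softmax of scaled dot products between embeddings,
    # so entry (i, j) reflects how strongly map i depends on map j (assumed form).
    scale = math.sqrt(embed_size)
    sim = [
        softmax([sum(a * b for a, b in zip(ei, ej)) / scale for ej in emb])
        for ei in emb
    ]

    # Refine: mix the maps using the similarity weights, then add the
    # original map back as a residual connection (assumed).
    width = len(feature_maps[0][0])
    refined = []
    for i, f in enumerate(flat):
        mixed = [
            sum(sim[i][j] * flat[j][k] for j in range(len(flat)))
            for k in range(len(f))
        ]
        out = [x + y for x, y in zip(f, mixed)]
        # Restore the H x W layout expected by a downstream decoder.
        refined.append([out[k:k + width] for k in range(0, len(out), width)])
    return sim, refined


if __name__ == "__main__":
    maps = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 0.0]],
        [[1.0, 1.0], [1.0, 1.0]],
    ]
    sim, refined = refine_feature_maps(maps, embed_size=2)
    print(len(sim), len(refined))
```

In this sketch the refined maps keep the shape of the inputs, which is what lets them pass to a decoder as the "second input" of the claim. Multi-frequency maps with differing spatial sizes would need an extra resampling step before mixing, which is not shown here.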