| CPC G06T 17/20 (2013.01) [G06F 3/017 (2013.01); G06V 40/28 (2022.01)] | 20 Claims |

|
1. A computing system comprising:
a processing unit configured to execute instructions to cause the computing system to estimate a set of 3D keypoints representing a 3D hand pose by:
processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints;
concatenating information from the global feature vector and the heatmap to obtain a set of input tokens;
processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view;
inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second 2D view; and
aggregating the first and second sets of 2D keypoints to output the set of estimated 3D keypoints.
|
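
As an illustration only, the two data-handling steps of claim 1 (forming the input tokens and aggregating the two 2D views) can be sketched as follows. The claim does not fix a token layout or an aggregation rule, so everything here is an assumption: the hypothetical helpers `build_input_tokens` and `aggregate_views`, the choice of one token per keypoint, and the reading of the first view as an (x, y) projection and the second as a (z, y) projection whose shared y coordinate is averaged. The U-net, transformer encoder, and transformer decoder are omitted.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def build_input_tokens(global_feature: List[float],
                       heatmaps: List[List[List[float]]]) -> List[List[float]]:
    """Assumed layout: one token per keypoint, formed by concatenating the
    global feature vector with that keypoint's flattened heatmap."""
    tokens = []
    for heatmap in heatmaps:
        flat = [value for row in heatmap for value in row]
        tokens.append(list(global_feature) + flat)
    return tokens


def aggregate_views(view_xy: List[Point2D],
                    view_zy: List[Point2D]) -> List[Point3D]:
    """Assumed rule: the first view supplies (x, y), the second supplies
    (z, y), and the y coordinate shared by both views is averaged."""
    if len(view_xy) != len(view_zy):
        raise ValueError("both views must contain the same number of keypoints")
    keypoints_3d = []
    for (x, y_first), (z, y_second) in zip(view_xy, view_zy):
        keypoints_3d.append((x, (y_first + y_second) / 2.0, z))
    return keypoints_3d


if __name__ == "__main__":
    # Two keypoints, a 2-element global feature, and 2x2 heatmaps (toy sizes).
    tokens = build_input_tokens(
        [0.5, 0.25],
        [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]],
    )
    print(len(tokens), len(tokens[0]))
    print(aggregate_views([(1.0, 2.0)], [(3.0, 4.0)]))
```

In this sketch each token has length equal to the global feature length plus the number of heatmap pixels. Other readings of the claim, such as encoding each heatmap before concatenation or using a learned aggregation, would fit the claim language equally well.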