US 11,967,103 B2
Multi-modal 3-D pose estimation
Jingxiao Zheng, San Jose, CA (US); Xinwei Shi, Cupertino, CA (US); Alexander Gorban, Scotts Valley, CA (US); Junhua Mao, Palo Alto, CA (US); Andre Liang Cornman, San Francisco, CA (US); Yang Song, San Jose, CA (US); Ting Liu, Los Angeles, CA (US); Ruizhongtai Qi, Mountain View, CA (US); Yin Zhou, San Jose, CA (US); Congcong Li, Cupertino, CA (US); and Dragomir Anguelov, San Francisco, CA (US)
Assigned to Waymo LLC, Mountain View, CA (US)
Filed by Waymo LLC, Mountain View, CA (US)
Filed on Oct. 20, 2021, as Appl. No. 17/505,900.
Claims priority of provisional application 63/114,448, filed on Nov. 16, 2020.
Prior Publication US 2022/0156965 A1, May 19, 2022
Int. Cl. G06T 7/73 (2017.01); G06F 18/214 (2023.01); G06F 18/25 (2023.01); G06V 20/58 (2022.01)
CPC G06T 7/73 (2017.01) [G06F 18/214 (2023.01); G06F 18/251 (2023.01); G06V 20/58 (2022.01); G06T 2207/10028 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30196 (2013.01); G06T 2207/30261 (2013.01)] 21 Claims
OG exemplary drawing
 
1. A method comprising:
  obtaining an image of an environment;
  obtaining a point cloud of a three-dimensional region of the environment;
  generating a fused representation of the image and the point cloud, comprising:
    processing the image to generate, for each of a plurality of keypoints, a respective score for each of a plurality of locations in the image, wherein the respective score represents a likelihood that the keypoint is located at the location in the image;
    determining, for each of a plurality of data points in the point cloud, a corresponding location in the image that corresponds to the data point;
    generating, for each of the plurality of data points in the point cloud, a respective feature vector, wherein the respective feature vector for the data point includes, for each of the plurality of keypoints, the respective score for the corresponding location in the image that corresponds to the data point; and
    generating the fused representation from the respective feature vectors; and
  processing the fused representation using a pose estimation neural network to generate a pose estimation network output that specifies, for each of the plurality of keypoints, a respective estimated position in the three-dimensional region of the environment.
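
For readers tracing the claim, the fusion step can be illustrated concretely: each data point in the cloud is projected to its corresponding image location, and the per-keypoint scores at that location become the point's feature vector. The NumPy sketch below is an illustrative reconstruction, not the patented implementation; the pinhole camera model, the nearest-pixel sampling, and the names project_to_image and fuse_point_cloud_with_heatmaps are assumptions, since the claim does not specify how the image-to-point correspondence is computed.

    import numpy as np

    def project_to_image(points_xyz, intrinsics, extrinsics):
        """Project 3-D points (N, 3) to pixel coordinates (N, 2).

        Hypothetical helper assuming a pinhole camera: extrinsics is a 4x4
        world-to-camera transform and intrinsics is the 3x3 camera matrix.
        Points behind the camera are not handled here.
        """
        ones = np.ones((points_xyz.shape[0], 1))
        homo = np.concatenate([points_xyz, ones], axis=1)   # (N, 4) homogeneous
        cam = (extrinsics @ homo.T).T[:, :3]                # camera-frame coords
        uvw = (intrinsics @ cam.T).T                        # perspective projection
        return uvw[:, :2] / uvw[:, 2:3]                     # divide by depth

    def fuse_point_cloud_with_heatmaps(points_xyz, heatmaps, intrinsics, extrinsics):
        """Build the per-point feature vectors described in claim 1.

        heatmaps has shape (K, H, W): heatmaps[k, y, x] is the score that
        keypoint k is located at pixel (x, y). Returns an (N, K) array
        holding, for each data point, the score of every keypoint at the
        image location that corresponds to that point.
        """
        _, height, width = heatmaps.shape
        uv = project_to_image(points_xyz, intrinsics, extrinsics)
        # Nearest-pixel sampling, clamped to the image bounds (an assumption;
        # the claim only requires a corresponding location per data point).
        x = np.clip(np.round(uv[:, 0]).astype(int), 0, width - 1)
        y = np.clip(np.round(uv[:, 1]).astype(int), 0, height - 1)
        # Gather, for every point, the score of each keypoint at that pixel.
        return heatmaps[:, y, x].T                          # (N, K) fused features

The resulting (N, K) matrix of per-point keypoint scores is the fused representation that the claim's final step feeds to the pose estimation neural network, which outputs one estimated 3-D position per keypoint. Because the claim says each feature vector "includes" the scores, the vectors may also carry additional per-point features, such as the point's own 3-D coordinates.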