US 12,299,916 B2
Three-dimensional location prediction from images
Longlong Jing, Mountain View, CA (US); Ruichi Yu, Mountain View, CA (US); Jiyang Gao, Foster City, CA (US); Henrik Kretzschmar, Mountain View, CA (US); Kang Li, Sammamish, WA (US); Ruizhongtai Qi, Mountain View, CA (US); Hang Zhao, Sunnyvale, CA (US); Alper Ayvaci, Santa Clara, CA (US); Xu Chen, Livermore, CA (US); Dillon Cower, Woodinville, WA (US); and Congcong Li, Cupertino, CA (US)
Assigned to Waymo LLC, Mountain View, CA (US)
Filed by Waymo LLC, Mountain View, CA (US)
Filed on Dec. 8, 2021, as Appl. No. 17/545,987.
Claims priority of provisional application 63/122,899, filed on Dec. 8, 2020.
Prior Publication US 2022/0180549 A1, Jun. 9, 2022
Int. Cl. G06T 7/50 (2017.01); G06T 7/70 (2017.01); G06V 10/40 (2022.01); G06V 10/80 (2022.01); G06N 20/00 (2019.01)

CPC G06T 7/70 (2017.01) [G06T 7/50 (2017.01); G06V 10/40 (2022.01); G06V 10/806 (2022.01); G06N 20/00 (2019.01); G06T 2207/10016 (2013.01); G06T 2207/20084 (2013.01)]

20 Claims

1. A method performed by one or more computers, the method comprising:

obtaining a temporal sequence of images that comprises, at each of a plurality of time steps, a respective image that was captured by a camera at the time step;

generating, for each image in the temporal sequence, respective pseudo-lidar features of a respective pseudo-lidar representation of a region in the image that has been determined to depict a first object by processing the region in the image using a first neural network, wherein the pseudo-lidar features represent one or more pixels within the region in the image as a point in a three-dimensional coordinate system based on an initial depth estimate for the image;

generating, for a particular image at a particular time step in the temporal sequence, image patch features of the region in the particular image that has been determined to depict the first object by processing the region in the particular image using a second neural network, wherein the image patch features are generated from intensity values of pixels in the image; and

generating, from the respective pseudo-lidar features and the image patch features, a prediction that characterizes a location of the first object in the three-dimensional coordinate system at the particular time step in the temporal sequence by processing the respective pseudo-lidar features and the image patch features using a third neural network, wherein generating, from the respective pseudo-lidar features and the image patch features, a prediction that characterizes the first object at the particular time step in the temporal sequence comprises:

combining the respective pseudo-lidar features that represent one or more pixels within the region in the image as a point in the three-dimensional coordinate system based on the initial depth estimate for the image and the image patch features to generate combined features; and

processing the combined features using the third neural network to generate the prediction.
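
For illustration only, the pseudo-lidar and feature-combining steps recited in claim 1 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not specify a camera model or how features are combined, so the pinhole intrinsics (fx, fy, cx, cy), the box format (u0, v0, u1, v1), the function names, and the use of concatenation are all hypothetical choices made here. The three neural networks are omitted; the feature vectors passed to combine_features stand in for their outputs.

```python
import numpy as np


def backproject_region(depth, box, fx, fy, cx, cy):
    """Lift each pixel inside a region to a 3D point (pseudo-lidar).

    Assumes a pinhole camera model, which the claim does not specify.
    depth: (H, W) array holding the initial depth estimate for the image.
    box:   (u0, v0, u1, v1) pixel bounds of the region, end-exclusive.
    Returns an (N, 3) array of points in the camera coordinate system.
    """
    u0, v0, u1, v1 = box
    # Pixel row (v) and column (u) coordinates for every pixel in the region.
    vs, us = np.mgrid[v0:v1, u0:u1]
    z = depth[v0:v1, u0:u1]
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)


def combine_features(per_frame_pseudo_lidar_feats, image_patch_feats):
    """Combine per-frame pseudo-lidar features with image patch features.

    Concatenation is an assumed combining operation; the claim only
    requires that the features be combined before the third network.
    """
    temporal = np.concatenate(per_frame_pseudo_lidar_feats)
    return np.concatenate([temporal, image_patch_feats])


# Toy example: a constant 10 m depth map and a 2x2 pixel region.
depth = np.full((4, 6), 10.0)
points = backproject_region(depth, (2, 1, 4, 3), fx=2.0, fy=2.0, cx=3.0, cy=2.0)

# Placeholder feature vectors standing in for the outputs of the first
# network (one vector per time step) and the second network.
combined = combine_features([np.ones(3), np.zeros(3)], np.arange(2.0))
```

In this sketch, combined is the vector that would be passed to the third neural network to produce the location prediction for the particular time step.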