US 11,941,875 B2
Processing perspective view range images using neural networks
Yuning Chai, San Mateo, CA (US); Pei Sun, Palo Alto, CA (US); Jiquan Ngiam, Mountain View, CA (US); Weiyue Wang, Sunnyvale, CA (US); Vijay Vasudevan, Los Altos Hills, CA (US); Benjamin James Caine, San Francisco, CA (US); Xiao Zhang, San Jose, CA (US); and Dragomir Anguelov, San Francisco, CA (US)
Assigned to Waymo LLC, Mountain View, CA (US)
Filed by Waymo LLC, Mountain View, CA (US)
Filed on Jul. 27, 2021, as Appl. No. 17/443,674.
Claims priority of provisional application 63/057,210, filed on Jul. 27, 2020.
Prior Publication US 2022/0044068 A1, Feb. 10, 2022
Int. Cl. G06V 20/00 (2022.01); G01S 7/48 (2006.01); G01S 17/89 (2020.01); G06F 18/21 (2023.01); G06F 18/213 (2023.01); G06F 18/25 (2023.01); G06N 3/08 (2023.01); G06T 7/70 (2017.01); G06V 10/94 (2022.01); H04N 23/10 (2023.01)

CPC G06V 20/00 (2022.01) [G01S 7/4802 (2013.01); G01S 17/89 (2013.01); G06F 18/213 (2023.01); G06F 18/217 (2023.01); G06F 18/253 (2023.01); G06N 3/08 (2013.01); G06T 3/4046 (2013.01); G06T 7/70 (2017.01); G06V 10/95 (2022.01); H04N 23/10 (2023.01); G06T 2207/20084 (2013.01)]

20 Claims

1. A method performed by one or more computers, the method comprising:

obtaining a perspective view range image generated from sensor measurements of an environment by one or more sensors, the perspective view range image comprising a plurality of pixels arranged in a two-dimensional grid and including, for each pixel, (i) features of one or more sensor measurements at a location in the environment corresponding to the pixel and (ii) geometry information comprising range features characterizing a range of the location in the environment corresponding to the pixel relative to the one or more sensors;

processing the perspective view range image using a first neural network to generate an output feature representation, wherein the first neural network comprises a first perspective point-set aggregation layer configured to:

receive an input feature map, the input feature map comprising a respective feature vector for each of a first subset of the pixels; and

generate an output feature map from the input feature map, wherein the output feature map comprises a respective output feature vector for each of the first subset of pixels, and wherein the generating comprises, for each particular pixel in the first subset, generating an initial output feature vector for the particular pixel by applying a geometry-dependent kernel to pixels within a local neighborhood of the particular pixel in the input feature map, wherein the geometry-dependent kernel depends on at least (i) respective input feature vectors for the pixels within the local neighborhood of the particular pixel in the input feature map and (ii) respective range features of the pixels within the local neighborhood of the input feature map; and

processing the output feature representation using an output neural network to generate a network output for a neural network task.
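
As an illustration only, the aggregation step recited in claim 1 can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the claimed or patented implementation. The function name range_weighted_aggregation, the Gaussian weighting on range difference, the radius and sigma parameters, and the boolean valid mask (standing in for the "first subset of the pixels") are all hypothetical choices made for the example. The claim requires only that the kernel depend on the neighborhood's input feature vectors and range features. It does not specify this particular form.

```python
import numpy as np


def range_weighted_aggregation(features, ranges, valid, radius=1, sigma=0.5):
    """Hypothetical geometry-dependent kernel over a perspective view range image.

    features: (H, W, C) input feature map, one feature vector per pixel.
    ranges:   (H, W) range of each pixel's location relative to the sensor.
    valid:    (H, W) bool mask selecting the subset of pixels that carry features.
    Returns an (H, W, C) output feature map; pixels outside the subset stay zero.
    """
    H, W, C = features.shape
    out = np.zeros((H, W, C), dtype=np.float64)
    for i in range(H):
        for j in range(W):
            if not valid[i, j]:
                continue
            # Local neighborhood in the 2D grid, clipped at the image border.
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            nb_feat = features[i0:i1, j0:j1].reshape(-1, C)
            nb_range = ranges[i0:i1, j0:j1].reshape(-1)
            nb_valid = valid[i0:i1, j0:j1].reshape(-1)
            # Assumed kernel: Gaussian in range difference, so neighbors that
            # are adjacent in the image but far apart in depth contribute little.
            diff = nb_range - ranges[i, j]
            w = np.exp(-(diff ** 2) / (2.0 * sigma ** 2)) * nb_valid
            # The center pixel is valid and has weight 1, so w.sum() > 0.
            out[i, j] = (w[:, None] * nb_feat).sum(axis=0) / w.sum()
    return out


# Toy range image: a near object (range 5, feature 1) next to far background
# (range 50, feature 0).
features = np.zeros((3, 3, 1))
features[:, :2, 0] = 1.0
ranges = np.full((3, 3), 50.0)
ranges[:, :2] = 5.0
valid = np.ones((3, 3), dtype=bool)
out = range_weighted_aggregation(features, ranges, valid)
```

In this toy input, a pixel on the object boundary aggregates only from neighbors at a similar range, so object and background features do not blend the way they would under a plain 3x3 box filter. A learned layer would replace the fixed Gaussian with trainable parameters. That part is outside the scope of this sketch.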