US 12,266,190 B2
Object identification in bird's-eye view reference frame with explicit depth estimation co-training
Albert Zhao, Saratoga, CA (US); Vasiliy Igorevich Karasev, San Francisco, CA (US); Hang Yan, Sunnyvale, CA (US); Daniel Rudolf Maurer, Mountain View, CA (US); Alper Ayvaci, San Jose, CA (US); and Yu-Han Chen, Santa Clara, CA (US)
Assigned to Waymo LLC, Mountain View, CA (US)
Filed by Waymo LLC, Mountain View, CA (US)
Filed on Aug. 9, 2022, as Appl. No. 17/884,356.
Prior Publication US 2024/0096105 A1, Mar. 21, 2024
Int. Cl. G06V 20/58 (2022.01); G06T 7/55 (2017.01); G06V 10/44 (2022.01); G06V 10/82 (2022.01); G06T 3/4046 (2024.01); G06V 10/40 (2022.01); G06V 10/70 (2022.01); G06V 20/69 (2022.01); G06V 30/18 (2022.01)
CPC G06V 20/58 (2022.01) [G06T 7/55 (2017.01); G06V 10/44 (2022.01); G06V 10/82 (2022.01); G06T 3/4046 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30252 (2013.01); G06V 10/40 (2022.01); G06V 10/70 (2022.01); G06V 20/698 (2022.01); G06V 30/18086 (2022.01)] 20 Claims
OG exemplary drawing
 
1. A method comprising:
obtaining one or more perspective camera images of an environment;
generating, using a first neural network (NN), for each pixel of a set of pixels of the one or more perspective camera images,
a feature vector (FV), and
a depth distribution for a portion of the environment imaged by a corresponding pixel, wherein the first NN is trained using a plurality of training images and depth ground truth data for the plurality of training images;
obtaining, for each pixel of the set of pixels, a feature tensor (FT) in view of (i) the FV for a respective pixel and (ii) the depth distribution for the respective pixel; and
processing the obtained FTs, using a second NN, to identify one or more objects in the environment.
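For illustration only, below is a minimal PyTorch sketch of one plausible reading of claim 1. All names, layer sizes, the outer-product construction of the per-pixel feature tensor, and the depth co-training loss are assumptions introduced here, not the patent's disclosed implementation; the claim does not specify an architecture, and the second NN (e.g., a bird's-eye-view detection head consuming the feature tensors) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiftBackbone(nn.Module):
    """First NN (hypothetical architecture): predicts, for each pixel,
    a feature vector (FV) and a distribution over D discrete depth bins."""
    def __init__(self, in_ch=3, feat_ch=16, depth_bins=8):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        # Two heads share the trunk: per-pixel features and depth logits.
        self.feat_head = nn.Conv2d(32, feat_ch, 1)
        self.depth_head = nn.Conv2d(32, depth_bins, 1)

    def forward(self, img):
        x = self.trunk(img)
        feats = self.feat_head(x)                     # (B, C, H, W) feature vectors
        depth = F.softmax(self.depth_head(x), dim=1)  # (B, D, H, W) depth distributions
        return feats, depth

def lift_to_feature_tensors(feats, depth):
    """Per-pixel outer product of the FV and the depth distribution,
    yielding a (D, C) feature tensor (FT) per pixel -- one plausible
    reading of obtaining the FT 'in view of' the FV and the depth
    distribution."""
    return torch.einsum('bchw,bdhw->bdchw', feats, depth)

def depth_cotraining_loss(depth_probs, gt_bins):
    """Explicit depth supervision per the claim's training limitation:
    negative log-likelihood of the ground-truth depth bin (e.g., binned
    lidar returns) under the predicted per-pixel distribution."""
    return F.nll_loss(torch.log(depth_probs.clamp_min(1e-8)), gt_bins)

# Toy usage with hypothetical shapes.
net = LiftBackbone()
img = torch.randn(2, 3, 32, 64)              # perspective camera images
feats, depth = net(img)
fts = lift_to_feature_tensors(feats, depth)  # (2, 8, 16, 32, 64)
gt = torch.randint(0, 8, (2, 32, 64))        # binned depth ground truth
loss = depth_cotraining_loss(depth, gt)
```

In a full system, the per-pixel feature tensors would typically be pooled along camera rays into a bird's-eye-view grid before the second NN identifies objects; the depth loss above reflects only the claim's limitation that the first NN is trained against depth ground truth, co-trained alongside the object-identification objective.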