US 12,217,519 B2
Systems and methods for uncertainty aware monocular 3D object detection
Rares Andrei Ambrus, San Francisco, CA (US); Or Litany, Sunnyvale, CA (US); Vitor Guizilini, Santa Clara, CA (US); Leonidas Guibas, Palo Alto, CA (US); Adrien David Gaidon, Mountain View, CA (US); and Jie Li, San Jose, CA (US)
Assigned to TOYOTA RESEARCH INSTITUTE, INC., Los Altos, CA (US); and THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIVERSITY, Stanford, CA (US)
Filed by TOYOTA RESEARCH INSTITUTE, INC., Los Altos, CA (US); and THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIVERSITY, Stanford, CA (US)
Filed on Dec. 6, 2021, as Appl. No. 17/543,144.
Prior Publication US 2023/0177850 A1, Jun. 8, 2023
Int. Cl. G06N 3/08 (2023.01); G06T 7/20 (2017.01); G06V 20/56 (2022.01); G06V 20/64 (2022.01)
CPC G06V 20/64 (2022.01) [G06N 3/08 (2013.01); G06T 7/20 (2013.01); G06V 20/56 (2022.01); G06T 2207/30241 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method for uncertainty aware 3D object detection, comprising:
predicting, using a trained monocular depth network, an estimated monocular input depth map of a monocular image of a video stream and an estimated depth uncertainty map associated with the estimated monocular input depth map;
feeding back a depth uncertainty regression loss associated with the estimated monocular input depth map and a ground truth depth map during training of the trained monocular depth network to update the estimated monocular input depth map to form an output monocular depth map;
updating the output monocular depth map using a vote regression loss from a 3D object detection network based on an aggregated depth uncertainty map corresponding to the estimated depth uncertainty map and the output monocular depth map;
detecting, by the 3D object detection network, 3D objects from a 3D point cloud computed from the updated output monocular depth map based on seed positions selected from the 3D point cloud and the aggregated depth uncertainty map; and
selecting, by the 3D object detection network, 3D bounding boxes of the 3D objects detected from the 3D point cloud based on the seed positions and refined, predicted votes based on the aggregated depth uncertainty map.
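Claim 1 describes a concrete pipeline: a monocular depth network that also predicts per-pixel depth uncertainty, a depth uncertainty regression loss fed back during training, unprojection of the refined depth map into a 3D point cloud, and uncertainty-aware seed selection and vote refinement in a voting-based 3D detector. The sketch below illustrates the first stages of such a pipeline as one plausible reading of the claim; it is not the patented implementation. The module and function names (DepthUncertaintyHead, depth_uncertainty_regression_loss, unproject_to_point_cloud, sample_seeds), the Gaussian negative-log-likelihood form of the loss, and the uncertainty-weighted seed sampling are all assumptions introduced for illustration, using a standard pinhole camera model in PyTorch.

```python
# Hypothetical sketch of an uncertainty-aware monocular 3D detection front end.
# All names and the specific loss/sampling choices are illustrative assumptions,
# not taken from the patent text.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthUncertaintyHead(nn.Module):
    """Predicts a per-pixel depth map and a depth-uncertainty map (log-variance)
    from backbone features of a single monocular image."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.depth = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.log_var = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor):
        depth = F.softplus(self.depth(features))   # strictly positive depths
        log_var = self.log_var(features)           # per-pixel uncertainty
        return depth, log_var


def depth_uncertainty_regression_loss(pred_depth, log_var, gt_depth, valid_mask):
    """Gaussian negative log-likelihood against the ground-truth depth map:
    high predicted variance down-weights the squared error but is itself
    penalized. This is one common form of an uncertainty regression loss."""
    sq_err = (pred_depth - gt_depth) ** 2
    nll = 0.5 * (torch.exp(-log_var) * sq_err + log_var)
    return nll[valid_mask].mean()


def unproject_to_point_cloud(depth, intrinsics):
    """Lift an H x W depth map to a 3D point cloud with a pinhole camera model.
    `intrinsics` is assumed to be the tuple (fx, fy, cx, cy)."""
    h, w = depth.shape[-2:]
    fx, fy, cx, cy = intrinsics
    v, u = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    return torch.stack([x, y, z], dim=-1)           # (H*W, 3) points


def sample_seeds(points, pixel_uncertainty, num_seeds=1024):
    """Select seed positions from the point cloud, favoring points whose depth
    uncertainty is low -- one plausible reading of 'seed positions selected from
    the 3D point cloud and the aggregated depth uncertainty map'."""
    weights = torch.exp(-pixel_uncertainty.reshape(-1))  # low variance -> high weight
    idx = torch.multinomial(weights, num_seeds, replacement=False)
    return points[idx], idx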