US 12,073,563 B2
Systems and methods for birds eye view segmentation
Isht Dwivedi, Mountain View, CA (US); Yi-Ting Chen, Hsinchu (TW); and Behzad Dariush, San Ramon, CA (US)
Assigned to HONDA MOTOR CO., LTD., Tokyo (JP)
Filed by Honda Motor Co., Ltd., Tokyo (JP)
Filed on Mar. 31, 2022, as Appl. No. 17/710,807.
Claims priority of provisional application 63/215,259, filed on Jun. 25, 2021.
Prior Publication US 2022/0414887 A1, Dec. 29, 2022
Int. Cl. G06T 7/10 (2017.01); G06T 7/50 (2017.01); G06T 17/05 (2011.01); G06V 10/77 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01)
CPC G06T 7/10 (2017.01) [G06T 7/50 (2017.01); G06T 17/05 (2013.01); G06V 10/7715 (2022.01); G06V 10/7747 (2022.01); G06V 10/82 (2022.01); G06T 2207/10028 (2013.01); G06T 2207/20021 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2210/56 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A system for bird's eye view (BEV) segmentation, comprising:
a memory storing instructions that, when executed by a processor, cause the processor to:
receive an input image from an image sensor on an agent, wherein the input image is a perspective space image defined relative to a position and viewing direction of the agent, wherein the image sensor is associated with intrinsic parameters, and wherein the input image includes a plurality of pixels;
extract features from the input image using a first neural network (NN), wherein a feature is a piece of information about the content of the input image;
estimate a depth map that includes depth values for pixels of the plurality of pixels of the input image;
generate a three-dimensional (3D) point map based on the depth map and the intrinsic parameters of the image sensor, wherein the 3D point map includes points corresponding to the pixels of the input image;
generate a voxel grid by voxelizing the 3D point map into a plurality of voxels, wherein voxels of the plurality of voxels include a variable number of points, the voxel grid having a size of width w in the X direction, height h in the Y direction, and depth d in the Z direction of (X, Y, Z) 3D real-world coordinates;
apply the voxel grid for index referencing the input image as an indexed image, wherein features of a perspective frame based on the indexed image are combined with the voxel grid for feature fusion to predict a feature map;
generate the feature map by extracting feature vectors for pixels based on the points included in the voxels of the plurality of voxels, wherein the feature vectors include a length l feature vector for each voxel of the plurality of voxels, and the feature map is represented as a 4D feature map with a size of height h, width w, depth d, and length l; and
generate a BEV segmentation based on the feature map.
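
A minimal sketch of the claimed step of generating a 3D point map from the estimated depth map and the image sensor's intrinsic parameters, assuming a pinhole camera model with focal lengths fx, fy and principal point cx, cy (the function and parameter names are illustrative; the claim does not fix a particular camera model):

    import numpy as np

    def unproject_depth(depth, fx, fy, cx, cy):
        """Back-project an (h, w) depth map into an (h, w, 3) point map
        of (X, Y, Z) camera-frame coordinates (pinhole assumption)."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel coordinates
        z = depth
        x = (u - cx) * z / fx  # X: right of the optical axis
        y = (v - cy) * z / fy  # Y: downward in the image plane
        return np.stack([x, y, z], axis=-1)  # one 3D point per pixel

Each pixel of the input image thus maps to exactly one 3D point, preserving the pixel-to-point correspondence that the later claim elements rely on.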
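The voxelization step, in which voxels hold a variable number of points, can be sketched as a sparse bucketing of the point map; origin, voxel_size, and grid_dims are assumed parameters that the claim does not specify:

    import numpy as np
    from collections import defaultdict

    def voxelize(points, origin, voxel_size, grid_dims):
        """Bucket an (N, 3) point array into a sparse voxel grid of
        grid_dims = (w, h, d) voxels along X, Y, Z; each voxel keeps a
        variable-length list of the indices of the points inside it."""
        w, h, d = grid_dims
        idx = np.floor((points - origin) / voxel_size).astype(int)
        inside = ((idx >= 0) & (idx < np.array([w, h, d]))).all(axis=1)
        grid = defaultdict(list)
        for i in np.flatnonzero(inside):
            grid[tuple(idx[i])].append(i)  # variable number of points per voxel
        return grid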
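One plausible reading of the feature-fusion and 4D feature map elements: per-pixel features from the first NN are gathered into each voxel through the indexed-image correspondence, then pooled into a length l vector per voxel. The mean pooling below is an assumption; the claim does not name the pooling operator:

    import numpy as np

    def build_feature_map(grid, pixel_features, pixel_of_point, grid_dims, l):
        """Pool perspective-frame features into a 4D (h, w, d, l) map.
        pixel_features: (num_pixels, l) features from the first NN.
        pixel_of_point: array mapping each 3D point index to its source
        pixel, i.e. the indexed-image correspondence in the claim."""
        w, h, d = grid_dims
        fmap = np.zeros((h, w, d, l), dtype=np.float32)
        for (ix, iy, iz), pts in grid.items():
            feats = pixel_features[pixel_of_point[pts]]
            fmap[iy, ix, iz] = feats.mean(axis=0)  # assumed mean pooling
        return fmap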
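Finally, a hypothetical segmentation head illustrating how a BEV segmentation could be generated from the 4D feature map: the height (Y) axis is collapsed so that each ground-plane cell keeps an l-channel descriptor, and a small convolutional head (PyTorch, chosen here for illustration) predicts per-cell class scores. The patent does not disclose this particular head:

    import torch
    import torch.nn as nn

    class BEVSegHead(nn.Module):
        """Hypothetical head: collapse the (h, w, d, l) feature map along
        the height axis and predict class scores per ground-plane cell."""
        def __init__(self, l, num_classes):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(l, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(64, num_classes, kernel_size=1))

        def forward(self, fmap):                     # fmap: (h, w, d, l)
            bev = fmap.mean(dim=0)                   # (w, d, l): flatten height
            bev = bev.permute(2, 0, 1).unsqueeze(0)  # (1, l, w, d)
            return self.conv(bev)                    # (1, num_classes, w, d)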