CPC G06T 7/10 (2017.01) [G06T 7/50 (2017.01); G06T 17/05 (2013.01); G06V 10/7715 (2022.01); G06V 10/7747 (2022.01); G06V 10/82 (2022.01); G06T 2207/10028 (2013.01); G06T 2207/20021 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2210/56 (2013.01)] | 20 Claims |
1. A system for bird's eye view (BEV) segmentation, comprising:
a memory storing instructions that when executed by a processor cause the processor to:
receive an input image from an image sensor on an agent, wherein the input image is a perspective space image defined relative to a position and viewing direction of the agent, wherein the image sensor is associated with intrinsic parameters, and wherein the input image includes a plurality of pixels;
extract features from the input image using a first neural network (NN), wherein a feature is a piece of information about the content of the input image;
estimate a depth map that includes depth values for pixels of the plurality of pixels of the input image;
generate a three-dimensional (3D) point map based on the depth map and the intrinsic parameters of the image sensor, wherein the 3D point map includes points corresponding to the pixels of the input image;
generate a voxel grid by voxelizing the 3D point map into a plurality of voxels, wherein voxels of the plurality of voxels include a variable number of points, the voxel grid having a size width w in X direction, height h in Y direction and depth d in Z direction in (X, Y, Z) of 3D real world coordinates;
apply the voxel grid for index referencing the input image as an indexed image, wherein features of a perspective frame based on the indexed image are combined with the voxel grid for feature fusion to predict a feature map;
generate the feature map by extracting feature vectors for pixels based on the points included in the voxels of the plurality of voxels, wherein the feature vectors include a length feature vector for each voxel of the plurality of voxels, and the feature map is represented as a 4d feature map with a size height h, width w, depth d and length I; and
generate a BEV segmentation based on the feature map.
|