CPC G06V 20/582 (2022.01) [G06T 5/50 (2013.01); G06T 7/10 (2017.01); G06V 10/25 (2022.01); G06V 10/771 (2022.01); G06V 10/806 (2022.01); G06T 2207/10024 (2013.01); G06T 2207/10048 (2013.01); G06T 2207/20221 (2013.01); G06V 2201/07 (2022.01)] 15 Claims

1. A multimodal perception decision-making method for autonomous driving based on a large language model (LLM), comprising:
acquiring a red-green-blue (RGB) image and an infrared image of a target area at a current time;
processing the RGB image using a target detection model to obtain a predicted bounding box and corresponding target detection categories;
processing the infrared image, the predicted bounding box and the corresponding target detection categories using a segmentation model to obtain a target mask image;
fusing the RGB image, the target mask image and the infrared image using a fusion model to obtain a fused feature map;
performing fusion processing on first prompt information representing a user intent, second prompt information representing target detection category priorities, and the fused feature map, using a large vision-language model to obtain textual information; and
processing the textual information using a large natural language model to obtain a perception decision-making result;
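For orientation, the following is a minimal Python sketch of the claimed end-to-end pipeline. It is illustrative only: the function perceive_and_decide and the component callables (detector, segmenter, fuser, vlm, llm) are hypothetical placeholders standing in for the models recited above, not the patented implementation.

```python
# Hypothetical sketch of the claimed perception-to-decision pipeline.
# Every name here is an illustrative placeholder.
from typing import Callable

import torch

def perceive_and_decide(
    rgb: torch.Tensor,            # RGB image of the target area
    ir: torch.Tensor,             # infrared image of the same area and time
    detector: Callable,           # rgb -> (predicted boxes, categories)
    segmenter: Callable,          # (ir, boxes, categories) -> target mask image
    fuser: Callable,              # (rgb, mask, ir) -> fused feature map
    vlm: Callable,                # (intent, priorities, features) -> text
    llm: Callable,                # text -> perception decision-making result
    user_intent: str,             # first prompt information
    category_priorities: str,     # second prompt information
) -> str:
    boxes, categories = detector(rgb)
    mask = segmenter(ir, boxes, categories)
    fused = fuser(rgb, mask, ir)
    text = vlm(user_intent, category_priorities, fused)
    return llm(text)
```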
wherein the segmentation model comprises an image encoder, a prompt encoder and a mask decoder; and
the step of processing the infrared image, the predicted bounding box and the corresponding target detection categories using the segmentation model to obtain the target mask image comprises:
processing the infrared image using the image encoder to obtain image embedding features;
processing the predicted bounding box and the corresponding target detection categories using the prompt encoder to obtain prompt embedding features; and
processing the image embedding features and the prompt embedding features using the mask decoder to obtain the target mask image containing a mask and semantic labels;
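The encoder/prompt-encoder/mask-decoder split recited above resembles a Segment Anything (SAM)-style design. A minimal PyTorch sketch under that assumption follows; the class PromptedSegmenter, its channel widths, the patchifying convolution, and the additive prompt conditioning are illustrative choices, not the claimed model.

```python
# Minimal SAM-style sketch of the claimed segmentation path (assumptions
# throughout: single-channel infrared input, 16x16 patch embedding,
# additive prompt conditioning).
import torch
import torch.nn as nn

class PromptedSegmenter(nn.Module):
    def __init__(self, embed_dim: int = 256, num_categories: int = 80):
        super().__init__()
        # Image encoder: infrared image -> dense image embedding features
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, embed_dim, kernel_size=16, stride=16),
            nn.GELU(),
        )
        # Prompt encoder: box corners and category id -> prompt embedding
        self.box_embed = nn.Linear(4, embed_dim)
        self.cat_embed = nn.Embedding(num_categories, embed_dim)
        # Mask decoder: conditioned image embedding -> mask logits
        self.mask_decoder = nn.Conv2d(embed_dim, 1, kernel_size=1)

    def forward(self, ir, boxes, categories):
        # ir: (B, 1, H, W); boxes: (B, 4) floats; categories: (B,) long ids
        img = self.image_encoder(ir)                                 # (B, C, H/16, W/16)
        prompt = self.box_embed(boxes) + self.cat_embed(categories)  # (B, C)
        cond = img + prompt[:, :, None, None]   # fuse image and prompt embeddings
        mask_logits = self.mask_decoder(cond)   # (B, 1, H/16, W/16)
        return mask_logits, categories          # mask plus its semantic labels
```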
wherein the fusion model comprises: a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a sixth convolutional layer, a seventh convolutional layer, and an addition unit; and
the step of fusing the RGB image, the target mask image and the infrared image using the fusion model to obtain the fused feature map comprises:
processing the infrared image using the first convolutional layer to obtain a first feature map;
processing the target mask image using the second convolutional layer to obtain a second feature map;
processing the RGB image using the third convolutional layer to obtain a third feature map;
processing the first feature map, the second feature map and the third feature map using the fourth convolutional layer to obtain a fourth feature map;
processing the fourth feature map using the fifth convolutional layer to obtain a fifth feature map;
processing the fifth feature map using the sixth convolutional layer to obtain a sixth feature map;
processing the RGB image, the target mask image and the infrared image using the seventh convolutional layer to obtain a seventh feature map; and
adding together the sixth feature map and the seventh feature map using the addition unit to obtain the fused feature map.
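A hedged PyTorch sketch of the seven-layer fusion topology follows. Channel widths, kernel sizes, and the concatenations feeding the fourth and seventh convolutional layers are assumptions; the claim fixes only the layer ordering and the final addition.

```python
# Illustrative sketch of the claimed fusion model; all hyperparameters are
# assumptions. Inputs assumed: rgb (B,3,H,W), mask (B,1,H,W), ir (B,1,H,W).
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(1, ch, 3, padding=1)       # infrared -> first feature map
        self.conv2 = nn.Conv2d(1, ch, 3, padding=1)       # mask -> second feature map
        self.conv3 = nn.Conv2d(3, ch, 3, padding=1)       # RGB -> third feature map
        self.conv4 = nn.Conv2d(3 * ch, ch, 3, padding=1)  # merges the three branches
        self.conv5 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv6 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv7 = nn.Conv2d(5, ch, 3, padding=1)       # direct path on raw inputs

    def forward(self, rgb, mask, ir):
        f1, f2, f3 = self.conv1(ir), self.conv2(mask), self.conv3(rgb)
        f4 = self.conv4(torch.cat([f1, f2, f3], dim=1))     # fourth feature map
        f6 = self.conv6(self.conv5(f4))                     # fifth, then sixth
        f7 = self.conv7(torch.cat([rgb, mask, ir], dim=1))  # 3+1+1 = 5 channels
        return f6 + f7                                      # addition unit
```

Read this way, the seventh layer forms a shallow skip path over the raw inputs, so the addition unit blends deep, branch-merged features with features taken directly from the unprocessed images.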