US 12,354,375 B1
Multimodal perception decision-making method and apparatus for autonomous driving based on large language model
Zhiwei Li, Beijing (CN); Tingzhen Zhang, Beijing (CN); Haohan Wu, Beijing (CN); Weizheng Zhang, Beijing (CN); Weiye Xiao, Beijing (CN); Kunfeng Wang, Beijing (CN); Wei Zhang, Beijing (CN); Tianyu Shen, Beijing (CN); Li Wang, Beijing (CN); and Qifan Tan, Beijing (CN)
Assigned to Beijing University of Chemical Technology, Beijing (CN)
Filed by Beijing University of Chemical Technology, Beijing (CN)
Filed on Jan. 17, 2025, as Appl. No. 19/026,564.
Claims priority of application No. 202410243702.6 (CN), filed on Mar. 4, 2024.
Int. Cl. G06V 20/58 (2022.01); G06T 5/50 (2006.01); G06T 7/10 (2017.01); G06V 10/25 (2022.01); G06V 10/771 (2022.01); G06V 10/80 (2022.01)
CPC G06V 20/582 (2022.01) [G06T 5/50 (2013.01); G06T 7/10 (2017.01); G06V 10/25 (2022.01); G06V 10/771 (2022.01); G06V 10/806 (2022.01); G06T 2207/10024 (2013.01); G06T 2207/10048 (2013.01); G06T 2207/20221 (2013.01); G06V 2201/07 (2022.01)] 15 Claims
OG exemplary drawing
 
1. A multimodal perception decision-making method for autonomous driving based on a large language model (LLM), comprising:
acquiring a red-green-blue (RGB) image and an infrared image of a target area at a current time;
processing the RGB image using a target detection model to obtain a predicted bounding box and corresponding target detection categories;
processing the infrared image, the predicted bounding box and the corresponding target detection categories by using a segmentation model to obtain a target mask image;
fusing the RGB image, the target mask image and the infrared image using a fusion model to obtain a fused feature map;
performing fusion processing on first prompt information representing a user intent, second prompt information representing target detection category priorities, and the fused feature map, using a large vision-language model to obtain textual information; and
processing the textual information using a large natural language model to obtain a perception decision-making result;
wherein the segmentation model comprises an image encoder, a prompt encoder and a mask decoder; and
the step of processing the infrared image, the predicted bounding box and the corresponding target detection categories by using the segmentation model to obtain the target mask image comprises:
processing the infrared image using the image encoder to obtain image embedding features;
processing the predicted bounding box and the corresponding target detection categories using the prompt encoder to obtain prompt embedding features; and
processing the image embedding features and the prompt embedding features using the mask decoder to obtain the target mask image containing a mask and semantic labels;
wherein the fusion model comprises: a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a sixth convolutional layer, a seventh convolutional layer, and an addition unit; and
the step of fusing the RGB image, the target mask image and the infrared image using the fusion model to obtain the fused feature map comprises:
processing the infrared image using the first convolutional layer to obtain a first feature map;
processing the target mask image using the second convolutional layer to obtain a second feature map;
processing the RGB image using the third convolutional layer to obtain a third feature map;
processing the first feature map, the second feature map and the third feature map using the fourth convolutional layer to obtain a fourth feature map;
processing the fourth feature map using the fifth convolutional layer to obtain a fifth feature map;
processing the fifth feature map using the sixth convolutional layer to obtain a sixth feature map;
processing the RGB image, the target mask image and the infrared image using the seventh convolutional layer to obtain a seventh feature map; and
adding together the sixth feature map and the seventh feature map using the addition unit to obtain the fused feature map.
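Taken together, the claim describes a staged pipeline: detect in the RGB image, segment the infrared image using the detections as prompts, fuse the three images into one feature map, caption the fused features with a vision-language model, and decide with a language model. A minimal end-to-end sketch follows in plain Python; the callables detector, segmenter, fusion, vlm, and llm are hypothetical stand-ins for the claimed models, not APIs from the patent or from any specific library, and the two module sketches after this one show one possible shape for segmenter and fusion.

```python
def perceive_and_decide(rgb, infrared, user_intent, category_priorities,
                        detector, segmenter, fusion, vlm, llm):
    """One pass of the claimed pipeline. All five models are injected as
    callables; their names and signatures are illustrative, not the
    patent's."""
    # Steps 1-2: detect targets in the RGB image.
    boxes, categories = detector(rgb)
    # Step 3: segment the infrared image, prompted by the detections.
    mask, _labels = segmenter(infrared, boxes, categories)
    # Step 4: fuse RGB, target mask, and infrared into one feature map.
    fused = fusion(rgb, mask, infrared)
    # Step 5: the vision-language model turns the two prompts plus the
    # fused feature map into textual information.
    text = vlm(user_intent, category_priorities, fused)
    # Step 6: the language model turns the text into the decision result.
    return llm(text)
```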
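A minimal sketch of the three-module segmentation model (image encoder, prompt encoder, mask decoder), assuming PyTorch. The claim fixes only the module structure and its inputs and outputs; the layer choices, embedding dimension, box normalization, and the additive conditioning of image features on the prompt are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SegmentationModel(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 10):
        super().__init__()
        # Image encoder: infrared image -> image embedding features.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, dim, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )
        # Prompt encoder: predicted box + detected category -> prompt embedding.
        self.box_embed = nn.Linear(4, dim)            # (x1, y1, x2, y2), normalized
        self.cls_embed = nn.Embedding(num_classes, dim)
        # Mask decoder heads: fused embeddings -> mask + semantic label.
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.label_head = nn.Linear(dim, num_classes)

    def forward(self, infrared, box, category):
        # infrared: (B, 1, H, W); box: (B, 4) float; category: (B,) long.
        img = self.image_encoder(infrared)                        # (B, dim, H/4, W/4)
        prompt = self.box_embed(box) + self.cls_embed(category)   # (B, dim)
        # Condition the image embedding on the prompt embedding by addition.
        cond = img + prompt[:, :, None, None]
        mask = torch.sigmoid(self.mask_head(cond))    # target mask (1/4 resolution;
                                                      # upsampling omitted here)
        label = self.label_head(prompt)               # semantic label logits
        return mask, label
```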
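A minimal sketch of the seven-convolution fusion model, assuming PyTorch, with 3-channel RGB, 1-channel infrared, and 1-channel mask inputs at a common resolution. Channel widths, kernel sizes, and the use of channel concatenation to feed multiple inputs into the fourth and seventh convolutional layers are assumptions; the claim itself specifies only which feature maps each layer consumes and that the sixth and seventh feature maps are summed.

```python
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    def __init__(self, width: int = 64):
        super().__init__()
        # Per-modality branches (first through third convolutional layers).
        self.conv1 = nn.Conv2d(1, width, 3, padding=1)  # infrared -> first feature map
        self.conv2 = nn.Conv2d(1, width, 3, padding=1)  # target mask -> second feature map
        self.conv3 = nn.Conv2d(3, width, 3, padding=1)  # RGB -> third feature map
        # Trunk over the concatenated branch features (fourth through sixth layers).
        self.conv4 = nn.Conv2d(3 * width, width, 3, padding=1)
        self.conv5 = nn.Conv2d(width, width, 3, padding=1)
        self.conv6 = nn.Conv2d(width, width, 3, padding=1)
        # Shortcut over the raw concatenated inputs (seventh layer), summed with
        # the trunk output by the addition unit.
        self.conv7 = nn.Conv2d(1 + 1 + 3, width, 3, padding=1)

    def forward(self, rgb, mask, ir):
        f1 = self.conv1(ir)
        f2 = self.conv2(mask)
        f3 = self.conv3(rgb)
        f4 = self.conv4(torch.cat([f1, f2, f3], dim=1))
        f5 = self.conv5(f4)
        f6 = self.conv6(f5)
        f7 = self.conv7(torch.cat([ir, mask, rgb], dim=1))
        return f6 + f7  # addition unit -> fused feature map
```

With these shapes, FusionModel()(rgb, mask, ir) on inputs of shape (1, 3, 256, 256), (1, 1, 256, 256), and (1, 1, 256, 256) returns a (1, 64, 256, 256) fused feature map. Read this way, the claimed layout is a residual pattern: a processed trunk (layers four through six) plus a shallow shortcut over the raw inputs (layer seven).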