| CPC G06T 15/00 (2013.01) [G06T 5/70 (2024.01); G06T 7/80 (2017.01); G06V 10/44 (2022.01); G06V 10/806 (2022.01); H04N 19/597 (2014.11)] | 9 Claims |

1. A generalizable neural radiance field reconstruction method based on multi-modal information fusion, comprising:
Step 1, constructing photometric features and geometric features based on unstructured multi-views, and constructing a multi-modal neural encoder by performing incremental complementary fusion of the photometric features and the geometric features;
Step 2, converting features of the multi-modal neural encoder and raw Red-Green-Blue (RGB) pixel values of the unstructured multi-views into a volume density and a radiance;
Step 3, sampling rays on the basis of the constructed multi-modal neural encoder, and aggregating context features of the sampled rays with a transformer network to obtain ray context features (see the second sketch following the claim); and
Step 4, decoding, using the ray context features, the volume density and the radiance; rendering, based on the decoded volume density and radiance, a free-view Red-Green-Blue-Depth (RGB-D) image; and guiding dense reconstruction of a low-texture scene by combining photometric supervision with sparse geometric supervision (see the third sketch following the claim);
wherein, in Step 1, constructing the photometric features comprises:
using a bi-directional fusion backbone network $f_T$ to extract image features;
using a ConvNeXt network to extract multi-scale semantic information at 4, 8, 16, and 32 times downsampling, wherein the multi-scale semantic information provides overall surface features of regions and targets;
extracting shallow localized appearance features at the 4 times downsampling scale;
encoding the unstructured multi-views into semantically enhanced photometric features $F_i^T$ through bi-directional feature fusion, wherein the semantically enhanced photometric features are given by:
$F_i^T = f_T(I_i)$;
wherein $I_i$ denotes the unstructured multi-views (a minimal sketch of this photometric encoder follows the claim).
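The following is a minimal sketch of the photometric branch in the wherein clause above: ConvNeXt stage features taken at 4, 8, 16, and 32 times downsampling are fused bi-directionally (a top-down pass followed by a bottom-up pass) into the semantically enhanced map $F_i^T = f_T(I_i)$. It assumes PyTorch/torchvision; the convnext_tiny backbone, the stage node names, the 64-channel fusion width, and the FPN/PAN-style pathway are illustrative choices, since the claim does not fix the exact architecture of $f_T$.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import convnext_tiny
    from torchvision.models.feature_extraction import create_feature_extractor

    class PhotometricEncoder(nn.Module):
        """f_T: image I_i -> semantically enhanced photometric features F_i^T."""

        def __init__(self, dim=64):
            super().__init__()
            backbone = convnext_tiny(weights=None)
            # Stage outputs at 4x, 8x, 16x, 32x downsampling (96/192/384/768 channels).
            self.body = create_feature_extractor(
                backbone,
                return_nodes={"features.1": "c4", "features.3": "c8",
                              "features.5": "c16", "features.7": "c32"})
            chans = {"c4": 96, "c8": 192, "c16": 384, "c32": 768}
            self.lateral = nn.ModuleDict(
                {k: nn.Conv2d(c, dim, 1) for k, c in chans.items()})
            self.smooth = nn.ModuleList(
                [nn.Conv2d(dim, dim, 3, padding=1) for _ in range(4)])
            self.down = nn.ModuleList(
                [nn.Conv2d(dim, dim, 3, stride=2, padding=1) for _ in range(3)])

        def forward(self, image):                      # image: (B, 3, H, W)
            feats = self.body(image)
            p = [self.lateral[k](feats[k]) for k in ("c4", "c8", "c16", "c32")]
            # Top-down pass: deep multi-scale semantics flow into the shallow,
            # localized appearance features extracted at 4x downsampling.
            for i in range(2, -1, -1):
                p[i] = p[i] + F.interpolate(p[i + 1], size=p[i].shape[-2:],
                                            mode="bilinear", align_corners=False)
            p = [conv(x) for conv, x in zip(self.smooth, p)]
            # Bottom-up pass: refined local appearance is re-injected upward...
            for i in range(1, 4):
                p[i] = p[i] + self.down[i - 1](p[i - 1])
            # ...and the deepest refined map is folded back into the 4x map,
            # completing the bi-directional exchange.
            return p[0] + F.interpolate(p[3], size=p[0].shape[-2:],
                                        mode="bilinear", align_corners=False)

    F_iT = PhotometricEncoder()(torch.randn(1, 3, 256, 320))   # (1, 64, 64, 80)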
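A minimal sketch of Step 3, again assuming PyTorch: stratified sampling of depths along each ray, followed by self-attention over the per-sample features of a ray to produce ray context features. The sampler, the stock nn.TransformerEncoder, and the randomly generated stand-in features (which in the method would be looked up from the multi-modal neural encoder) are all illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def sample_along_rays(origins, dirs, near, far, n_samples):
        """Stratified depths t and 3D points along each of R rays."""
        r = origins.shape[0]
        edges = torch.linspace(0.0, 1.0, n_samples + 1, device=origins.device)
        lo, hi = edges[:-1], edges[1:]
        t = near + (far - near) * (lo + (hi - lo) *
                                   torch.rand(r, n_samples, device=origins.device))
        pts = origins[:, None, :] + t[..., None] * dirs[:, None, :]
        return t, pts                                  # (R, N), (R, N, 3)

    class RayContextAggregator(nn.Module):
        """Self-attention over the samples of each ray -> ray context features."""

        def __init__(self, feat_dim=64, n_heads=4, n_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

        def forward(self, sample_feats):               # (R, N, C)
            return self.encoder(sample_feats)          # contextualized (R, N, C)

    rays_o = torch.zeros(1024, 3)                      # camera at the origin
    rays_d = F.normalize(torch.randn(1024, 3), dim=-1)
    t, pts = sample_along_rays(rays_o, rays_d, near=0.5, far=4.0, n_samples=32)
    feats = torch.randn(1024, 32, 64)  # stand-in for encoder features at pts
    ctx = RayContextAggregator()(feats)                # (1024, 32, 64)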
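A minimal sketch of Step 4 under the same assumptions: two small heads decode volume density and radiance from the ray context features, NeRF-style alpha compositing renders per-ray color and depth (the free-view RGB-D output), and a photometric loss on all rays is combined with a sparse geometric depth loss on the few rays that have, e.g., structure-from-motion depth. The heads, the compositing formula, and the unit loss weighting are a standard recipe, not the claim's exact decoder.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RGBDDecoder(nn.Module):
        """Decode volume density and radiance from ray context features, then
        alpha-composite them into per-ray color and depth (an RGB-D pixel)."""

        def __init__(self, feat_dim=64):
            super().__init__()
            self.sigma_head = nn.Linear(feat_dim, 1)   # volume density head
            self.rgb_head = nn.Linear(feat_dim, 3)     # radiance head

        def forward(self, ctx, t):                     # ctx: (R, N, C), t: (R, N)
            sigma = F.softplus(self.sigma_head(ctx)).squeeze(-1)  # (R, N), >= 0
            rgb = torch.sigmoid(self.rgb_head(ctx))               # (R, N, 3)
            delta = torch.diff(t, dim=-1)                         # sample spacing
            delta = torch.cat([delta, torch.full_like(delta[:, :1], 1e10)], dim=-1)
            alpha = 1.0 - torch.exp(-sigma * delta)
            trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)    # transmittance
            trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
            w = alpha * trans                                     # render weights
            color = (w[..., None] * rgb).sum(dim=1)               # (R, 3)
            depth = (w * t).sum(dim=1)                            # (R,)
            return color, depth

    ctx = torch.randn(1024, 32, 64)                    # ray context features (Step 3)
    t = torch.sort(torch.rand(1024, 32) * 3.5 + 0.5, dim=-1).values  # sorted depths
    color, depth = RGBDDecoder()(ctx, t)

    # Photometric supervision on every ray; sparse geometric supervision only
    # on rays with known depth -- the low-texture guidance of Step 4.
    gt_rgb = torch.rand(1024, 3)
    sparse_mask = torch.rand(1024) < 0.05
    sparse_depth = torch.rand(1024) * 3.5 + 0.5
    loss = (F.mse_loss(color, gt_rgb) +
            F.l1_loss(depth[sparse_mask], sparse_depth[sparse_mask]))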