US 12,456,250 B1
System and method for reconstructing 3D scene data from 2D image data
Yijun Wang, Auckland (NZ); Yikai Wang, Beijing (CN); and Kim Chia, Petaling Jaya (MY)
Assigned to Futureverse IP Limited, Auckland (NZ)
Filed by Futureverse IP Limited, Auckland (NZ)
Filed on Nov. 14, 2024, as Appl. No. 18/948,309.
Int. Cl. G06T 15/20 (2011.01); G06T 15/04 (2011.01); G06T 17/20 (2006.01); G06T 19/20 (2011.01)

CPC G06T 15/20 (2013.01) [G06T 15/04 (2013.01); G06T 17/20 (2013.01); G06T 19/20 (2013.01); G06T 2219/2016 (2013.01); G06T 2219/2021 (2013.01)]

13 Claims

1. A method for reconstructing a three-dimensional (3D) scene from a two-dimensional (2D) input image of the scene using a fully-differentiable transformer-based encoder-decoder network, the method comprising:

encoding the 2D input image into a set of image features using a pre-trained vision transformer model, wherein the vision transformer model is pre-trained with multiview RGB image supervision and point cloud supervision;

projecting the set of image features onto a 3D triplane representation using a transformer decoder to obtain output triplane tokens;

upsampling and reshaping the output triplane tokens into a triplane representation;

querying the triplane representation to obtain 3D point features;

predicting, using a multi-layer perceptron, 3D point features of color and density for volumetric rendering;

representing the geometry of a 3D asset with a surface mesh including vertices and triangular faces by encoding a shape of the 3D asset as a continuous representation of its surface including generating, by a triplane transformer, a neural signed distance function (SDF) and a texture field via two latent codes to encode the shape and texture of the 3D scene, and extracting an iso-surface corresponding to the neural SDF using a differentiable version of the Marching Cubes algorithm, to thereby allow for end-to-end training of the network;

representing a texture map with a multichannel image in UV space; and

synthesizing multiple views of the 3D scene by generating multi-view images simultaneously based on the surface mesh.
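
For illustration only, the querying and volumetric rendering steps recited above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the claimed implementation: the function names (bilinear_sample, query_triplane, composite), the XY/XZ/YZ plane ordering, the normalization of points to [-1, 1], the concatenation of the three plane features, and the standard alpha-compositing quadrature are all hypothetical choices that the claim does not specify. The multi-layer perceptron that maps point features to color and density is omitted; its outputs are taken as given inputs to composite.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at coords u, v in [-1, 1]."""
    C, H, W = plane.shape
    # Map normalized coordinates to continuous pixel coordinates.
    x = (u + 1.0) * 0.5 * (W - 1)
    y = (v + 1.0) * 0.5 * (H - 1)
    # Lower-left texel of each 2x2 neighborhood, kept inside the plane.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx = np.clip(x - x0, 0.0, 1.0)
    wy = np.clip(y - y0, 0.0, 1.0)
    top = plane[:, y0, x0] * (1 - wx) + plane[:, y0, x0 + 1] * wx
    bot = plane[:, y0 + 1, x0] * (1 - wx) + plane[:, y0 + 1, x0 + 1] * wx
    return (top * (1 - wy) + bot * wy).T  # shape (N, C)

def query_triplane(planes, points):
    """Project points onto three axis-aligned planes and gather features.

    planes: (3, C, H, W), assumed ordered XY, XZ, YZ (an assumption).
    points: (N, 3) with coordinates assumed normalized to [-1, 1].
    Returns (N, 3C) features; concatenation is one possible aggregation.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    feats = [
        bilinear_sample(planes[0], x, y),
        bilinear_sample(planes[1], x, z),
        bilinear_sample(planes[2], y, z),
    ]
    return np.concatenate(feats, axis=1)

def composite(densities, colors, deltas):
    """Alpha-composite samples along one ray, ordered near to far.

    densities: (S,), colors: (S, 3), deltas: (S,) sample spacings.
    Returns the rendered RGB value and the per-sample weights.
    """
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches each sample.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0), weights
```

In a trainable network these operations would be written in an automatic-differentiation framework so that gradients flow back to the triplane tokens; NumPy is used here only to keep the arithmetic visible.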