CPC G06T 17/00 (2013.01) [G06T 7/80 (2017.01); G06V 10/44 (2022.01); G06V 10/462 (2022.01); G06V 10/771 (2022.01); G06T 2207/20081 (2013.01)] (8 Claims)

1. A monocular video-based three-dimensional reconstruction and depth prediction method for a pipeline, comprising:
calibrating a depth camera using Zhang's calibration method to obtain camera internal parameters, and collecting monocular videos in different pipeline scenes using a pipeline robot equipped with the depth camera;
applying a COLMAP method to perform feature extraction and matching and incremental reconstruction on a pipeline scene image sequence composed of each monocular video to obtain camera external parameters corresponding to images in each pipeline scene image sequence, and constructing a pipeline three-dimensional reconstruction dataset according to the camera internal parameters, the camera external parameters corresponding to the images in each pipeline scene image sequence, the pipeline scene image sequence and a corresponding real depth map;
training a Fast-MVSNet network and a PatchMatchNet network using public datasets to obtain a plurality of trained Fast-MVSNet network models and a plurality of trained PatchMatchNet network models, wherein the public datasets comprise a DTU dataset, a BlendedMVS dataset, and an ETH3D dataset;
inputting the pipeline three-dimensional reconstruction dataset into all the trained Fast-MVSNet network models and all the trained PatchMatchNet network models, and evaluating all the trained network models to obtain an optimal network model, wherein the trained network models comprise all the trained Fast-MVSNet network models and all the trained PatchMatchNet network models; and
performing three-dimensional reconstruction and depth prediction on a pipeline three-dimensional reconstruction dataset to be processed using the optimal network model;
wherein the applying a COLMAP method to perform feature extraction and matching and incremental reconstruction on a pipeline scene image sequence composed of each monocular video to obtain camera external parameters corresponding to images in each pipeline scene image sequence, and constructing a pipeline three-dimensional reconstruction dataset according to the camera internal parameters, the camera external parameters corresponding to the images in each pipeline scene image sequence, the pipeline scene image sequence and a corresponding real depth map comprise:
performing Scale Invariant Feature Transform (SIFT) feature extraction and feature matching on each pipeline scene image sequence, and eliminating false matching point pairs in each matched image pair using an epipolar geometry relationship;
sorting all images according to a number of matching points for each pipeline scene image sequence, with an image with a largest number of matching points as a starting point, and selecting two images with a number of matching point pairs between the two images greater than a first predetermined value and a translation vector between the two images greater than a second predetermined value as an initial image pair; and setting camera external parameters of a first image in the initial image pair as an identity matrix;
calculating camera external parameters of a second image in the initial image pair using the epipolar geometry relationship according to a 2D-2D matching relationship of the initial image pair and the identity matrix, wherein the 2D-2D matching relationship refers to a SIFT feature matching relationship;
triangulating the initial image pair to generate initial pipeline scene three-dimensional points in the pipeline three-dimensional scene;
selecting an optimal image from currently unselected images as a current newly added image according to a number of points with a corresponding relationship between two-dimensional points in the image and currently constructed pipeline scene three-dimensional points and a distribution of the corresponding two-dimensional points in the images, wherein the corresponding two-dimensional points refer to two-dimensional points, which have the corresponding relationship with the currently constructed pipeline scene three-dimensional points, in the image;
obtaining a 2D-3D matching relationship between the current newly added image and each target image according to a 2D-2D matching relationship between the current newly added image and currently selected images, and calculating camera external parameters of the current newly added image using a Random Sample Consensus (RANSAC)-Perspective-n-Point (PnP) method according to the 2D-3D matching relationship and the camera internal parameters, wherein the target image is an image having the 2D-2D matching relationship with the current newly added image among the currently selected images;
triangulating the current newly added image and the target image to generate a newly added pipeline scene three-dimensional point in the pipeline three-dimensional scene; and performing Bundle Adjustment (BA) optimization on the camera external parameters of the currently selected images and the currently constructed pipeline scene three-dimensional points to obtain optimized three-dimensional points and optimized camera external parameters corresponding to the currently selected images;
returning to execute the step of “selecting an optimal image from currently unselected images as a current newly added image according to a number of points with a corresponding relationship between two-dimensional points in the image and currently constructed pipeline scene three-dimensional points and a distribution of the corresponding two-dimensional points in the images” until no newly added image can be selected; and
constructing the pipeline three-dimensional reconstruction dataset using the initial image pair and all the newly added images, wherein the pipeline three-dimensional reconstruction dataset comprises all selected images, and a real depth map, camera internal and external parameters, depth information of each pixel point and optimal source view serial numbers corresponding to each selected image; the selected images comprise the initial image pair and all the newly added images; the real depth map corresponding to each selected image is obtained using the depth camera; the optimal source view serial numbers corresponding to each selected image are the frame serial numbers of a plurality of images closest in frame serial number to the selected image.
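The triangulation recited in the claim (generating pipeline scene three-dimensional points from a matched image pair with known camera parameters) can be illustrated by the standard linear (DLT) method. The following is a minimal pure-NumPy sketch, not the claimed implementation; the function name `triangulate_point` and the example camera matrices are illustrative assumptions, and real pipelines (e.g. COLMAP) add outlier rejection and bundle adjustment around this step.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2 : 3x4 camera projection matrices (K [R | t]).
    x1, x2 : 2D pixel coordinates of the matched point in each view.
    Returns the 3D point in world coordinates.
    """
    # Each view contributes two rows of the homogeneous system A X = 0,
    # derived from the cross product x × (P X) = 0.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the right singular vector of A
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

Consistent with the claim, the first camera's external parameters may be fixed to the identity (P1 = K [I | 0]) and the second camera's pose recovered from epipolar geometry; each triangulated point then seeds the incremental reconstruction that later images extend via RANSAC-PnP.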