US 12,080,010 B2
Self-supervised multi-frame monocular depth estimation model
James Watson, London (GB); Oisin MacAodha, Edinburgh (GB); Victor Adrian Prisacariu, London (GB); Gabriel J. Brostow, London (GB); and Michael David Firman, London (GB)
Assigned to NIANTIC, INC., San Francisco, CA (US)
Filed by Niantic, Inc., San Francisco, CA (US)
Filed on Dec. 8, 2021, as Appl. No. 17/545,201.
Claims priority of provisional application 63/124,757, filed on Dec. 12, 2020.
Prior Publication US 2022/0189049 A1, Jun. 16, 2022
Int. Cl. G06T 7/55 (2017.01); G01B 11/22 (2006.01); G06T 3/18 (2024.01); G06T 7/73 (2017.01); G06T 11/00 (2006.01)
CPC G06T 7/55 (2017.01) [G01B 11/22 (2013.01); G06T 3/18 (2024.01); G06T 7/73 (2017.01); G06T 11/00 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving a time series of images of a scene including a primary image and an additional image from an earlier time than the primary image, wherein the time series of images are monocular images derived from monocular video;
inputting the time series of images into a depth estimation model;
receiving, as output from the depth estimation model, a depth map of the primary image, the depth map generated based on a cost volume concatenating differences between a primary feature map of the primary image and a plurality of warped feature maps of the additional image for each of a plurality of depth planes, wherein receiving the depth map as output from the depth estimation model comprises:
generating the primary feature map for the primary image and an additional feature map for the additional image;
generating a warped feature map comprising a plurality of warped feature map layers, each warped feature map layer generated by warping the additional feature map to one of the plurality of depth planes based on (1) the depth plane to which the additional feature map is being warped, (2) a relative pose between the primary image and the additional image, and (3) intrinsics of a camera used to capture the primary image and the additional image;
for each warped feature map layer, calculating a difference between the warped feature map layer and the primary feature map; and
building the cost volume by concatenating the differences between the warped feature map layers and the primary feature map;
wherein the output is based on the cost volume and the primary feature map;
generating virtual content using the depth map; and
displaying an image of the scene augmented with the virtual content.
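The cost-volume construction recited in claim 1 can be sketched compactly. The following PyTorch code is an illustrative reading of the claim, not the patented implementation: the function name build_cost_volume, the tensor shapes, and the use of an L1 difference averaged over feature channels are assumptions made here for concreteness. Given the primary and additional feature maps, the camera intrinsics and their inverse, the relative pose between the two frames, and a set of hypothesised depth planes, the sketch warps the additional features to each plane, differences them against the primary features, and concatenates the results into the cost volume.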
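import torch
import torch.nn.functional as F

def build_cost_volume(f_primary, f_additional, K, K_inv, T_rel, depth_planes):
    # f_primary, f_additional: (B, C, H, W) feature maps from the encoder.
    # K, K_inv: (B, 4, 4) camera intrinsics and their inverse.
    # T_rel: (B, 4, 4) relative pose, primary camera -> additional camera.
    # depth_planes: iterable of D scalar depth hypotheses.
    B, C, H, W = f_primary.shape

    # Homogeneous pixel grid of the primary view, shape (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    layers = []
    for d in depth_planes:
        # (1) Back-project primary pixels to 3-D points at hypothesised depth d.
        cam_pts = d * (K_inv[:, :3, :3] @ pix.unsqueeze(0))            # (B, 3, H*W)
        cam_pts = torch.cat([cam_pts, torch.ones_like(cam_pts[:, :1])], dim=1)
        # (2) Transform into the additional camera's frame with the relative
        #     pose and (3) project with the shared camera intrinsics.
        proj = K[:, :3, :3] @ (T_rel @ cam_pts)[:, :3]                 # (B, 3, H*W)
        u = proj[:, 0] / (proj[:, 2] + 1e-7) / (W - 1) * 2 - 1
        v = proj[:, 1] / (proj[:, 2] + 1e-7) / (H - 1) * 2 - 1
        grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)

        # Sample the additional feature map at the reprojected locations to
        # form one warped feature map layer for this depth plane.
        warped = F.grid_sample(f_additional, grid,
                               padding_mode="zeros", align_corners=True)

        # Difference between the warped layer and the primary feature map
        # (an L1 distance averaged over channels, one plausible choice).
        layers.append((warped - f_primary).abs().mean(dim=1))          # (B, H, W)

    # Concatenate the per-plane differences into the cost volume.
    return torch.stack(layers, dim=1)                                  # (B, D, H, W)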
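Per the claim's final limitation, the output is based on the cost volume together with the primary feature map. One plausible arrangement, again an assumption rather than the claimed design, is to channel-concatenate the two tensors and pass them to a decoder; depth_decoder below is a hypothetical module, not named in the claim:

decoder_input = torch.cat([cost_volume, f_primary], dim=1)   # (B, D + C, H, W)
depth_map = depth_decoder(decoder_input)                     # hypothetical depth decoder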