US 11,810,354 B2
Generating digital floorplans from sparse digital video utilizing an audio-visual floorplan reconstruction machine learning model
Kristen Lorraine Grauman, Austin, TX (US); Senthil Purushwalkam Shiva Prakash, Pittsburgh, PA (US); Sebastia Vicenc Amengual Gari, Redmond, WA (US); Vamsi Krishna Ithapu, Kirkland, WA (US); Carl Schissler, Redmond, WA (US); Philip Robinson, Seattle, WA (US); and Abhinav Gupta, Pittsburgh, PA (US)
Assigned to Meta Platforms, Inc., Menlo Park, CA (US)
Filed by Meta Platforms, Inc., Menlo Park, CA (US)
Filed on Apr. 12, 2021, as Appl. No. 17/228,112.
Prior Publication US 2022/0327316 A1, Oct. 13, 2022
Int. Cl. G06V 20/40 (2022.01); G06T 11/20 (2006.01); G06N 3/08 (2023.01); G10L 25/57 (2013.01); G06F 18/25 (2023.01)

CPC G06V 20/46 (2022.01) [G06F 18/25 (2023.01); G06N 3/08 (2013.01); G06T 11/203 (2013.01); G06V 20/41 (2022.01); G10L 25/57 (2013.01)]

20 Claims

1. A computer-implemented method comprising:

receiving a digital video depicting a viewed portion of a three-dimensional space, wherein the three-dimensional space comprises the viewed portion and an unviewed portion;

extracting visual features and audio features from a plurality of frame-audio sample pairs of the digital video by:

generating a visual feature encoding based on the visual features;

generating an audio feature encoding based on the audio features; and

generating an aligned visual feature encoding and an aligned audio feature encoding by projecting the visual feature encoding and the audio feature encoding to a two-dimensional feature grid;

generating a floorplan prediction utilizing a trained audio-visual floorplan reconstruction machine learning model from the aligned visual feature encoding and the aligned audio feature encoding generated from the plurality of frame-audio sample pairs; and

generating a two-dimensional floorplan of the viewed portion and the unviewed portion of the three-dimensional space from the floorplan prediction.
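
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything named here is an assumption made for illustration: the grid size, the toy band-averaging encoders, the per-pair camera cell, the projection radii, and the random weights that stand in for a trained audio-visual floorplan reconstruction model. The sketch only shows the order of operations: encode each frame-audio sample pair, project both encodings onto a shared two-dimensional feature grid so they are spatially aligned, fuse them across pairs, and read out a per-cell floorplan prediction.

```python
import numpy as np

GRID = 16      # assumed side length of the top-down feature grid (cells)
CHANNELS = 4   # assumed feature width per grid cell


def encode_visual(frame):
    """Toy visual encoder (assumption): mean intensity of CHANNELS horizontal bands."""
    bands = np.array_split(frame, CHANNELS, axis=0)
    return np.array([band.mean() for band in bands])


def encode_audio(samples):
    """Toy audio encoder (assumption): mean log-magnitude in CHANNELS frequency bands."""
    spectrum = np.log1p(np.abs(np.fft.rfft(samples)))
    bands = np.array_split(spectrum, CHANNELS)
    return np.array([band.mean() for band in bands])


def project_to_grid(encoding, cell, radius):
    """Write an encoding into every grid cell within `radius` of `cell`."""
    grid = np.zeros((GRID, GRID, CHANNELS))
    rows, cols = np.ogrid[:GRID, :GRID]
    mask = (rows - cell[0]) ** 2 + (cols - cell[1]) ** 2 <= radius ** 2
    grid[mask] = encoding
    return grid


def predict_floorplan(pairs, weights, bias=0.0):
    """Fuse aligned encodings over all frame-audio pairs and score each cell."""
    visual = np.zeros((GRID, GRID, CHANNELS))
    audio = np.zeros((GRID, GRID, CHANNELS))
    for frame, samples, cell in pairs:
        # Illustrative choice: audio covers a wider area than the camera view.
        visual += project_to_grid(encode_visual(frame), cell, radius=2)
        audio += project_to_grid(encode_audio(samples), cell, radius=6)
    fused = np.concatenate([visual, audio], axis=-1)  # (GRID, GRID, 2 * CHANNELS)
    logits = fused @ weights + bias                   # (GRID, GRID)
    return 1.0 / (1.0 + np.exp(-logits))              # per-cell probability


rng = np.random.default_rng(0)
# Three hypothetical frame-audio sample pairs, each with an assumed camera cell.
pairs = [
    (rng.random((32, 32)), rng.standard_normal(256), (4 + 4 * i, 8))
    for i in range(3)
]
weights = rng.standard_normal(2 * CHANNELS)  # untrained placeholder, not a trained model
probabilities = predict_floorplan(pairs, weights)
floorplan = probabilities > 0.5  # boolean two-dimensional floorplan grid
```

In this sketch the audio encoding is written over a larger radius than the visual encoding, which is one simple way to let audio contribute to cells outside the viewed portion. The claim itself does not specify how the projection is performed.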