US 12,266,160 B2
Masked autoencoders for computer vision
Kaiming He, Palo Alto, CA (US); Piotr Dollar, San Mateo, CA (US); Ross Girshick, Seattle, WA (US); Saining Xie, Sunnyvale, CA (US); Xinlei Chen, Belmont, CA (US); and Yanghao Li, Sunnyvale, CA (US)
Assigned to Meta Platforms, Inc., Menlo Park, CA (US)
Filed by Meta Platforms, Inc., Menlo Park, CA (US)
Filed on Jul. 27, 2022, as Appl. No. 17/875,210.
Prior Publication US 2024/0096072 A1, Mar. 21, 2024
Int. Cl. G06V 10/778 (2022.01); G06V 10/26 (2022.01); G06V 10/75 (2022.01); G06V 10/774 (2022.01)
CPC G06V 10/778 (2022.01) [G06V 10/26 (2022.01); G06V 10/751 (2022.01); G06V 10/774 (2022.01)] 20 Claims
 
1. A method, implemented by a computing system, comprising:
accessing a plurality of images to facilitate pre-training a first machine-learning model comprising an encoder and a decoder; and
using the plurality of images to pre-train the first machine-learning model by:
dividing at least one image, of the plurality of images, into a set of patches;
selecting a first subset of the patches to be visible and a second subset of the patches to be masked during the pre-training;
processing, using the encoder, the first subset of patches and corresponding first positional encodings to generate corresponding first latent representations;
processing, using the decoder, the first latent representations corresponding to the first subset of patches and mask tokens corresponding to the second subset of patches to generate reconstructed patches corresponding to the second subset of patches, wherein the reconstructed patches and the first subset of patches are used to generate a reconstructed image; and
updating the first machine-learning model based on comparisons between the at least one image and the reconstructed image.
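Below is a minimal, illustrative PyTorch sketch of the pre-training procedure recited in claim 1. It is not the patentees' implementation: the module sizes, the 75% masking ratio, the Transformer configuration, the random selection of visible patches, and the mean-squared-error comparison are assumptions introduced here only to make the steps concrete.

```python
# Illustrative sketch of the claim-1 pre-training steps (assumed hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedAutoencoderSketch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, dec_dim=128,
                 mask_ratio=0.75):  # mask_ratio is an assumption, not from the claim
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.patch_dim = 3 * patch_size ** 2
        self.mask_ratio = mask_ratio

        # Encoder: patch embedding, positional encodings, Transformer blocks.
        self.patch_embed = nn.Linear(self.patch_dim, dim)
        self.enc_pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)

        # Decoder: shared mask token, positional encodings, Transformer blocks.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.dec_pos = nn.Parameter(torch.zeros(1, self.num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=2)
        self.to_pixels = nn.Linear(dec_dim, self.patch_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) -- the image already divided
        # into flattened patches.
        b, n, _ = patches.shape
        num_visible = int(n * (1 - self.mask_ratio))

        # Select a first (visible) subset and a second (masked) subset of patches.
        shuffle = torch.rand(b, n, device=patches.device).argsort(dim=1)
        visible_idx, masked_idx = shuffle[:, :num_visible], shuffle[:, num_visible:]

        # Encode the visible patches with their positional encodings.
        vis = torch.gather(patches, 1,
                           visible_idx.unsqueeze(-1).expand(-1, -1, self.patch_dim))
        vis_pos = torch.gather(self.enc_pos.expand(b, -1, -1), 1,
                               visible_idx.unsqueeze(-1).expand(-1, -1, self.enc_pos.shape[-1]))
        latents = self.encoder(self.patch_embed(vis) + vis_pos)

        # Decode the visible-patch latents plus mask tokens for the masked patches.
        dec_in = torch.cat(
            [self.enc_to_dec(latents),
             self.mask_token.expand(b, masked_idx.shape[1], -1)], dim=1)
        all_idx = torch.cat([visible_idx, masked_idx], dim=1)
        dec_pos = torch.gather(self.dec_pos.expand(b, -1, -1), 1,
                               all_idx.unsqueeze(-1).expand(-1, -1, self.dec_pos.shape[-1]))
        decoded = self.decoder(dec_in + dec_pos)

        # Build the reconstructed image from the original visible patches and the
        # reconstructed (previously masked) patches, then compare it to the input.
        pred = self.to_pixels(decoded)
        recon = torch.zeros_like(patches)
        recon.scatter_(1, all_idx.unsqueeze(-1).expand(-1, -1, self.patch_dim),
                       torch.cat([vis, pred[:, num_visible:]], dim=1))
        return F.mse_loss(recon, patches)
```

Consistent with the claim language, only the first (visible) subset of patches passes through the encoder, while mask tokens for the second subset are introduced at the decoder stage; the resulting loss on the reconstructed image is what would drive the model update in the final step of claim 1.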