US 12,248,794 B2
Self-supervised system for learning a user interface language
Oriana Riva, Redmond, WA (US); Shweti Mahajan, Kirkland, WA (US); Pratyay Banerjee, Tempe, AZ (US); Kushal Arora, Montreal (CA); Weiwei Yang, Seattle, WA (US); Christopher Miles White, Seattle, WA (US); and Sahisnu Mazumder, Santa Clara, CA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Mar. 22, 2022, as Appl. No. 17/701,313.
Prior Publication US 2023/0305863 A1, Sep. 28, 2023
Int. Cl. G06F 9/451 (2018.01); G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/80 (2022.01); G06V 40/40 (2022.01)

CPC G06F 9/451 (2018.02) [G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/811 (2022.01)]

20 Claims

1. A computer implemented method comprising:

accessing training data including user interface images, associated text recognized from the user interface images, and proximately located text providing instructions describing use of the user interface;

pairing each image with text captions derived from the proximately located text and image's text recognized from the user interface images;

training a vision and language model in a self-supervised manner using language masking, region masking, and image-text alignment techniques on respective image region features and tokenized text captions; and

performing fine-tuning of the vision and language model to obtain a specialized model representing user interface elements and associated functions.
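
The data-preparation side of the steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `UIExample`, `pair_caption`, `mask_tokens`, `mask_regions`, and `alignment_pairs` are hypothetical, as are the `[MASK]` placeholder, the whitespace tokenizer, the zero-vector region mask, and the masking rates. The sketch only shows how caption pairing, language masking, region masking, and image-text alignment samples might be produced before they are fed to a vision and language model; the model itself and the fine-tuning step are omitted.

```python
import random
from dataclasses import dataclass

MASK_TOKEN = "[MASK]"  # assumed placeholder token, not taken from the patent


@dataclass
class UIExample:
    region_features: list  # one feature vector (list of floats) per UI image region
    caption_tokens: list   # tokenized caption paired with the image


def pair_caption(ocr_text, nearby_text):
    """Build a caption from proximately located instruction text plus the
    text recognized in the image (whitespace tokenization is an assumption)."""
    return (nearby_text + " " + ocr_text).lower().split()


def mask_tokens(tokens, rate, rng):
    """Language masking: hide a fraction of tokens and record the originals
    as prediction targets, keyed by token position."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            masked.append(MASK_TOKEN)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def mask_regions(features, rate, rng):
    """Region masking: replace a fraction of region feature vectors with
    zeros and record the originals as targets, keyed by region position."""
    masked, targets = [], {}
    for i, vec in enumerate(features):
        if rng.random() < rate:
            masked.append([0.0] * len(vec))
            targets[i] = list(vec)
        else:
            masked.append(list(vec))
    return masked, targets


def alignment_pairs(examples, rng):
    """Image-text alignment: emit each image with its own caption (label 1)
    and with a caption drawn from a different example (label 0)."""
    pairs = []
    for i, ex in enumerate(examples):
        pairs.append((ex.region_features, ex.caption_tokens, 1))
        others = [k for k in range(len(examples)) if k != i]
        if others:
            j = rng.choice(others)
            pairs.append((ex.region_features, examples[j].caption_tokens, 0))
    return pairs
```

In such a sketch, a training loop would ask the model to recover the entries in `targets` from the masked inputs and to predict the 0/1 alignment label, which requires no manual annotation and is what makes the objectives self-supervised.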