CPC G06F 9/451 (2018.02) [G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/811 (2022.01)] | 20 Claims |
1. A computer-implemented method comprising:
accessing training data including user interface images, associated text recognized from the user interface images, and proximately located text providing instructions describing use of the user interface;
pairing each image with text captions derived from the proximately located text and the image's text recognized from the user interface images;
training a vision and language model in a self-supervised manner using language masking, region masking, and image-text alignment techniques on respective image region features and tokenized text captions; and
performing fine-tuning of the vision and language model to obtain a specialized model representing user interface elements and associated functions.
|
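
The data-preparation side of claim 1 (pairing each image with a caption, then producing inputs for language masking, region masking, and image-text alignment) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the names `UIExample`, `build_caption`, `mask_tokens`, `mask_regions`, and `alignment_pairs`, the whitespace tokenizer, and the masking scheme are all assumptions introduced here. The vision and language model itself and the fine-tuning step are not shown.

```python
import random
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical mask token, not taken from the claim


@dataclass
class UIExample:
    """One training item: a UI image plus its two text sources (hypothetical type)."""
    image_id: str     # stands in for the user interface image
    ocr_text: str     # text recognized from the image
    nearby_text: str  # proximately located instructional text


def build_caption(example: UIExample) -> str:
    # Assumption: the caption is the nearby text followed by the recognized text.
    return f"{example.nearby_text} {example.ocr_text}".strip()


def mask_tokens(tokens, rate, rng):
    """Language masking: replace a token with MASK with probability `rate`.

    Returns the masked sequence and per-position labels
    (the original token where masked, None elsewhere).
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def mask_regions(num_regions, rate, rng):
    """Region masking: one boolean per image region, True where the
    region's features would be hidden from the model."""
    return [rng.random() < rate for _ in range(num_regions)]


def alignment_pairs(examples, rng):
    """Image-text alignment: each image with its own caption (label 1)
    and with another example's caption (label 0)."""
    pairs = []
    for i, ex in enumerate(examples):
        pairs.append((ex.image_id, build_caption(ex), 1))
        others = [k for k in range(len(examples)) if k != i]
        if others:
            j = rng.choice(others)
            pairs.append((ex.image_id, build_caption(examples[j]), 0))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    data = [
        UIExample("img_001", "Save Cancel", "Click Save to store your changes."),
        UIExample("img_002", "Search", "Type a query and press Search."),
    ]
    tokens = build_caption(data[0]).split()  # whitespace tokenization (assumption)
    print(mask_tokens(tokens, 0.15, rng))
    print(mask_regions(4, 0.15, rng))
    print(alignment_pairs(data, rng))
```

In a real pipeline the masked tokens and regions would feed the model's reconstruction losses, and the labeled pairs would feed a matching loss. Those parts depend on the chosen architecture and are left out here.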