US 12,327,331 B2
System and method for augmenting vision transformers
Akash Umakantha, Warren, MI (US); S. Alireza Golestaneh, Pittsburgh, PA (US); Joao Semedo, Pittsburgh, PA (US); and Wan-Yi Lin, Wexford, PA (US)
Assigned to Robert Bosch GmbH, Stuttgart (DE)
Filed by Robert Bosch GmbH, Stuttgart (DE)
Filed on Dec. 2, 2021, as Appl. No. 17/540,326.
Prior Publication US 2023/0177662 A1, Jun. 8, 2023
Int. Cl. G06N 3/08 (2023.01); G06N 20/20 (2019.01); G06T 5/50 (2006.01); G06T 11/00 (2006.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01)
CPC G06T 5/50 (2013.01) [G06N 20/20 (2019.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/20132 (2013.01); G06T 2207/20212 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for training a machine learning system that includes a vision transformer, the computer-implemented method comprising:
obtaining a content image;
obtaining a first style image;
obtaining a second style image;
performing a first style transfer to transfer a first style from the first style image to the content image to generate a first stylized latent representation;
performing a second style transfer to transfer a second style from the second style image to the content image to generate a second stylized latent representation;
generating a first augmented image based on the first stylized latent representation;
generating a second augmented image based on the second stylized latent representation;
generating, via the vision transformer, a predicted label for each of the content image, the first augmented image, and the second augmented image;
computing a loss output for the vision transformer, the loss output including a consistency loss based at least on the predicted label of each of the content image, the first augmented image, and the second augmented image to train the vision transformer to become invariant to augmentations of the first augmented image and the second augmented image; and
updating at least one parameter of the vision transformer based on the loss output.
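The claimed training step can be sketched in miniature. The patent does not fix a particular style-transfer operator, consistency formula, or network; the sketch below assumes an AdaIN-style transfer of per-channel feature statistics, a mean-KL consistency term, and a toy linear classifier head standing in for the vision transformer. All names (`adain`, `consistency_loss`, `predict`, `W`) are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def adain(content, style, eps=1e-5):
    """Transfer per-channel mean/std of style features onto content features
    (an AdaIN-style latent style transfer; an assumed choice, the claim only
    requires some style transfer producing a stylized latent representation)."""
    c_mu = content.mean(axis=-1, keepdims=True)
    c_std = content.std(axis=-1, keepdims=True)
    s_mu = style.mean(axis=-1, keepdims=True)
    s_std = style.std(axis=-1, keepdims=True)
    return s_std * (content - c_mu) / (c_std + eps) + s_mu

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(preds):
    """Mean KL divergence of each prediction from their average: zero when the
    clean and augmented predictions agree (a common consistency regularizer;
    the claim does not fix the formula)."""
    p = np.stack(preds)
    m = p.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(m + 1e-12)), axis=-1)))

# Toy latent features standing in for encoder outputs (4 channels x 16 dims).
c_feat = rng.normal(size=(4, 16))                 # content image latent
s1_feat = rng.normal(2.0, 3.0, size=(4, 16))      # first style image latent
s2_feat = rng.normal(-1.0, 0.5, size=(4, 16))     # second style image latent

z1 = adain(c_feat, s1_feat)   # first stylized latent representation
z2 = adain(c_feat, s2_feat)   # second stylized latent representation

# Hypothetical linear head standing in for the vision transformer (3 classes).
W = rng.normal(scale=0.1, size=(64, 3))
def predict(feat, weights):
    return softmax(feat.reshape(-1) @ weights)

p_clean = predict(c_feat, W)
p_aug1 = predict(z1, W)
p_aug2 = predict(z2, W)
cons = consistency_loss([p_clean, p_aug1, p_aug2])

# Loss output: cross-entropy on the content image's (assumed) label plus the
# consistency term over all three predictions.
label = 1
ce = -np.log(p_clean[label] + 1e-12)
loss = ce + cons

# One SGD step on the classifier weights using the cross-entropy gradient
# (in a real system autograd would backpropagate the full loss through the
# transformer, updating many parameters rather than just this head).
onehot = np.eye(3)[label]
grad = np.outer(c_feat.reshape(-1), p_clean - onehot)
W_new = W - 0.1 * grad
ce_after = -np.log(predict(c_feat, W_new)[label] + 1e-12)
```

In this toy version the "augmented images" are never decoded back to pixel space; the stylized latents are classified directly. The claim additionally generates augmented images from the stylized latents before classification, which a decoder network would handle in a full implementation.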