US 12,226,919 B2
Device and method for training a machine learning model to derive a movement vector for a robot from image data
Oren Spector, Modiin Maccabim Reut (IL); Dotan Di Castro, Haifa (IL); and Vladimir Tchuiev, Haifa (IL)
Assigned to ROBERT BOSCH GMBH, Stuttgart (DE)
Filed by Robert Bosch GmbH, Stuttgart (DE)
Filed on Feb. 27, 2023, as Appl. No. 18/174,803.
Claims priority of application No. 10 2022 202 142.8 (DE), filed on Mar. 2, 2022.
Prior Publication US 2023/0278227 A1, Sep. 7, 2023
Int. Cl. G06T 7/70 (2017.01); B25J 9/16 (2006.01)

CPC B25J 9/1697 (2013.01) [G06T 7/70 (2017.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01)]

8 Claims

1. A method for training a machine learning model to derive a movement vector for a robot from image data, comprising the following steps:

acquiring images from a perspective of an end-effector of the robot;

forming training image data elements from the acquired images;

generating augmentations of the training image data elements to form augmented image data elements;

training an encoder network by a first training set of image data elements including the training image data elements and the augmented image data elements using contrastive loss, wherein, for each image data element of the first training set, another image data element is a positive sample when the other image data element is an augmentation of the image data element, and is a negative sample otherwise; and

training a neural network from training data elements of a second training set, wherein each training data element of the second training set includes at least one respective image data element and a respective ground truth movement vector, wherein the training is by feeding, for each training data element, the at least one image data element to the trained encoder network, inputting an embedding output provided by the encoder network in response to the at least one image data element, and training the neural network to reduce a loss between movement vectors output by the neural network in response to the embedding outputs for the training data elements and the respective ground truth movement vectors for the training data elements.
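
The two losses recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and it only evaluates the losses; it performs no gradient updates. The claim does not specify a particular contrastive loss or regression loss, so the InfoNCE-style formulation, the cosine similarity, the temperature value, and the mean squared error are all assumptions made for illustration. The names contrastive_loss, mse_loss, and source_ids are hypothetical. A further assumption is that two image data elements count as a positive pair whenever they derive from the same acquired image, which treats the "augmentation of" relation symmetrically.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(embeddings, source_ids, temperature=0.1):
    """InfoNCE-style loss averaged over all (anchor, positive) pairs.

    Assumption: elements sharing a source_id (an image and its
    augmentations) are positives; every other element is a negative.
    """
    total, pairs = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        # Temperature-scaled similarity of the anchor to every other element.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp over all candidates, shifted by the max for stability.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in logits.values()))
        for j, s in logits.items():
            if source_ids[j] == source_ids[i]:
                total += log_denom - s  # negative log-probability of the positive
                pairs += 1
    if pairs == 0:
        raise ValueError("no positive pairs: each source needs an augmentation")
    return total / pairs


def mse_loss(predicted, ground_truth):
    """Mean squared error between predicted and ground-truth movement vectors.

    Assumption: the claim only requires 'a loss'; MSE is one common choice.
    """
    errors = [
        (p - g) ** 2
        for p_vec, g_vec in zip(predicted, ground_truth)
        for p, g in zip(p_vec, g_vec)
    ]
    return sum(errors) / len(errors)


# Toy embeddings: elements 0 and 1 are nearly parallel, as are 2 and 3.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = contrastive_loss(toy_embeddings, source_ids=[0, 0, 1, 1])
loss_mismatched = contrastive_loss(toy_embeddings, source_ids=[0, 1, 0, 1])
```

In this toy setup the loss is lower when the positive labels group the nearly parallel embeddings together than when they pair dissimilar ones, which is the behavior the encoder training step is meant to reward. In the second stage, mse_loss would be evaluated on the movement vectors that the neural network outputs for the embeddings of the second training set.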