US 12,332,613 B2
Device and method for training a control strategy for a control device over several iterations
Felix Schmitt, Ludwigsburg (DE); and Johannes Maximilian Doellinger, Leonberg (DE)
Assigned to ROBERT BOSCH GMBH, Stuttgart (DE)
Filed by Robert Bosch GmbH, Stuttgart (DE)
Filed on Mar. 3, 2021, as Appl. No. 17/191,091.
Claims priority of application No. 102020205532.7 (DE), filed on Apr. 30, 2020.
Prior Publication US 2021/0341885 A1, Nov. 4, 2021
Int. Cl. G05B 13/02 (2006.01); G05B 13/04 (2006.01); G05D 1/00 (2024.01); G06N 3/08 (2023.01)

CPC G05B 13/027 (2013.01) [G05B 13/042 (2013.01); G05D 1/0221 (2013.01); G06N 3/08 (2013.01)]

13 Claims

1. A method for training a control strategy for a control device over several iterations, the method comprising, in each iteration:

determining an exploration strategy for finding a safe action, for a current version of the control strategy;

carrying out several simulation runs, including, for each of the simulation runs, performing:

beginning with an initial state of the simulation run, selecting an action in accordance with the exploration strategy, and checking whether the selected action is safe, the selecting and the checking being performed until a safe action has been selected or until a maximum number of actions greater than or equal to two has been selected in accordance with the exploration strategy and checked as to whether the selected actions are safe,

(i) if a safe action has been selected, ascertaining a follow-up state of the state in the sequence of states by simulation during execution of the selected action,

(ii) if no safe action has been selected up to reaching of the maximum number of actions in accordance with the exploration strategy: (a) interrupting the simulation run, or (b) selecting a specified, safe action, if a specified safe action is available in the state of the sequence of states and ascertaining the follow-up state of the state by simulation during execution of the selected, specified, safe action,

wherein the initial state and the follow-up state form a sequence of states of the simulation run, and

collecting the sequence of states, including the selected actions and rewards received in the states, as data of the simulation run;

ascertaining a value of a loss function over the data of the executed simulation runs for the iteration; and

adapting the control strategy to a new version, so that the value of the loss function is reduced;

wherein, for at least one of the several simulation runs, several attempts are made to find a safe action in accordance with the exploration strategy by performing the selecting of an action in accordance with the exploration strategy and the checking as to whether the selected action is safe multiple times, up to the maximum number of actions greater than or equal to two being selected in accordance with the exploration strategy and checked as to whether the selected actions are safe, for the at least one of the several simulation runs; and

wherein an action is safe when a risk of damage or danger to an agent that uses the control device lies below a predetermined threshold value.
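
The per-run action selection recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names select_safe_action, simulation_run, explore, risk, step, reward, fallback and horizon are hypothetical, the state and action are assumed to be plain floats, and the loss evaluation and strategy update of each iteration are left out. The sketch only shows the bounded retry (at least two attempts), the risk threshold test, and the two alternatives when no safe action is found (interrupting the run or using a specified safe action).

```python
from typing import Callable, List, Optional, Tuple

State = float   # assumption: toy one-dimensional state
Action = float  # assumption: toy one-dimensional action
Transition = Tuple[State, Action, float, State]  # (state, action, reward, next state)


def select_safe_action(
    state: State,
    explore: Callable[[State], Action],
    risk: Callable[[State, Action], float],
    risk_threshold: float,
    max_attempts: int,
    fallback: Optional[Callable[[State], Optional[Action]]] = None,
) -> Optional[Action]:
    """Sample from the exploration strategy until an action is safe.

    An action counts as safe when its estimated risk lies below
    risk_threshold. After max_attempts unsafe samples, the specified
    safe action is used if one is available; otherwise None is returned.
    """
    if max_attempts < 2:
        raise ValueError("max_attempts must be at least 2")
    for _ in range(max_attempts):
        action = explore(state)
        if risk(state, action) < risk_threshold:
            return action
    if fallback is not None:
        return fallback(state)  # may itself return None if unavailable
    return None


def simulation_run(
    initial_state: State,
    explore: Callable[[State], Action],
    risk: Callable[[State, Action], float],
    risk_threshold: float,
    step: Callable[[State, Action], State],
    reward: Callable[[State, Action, State], float],
    horizon: int,
    max_attempts: int,
    fallback: Optional[Callable[[State], Optional[Action]]] = None,
) -> List[Transition]:
    """Collect the data of one simulation run as a list of transitions."""
    state = initial_state
    data: List[Transition] = []
    for _ in range(horizon):
        action = select_safe_action(
            state, explore, risk, risk_threshold, max_attempts, fallback
        )
        if action is None:
            break  # alternative (a): interrupt the simulation run
        next_state = step(state, action)  # follow-up state by simulation
        data.append((state, action, reward(state, action, next_state), next_state))
        state = next_state
    return data
```

In a full training iteration, the data returned by several such runs would be passed to a loss function and the control strategy would then be adapted so that the loss value decreases. Passing fallback=None corresponds to alternative (a) of step (ii), and passing a function that returns a specified safe action corresponds to alternative (b).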