CPC G06N 20/00 (2019.01) [G06N 10/60 (2022.01); G06N 5/01 (2023.01); G06N 7/01 (2023.01)]
8 Claims

1. A learning device comprising:
a memory storing software code; and
a hardware processor configured to execute the software code to:
receive input of a functional form of a reward used in a reward function specifying a reward for performance of an action by an autonomous vehicle for a state of an environment of the autonomous vehicle, wherein
the state includes one or more of a map of and/or road conditions in the surroundings of the autonomous vehicle, and positions and/or speeds of other vehicles in the surroundings, and
the action includes changing a path of the autonomous vehicle and a speed of the autonomous vehicle;
specify a model by which the reward function is to be learned, based on the functional form of the reward, wherein the functional form of the reward is input as a binary neural network or a Hubbard model;
learn the reward function according to the specified model, thereby learning a policy for selecting the action to be performed by the autonomous vehicle based on the state of the environment of the autonomous vehicle;
receive a state of the environment of the autonomous vehicle;
determine the action to be performed by the autonomous vehicle by applying the learned policy to the state of the environment of the autonomous vehicle; and
control the autonomous vehicle to cause the autonomous vehicle to perform the determined action, wherein
when the state indicates that there is an obstacle in front of the autonomous vehicle, the determined action is to change the path of the autonomous vehicle to avoid the obstacle, such that the learned policy provides a practical improvement in autonomous vehicle technology by providing for obstacle avoidance.
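The control flow recited in claim 1 (receive a state, determine an action by applying the learned policy, control the vehicle accordingly) can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: the names (`VehicleState`, `learned_policy`, `control_step`) and the simple functional form of the reward are hypothetical stand-ins, and the greedy action selection stands in for the policy that the claim learns from the reward function via the specified model.

```python
# Hypothetical sketch of the claimed control loop. All names and the
# reward's functional form are illustrative assumptions, not from the claim.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VehicleState:
    # State per the claim: surroundings/road conditions, plus
    # positions and speeds of other vehicles (position, speed pairs).
    obstacle_ahead: bool
    other_vehicles: List[Tuple[float, float]] = field(default_factory=list)

ACTIONS = ["keep_path", "change_path", "change_speed"]

def reward(state: VehicleState, action: str) -> float:
    # Assumed functional form of the reward: any action other than
    # changing the path while an obstacle is ahead is heavily penalized;
    # maneuvering carries a small cost otherwise.
    if state.obstacle_ahead and action != "change_path":
        return -10.0
    if action == "keep_path":
        return 0.0
    return -1.0

def learned_policy(state: VehicleState) -> str:
    # Stand-in for the learned policy: greedily select the action
    # with the highest reward for the received state.
    return max(ACTIONS, key=lambda a: reward(state, a))

def control_step(state: VehicleState) -> str:
    # Determine the action for the received state; a real device would
    # then control the vehicle to perform it.
    return learned_policy(state)

# With an obstacle ahead, the policy changes the path to avoid it.
print(control_step(VehicleState(obstacle_ahead=True)))   # change_path
print(control_step(VehicleState(obstacle_ahead=False)))  # keep_path
```

Under this assumed reward, the sketch reproduces the obstacle-avoidance behavior recited in the final wherein clause: the path is changed exactly when the state indicates an obstacle in front of the vehicle.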