CPC G06N 20/00 (2019.01); 14 Claims

1. A method for reinforcement learning (RL) of continuous actions for controlling physical systems, comprising:
receiving a state as input to at least one actor network to predict candidate actions based on the state, wherein the state is a current observation;
outputting the candidate actions from the at least one actor network;
receiving the state and the candidate actions as inputs to a plurality of distributional critic networks trained in parallel, with independent random initializations of network parameters, through interactions with an environment, wherein there is no direct interaction between different critic networks, wherein the plurality of distributional critic networks calculates quantiles of a return distribution associated with the candidate actions in relation to the state, and wherein the plurality of distributional critic networks converges to the same values for previously visited state-action pairs while disagreeing on novel state-action pairs;
outputting the quantiles from the plurality of distributional critic networks; and
selecting an output action based on the candidate actions and the quantiles, wherein the selecting comprises:
executing high-epistemic-uncertainty actions in early training stages to accelerate exploration of optimal control parameters for a physical system; and
transitioning to low-uncertainty actions in later stages to promote convergence to optimal control policies to control the physical system.
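For illustration only, the following is a minimal sketch of the action-selection method recited in claim 1, written in PyTorch. Every class, function, and parameter name (Actor, QuantileCritic, select_action, beta, n_candidates, noise, N_QUANTILES, N_CRITICS, and the network sizes) is a hypothetical choice, not language from the patent, and the beta term stands in for the claimed transition from high-epistemic-uncertainty actions early in training to low-uncertainty actions later.

```python
import torch
import torch.nn as nn

N_QUANTILES = 32  # quantiles of the return distribution per critic
N_CRITICS = 5     # independently initialized critics, trained with no
                  # direct interaction between them

class Actor(nn.Module):
    """Maps a state (current observation) to a candidate action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, state):
        return self.net(state)  # continuous action in [-1, 1]^action_dim

class QuantileCritic(nn.Module):
    """Outputs quantiles of the return distribution for a state-action pair."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, N_QUANTILES))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def select_action(actor, critics, state, beta, n_candidates=16, noise=0.1):
    """Pick an output action from noisy candidates around the actor's output.

    beta > 0 favors high ensemble disagreement (epistemic uncertainty),
    i.e. exploration early in training; beta <= 0 favors low-uncertainty
    actions, promoting convergence to the control policy later in training.
    """
    base = actor(state)  # candidate action proposed by the actor
    candidates = (base + noise * torch.randn(n_candidates, base.shape[-1])
                  ).clamp(-1.0, 1.0)
    states = state.expand(n_candidates, -1)
    # Stack each critic's quantiles: (N_CRITICS, n_candidates, N_QUANTILES).
    quantiles = torch.stack([c(states, candidates) for c in critics])
    mean_return = quantiles.mean(dim=-1)  # expected return per critic
    value = mean_return.mean(dim=0)       # ensemble-mean value estimate
    epistemic = mean_return.std(dim=0)    # disagreement = epistemic uncertainty
    return candidates[(value + beta * epistemic).argmax()]

# Independent random initializations: each critic instance draws its own
# fresh parameters, so after training the critics agree on well-visited
# state-action pairs but disagree on novel ones.
critics = [QuantileCritic(state_dim=8, action_dim=2) for _ in range(N_CRITICS)]
actor = Actor(state_dim=8, action_dim=2)

state = torch.randn(8)
action_early = select_action(actor, critics, state, beta=+1.0)  # explore
action_late = select_action(actor, critics, state, beta=-0.5)   # exploit
```

In practice beta would be annealed over training steps, for example decayed linearly from a positive value to a non-positive one; the claim does not specify a schedule, so any particular annealing rule is an assumption of this sketch.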