CPC G06N 5/043 (2013.01) [G06N 20/00 (2019.01)]; 20 Claims
1. A computer-implemented method comprising:
for each agent of a multi-agent system, sampling an action with a policy of the agent based on a first state, wherein at least one agent of the multi-agent system is an implicit agent that plays against other agents of the multi-agent system by playing to minimize both an expected immediate reward for the implicit agent and an expected future reward for the implicit agent;
executing a joint action with the agents and observing a second state;
receiving an uncertain reward at each agent in response to the joint action;
storing the joint action, uncertain reward, first state, and second state in a replay buffer accessible to each agent;
for each agent, until a terminal state is reached:
sampling a random batch of samples from the replay buffer,
updating a critic of the agent by minimizing loss between a predicted version of an action-value function and an uncertain version of the action-value function, and
updating an actor of the agent, the updating to factor in the uncertain version of the action-value function.
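The claimed steps resemble a multi-agent actor-critic training loop with a shared replay buffer, in which one "implicit" agent plays adversarially by minimizing both its immediate and its future reward. Below is a minimal runnable sketch under simplifying assumptions: a toy discrete environment, tabular critics and actor preferences, and reward noise standing in for the "uncertain reward." All names (`Agent`, `ReplayBuffer`, `toy_env_step`) are illustrative and do not appear in the claim.

```python
import random
from collections import deque

class ReplayBuffer:
    """Replay buffer accessible to every agent (claim step: storing)."""
    def __init__(self, capacity=1000):
        self.buf = deque(maxlen=capacity)

    def store(self, joint_action, reward, s1, s2):
        self.buf.append((joint_action, reward, s1, s2))

    def sample(self, k):
        # Random batch of samples from the replay buffer.
        return random.sample(list(self.buf), min(k, len(self.buf)))

class Agent:
    """Tabular actor-critic over a tiny discrete state/action space.

    An implicit agent plays to MINIMIZE both its expected immediate
    reward and its expected future reward (simplified here by taking
    min instead of max over actions).
    """
    def __init__(self, idx, n_actions, implicit=False, lr=0.1, gamma=0.9):
        self.idx = idx            # which slot of the joint action is ours
        self.q = {}               # critic: (state, action) -> value estimate
        self.pref = {}            # actor preferences: (state, action) -> score
        self.n_actions = n_actions
        self.implicit = implicit
        self.lr, self.gamma = lr, gamma

    def act(self, state):
        scores = [self.pref.get((state, a), 0.0) for a in range(self.n_actions)]
        pick = min if self.implicit else max   # implicit agent minimizes
        return pick(range(self.n_actions), key=lambda a: scores[a])

    def update(self, batch):
        for joint_action, reward, s1, s2 in batch:
            a = joint_action[self.idx]
            extreme = min if self.implicit else max
            # "Uncertain" (bootstrapped) version of the action-value function.
            future = extreme(self.q.get((s2, b), 0.0) for b in range(self.n_actions))
            target = reward + self.gamma * future
            # Critic update: minimize the gap between prediction and target.
            pred = self.q.get((s1, a), 0.0)
            self.q[(s1, a)] = pred + self.lr * (target - pred)
            # Actor update factors in the uncertain action-value estimate.
            self.pref[(s1, a)] = self.pref.get((s1, a), 0.0) + self.lr * self.q[(s1, a)]

def toy_env_step(state, joint_action):
    """Toy environment: reward 1 when all agents pick action 0, else 0."""
    reward = 1.0 if all(a == 0 for a in joint_action) else 0.0
    return (state + 1) % 3, reward

random.seed(0)
agents = [Agent(idx=0, n_actions=2), Agent(idx=1, n_actions=2, implicit=True)]
buffer = ReplayBuffer()
state = 0
for _ in range(50):
    joint = tuple(ag.act(state) for ag in agents)    # sample action per agent
    next_state, reward = toy_env_step(state, joint)  # execute joint action
    noisy = reward + random.gauss(0, 0.01)           # uncertain reward
    buffer.store(joint, noisy, state, next_state)    # shared replay buffer
    for ag in agents:
        ag.update(buffer.sample(8))                  # critic + actor updates
    state = next_state
```

In a full implementation the tabular critic would be a neural network trained by gradient descent on the squared loss between predicted and target action-values, but the sketch preserves the claim's control flow: per-agent action sampling, joint execution, shared buffer storage, and per-agent critic-then-actor updates on random batches.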