| CPC G06F 16/9024 (2019.01) [G06F 17/12 (2013.01)] | 20 Claims |

|
1. A method performed by one or more computers, wherein the method comprises:
obtaining experience data generated as a result of controlling an agent in an environment to perform a sequence of one or more actions from a possible set of actions, each action being performed in response to receiving a respective observation characterizing a respective state of the environment;
using the experience data to update a visitation count for each of one or more state-action pairs, wherein each state-action pair represents a respective state of the environment and a respective action performed by the agent; and
performing an environment exploration step to explore the environment, the performing comprising:
computing a utility measure for each of the one or more state-action pairs, wherein computing the utility measure comprises evaluating a closed form utility function using at least the updated visitation counts;
determining, based on the utility measures, a sequence of one or more planned actions that have an information gain that satisfies a threshold; and
controlling the agent to perform the sequence of one or more planned actions to cause the environment to transition from a state characterized by a last observation received after a last action in the experience data into a different state.
|
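The steps recited in claim 1 can be illustrated with a minimal tabular sketch. It is not the claimed implementation, and everything specific in it is an assumption made for illustration:

- The claim does not specify the closed-form utility function. The form `1 / sqrt(1 + count)` is hypothetical.
- Experience is assumed to be `(state, action, next_state)` tuples.
- Summed utility stands in for "information gain".
- Planning is an exhaustive search over short action sequences.
- The function names `update_counts`, `utility`, and `plan` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def update_counts(counts, experience):
    """Increment the visitation count of each (state, action) pair in the experience."""
    for state, action, _next_state in experience:
        counts[(state, action)] += 1
    return counts


def utility(count):
    """Hypothetical closed-form utility: shrinks as a pair is visited more often."""
    return 1.0 / math.sqrt(1.0 + count)


def plan(counts, model, start, actions, horizon, threshold):
    """Search action sequences, shortest first, and return the first one whose
    summed utility (a stand-in for information gain) meets the threshold.
    Returns (None, 0.0) if no sequence within the horizon qualifies."""
    for length in range(1, horizon + 1):
        for seq in product(actions, repeat=length):
            state, gain = start, 0.0
            for action in seq:
                gain += utility(counts.get((state, action), 0))
                # Simplification: if the transition was never observed,
                # assume the state does not change.
                state = model.get((state, action), state)
            if gain >= threshold:
                return list(seq), gain
    return None, 0.0


# Experience data: (state, action, next_state) tuples (assumed format).
experience = [("s0", "right", "s1"), ("s1", "right", "s2"), ("s0", "right", "s1")]

counts = update_counts(defaultdict(int), experience)

# Deterministic transition model built from the same experience (an assumption).
model = {(s, a): s_next for s, a, s_next in experience}

# Plan from the state reached after the last action in the experience data.
last_state = experience[-1][2]
planned_actions, gain = plan(
    counts, model, last_state, actions=["left", "right"], horizon=2, threshold=1.0
)
```

In this toy run the planner starts from `s1`, the state reached by the last recorded action. It selects the single untried action `left`, because an unvisited state-action pair has the largest utility under the assumed function. A real system would then execute `planned_actions` in the environment and fold the resulting experience back into `counts`.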