US 12,189,688 B2
Fast exploration and learning of latent graph models
Sivaramakrishnan Swaminathan, Mountain View, CA (US); Meet Kirankumar Dave, Santa Clara, CA (US); Miguel Lazaro-Gredilla, Union City, CA (US); and Dileep George, Sunnyvale, CA (US)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Sep. 27, 2023, as Appl. No. 18/373,870.
Claims priority of provisional application 63/436,845, filed on Jan. 3, 2023.
Claims priority of provisional application 63/411,031, filed on Sep. 28, 2022.
Prior Publication US 2024/0126812 A1, Apr. 18, 2024
Int. Cl. G06F 16/901 (2019.01); G06F 17/12 (2006.01)
CPC G06F 16/9024 (2019.01) [G06F 17/12 (2013.01)]
20 Claims
OG exemplary drawing
 
1. A method of generating a graph model representing an environment being interacted with by an agent, wherein the graph model comprises nodes that represent states of the environment and edges connecting the nodes, wherein an edge between a first node and a second node in the graph model represents a corresponding action performed by the agent which caused the environment to transition from a state represented by the first node into a state represented by the second node, wherein the method comprises:
obtaining experience data generated as a result of controlling the agent to perform a sequence of one or more actions from a possible set of actions, each action being performed in response to receiving a respective observation characterizing a respective state of the environment;
using the experience data to update a visitation count for each of one or more state-action pairs represented by the graph model, wherein each state-action pair corresponds to a node and an outgoing edge of the node included in the graph model; and
at each of multiple environment exploration steps:
computing a utility measure for each of the one or more state-action pairs represented by the graph model, wherein computing the utility measure comprises evaluating a closed-form utility function using at least the updated visitation counts;
determining, based on the utility measures, a sequence of one or more planned actions that have an information gain that satisfies a threshold; and
controlling the agent to perform the sequence of one or more planned actions to cause the environment to transition from a state characterized by a last observation received after a last action in the experience data into a different state.
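To make the recited graph model concrete, the following Python is a minimal sketch of the data structure and the visitation-count update of the first two claim steps. The class name, the (state, action, next_state) tuple format of the experience data, and the dictionary layout are illustrative assumptions, not the patented implementation.

```python
from collections import defaultdict

class GraphModel:
    """Latent graph model: nodes represent environment states, and a directed
    edge from one node to another represents the action that caused the
    environment to transition between the corresponding states."""

    def __init__(self):
        # edges[state][action] = successor state observed for that state-action pair.
        self.edges = defaultdict(dict)
        # visitation_counts[(state, action)] = number of times the agent has
        # performed `action` while the environment was in `state`.
        self.visitation_counts = defaultdict(int)

    def update_from_experience(self, experience):
        """Fold experience data into the graph model.

        `experience` is assumed to be a sequence of
        (state, action, next_state) transitions gathered while
        controlling the agent.
        """
        for state, action, next_state in experience:
            self.edges[state][action] = next_state
            self.visitation_counts[(state, action)] += 1
```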
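A second sketch covers the environment exploration steps. The claim recites only that the utility measure is a closed-form function of the updated visitation counts and that the planned action sequence must have an information gain satisfying a threshold; the specific 1/(count + alpha) utility, the breadth-first planner, and the `env.step` interface below are assumptions chosen to make the loop runnable, not the claimed closed-form function or planner.

```python
from collections import deque

def utility(count, alpha=1.0):
    """Hypothetical closed-form, count-based utility: the expected information
    gain from revisiting a state-action pair decays with its visitation count."""
    return 1.0 / (count + alpha)

def plan_exploration(model, current_state, ig_threshold):
    """One exploration step: score every known state-action pair with the
    closed-form utility, then breadth-first search the graph for a short
    action sequence reaching a pair whose gain meets the threshold."""
    frontier = deque([(current_state, [])])
    visited = {current_state}
    while frontier:
        state, actions_so_far = frontier.popleft()
        for action, next_state in model.edges.get(state, {}).items():
            gain = utility(model.visitation_counts[(state, action)])
            if gain >= ig_threshold:
                # Planned sequence: walk to `state`, then take `action`.
                return actions_so_far + [action]
            if next_state not in visited:
                visited.add(next_state)
                frontier.append((next_state, actions_so_far + [action]))
    return None  # No sufficiently informative plan exists in the known graph.

def exploration_loop(model, env, state, ig_threshold, num_steps):
    """Run multiple exploration steps, controlling the agent to perform each
    planned sequence (`env.step` is a hypothetical interface returning the
    next observation) and folding the experience back into the model."""
    for _ in range(num_steps):
        plan = plan_exploration(model, state, ig_threshold)
        if plan is None:
            break
        experience = []
        for action in plan:
            next_state = env.step(action)
            experience.append((state, action, next_state))
            state = next_state
        model.update_from_experience(experience)
    return state
```

In this sketch the plan execution starts from the state characterized by the last observation in the experience data, matching the final claim step; actions never before taken from a state do not yet appear as edges, so in practice an exploration scheme of this kind would also need a rule for scoring untried actions (e.g., treating their count as zero).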