US 12,481,702 B2
Fast exploration and learning of latent graph models
Sivaramakrishnan Swaminathan, Mountain View, CA (US); Meet Kirankumar Dave, Santa Clara, CA (US); Miguel Lazaro-Gredilla, Union City, CA (US); and Dileep George, Sunnyvale, CA (US)
Assigned to GDM Holding LLC, Mountain View, CA (US)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Nov. 25, 2024, as Appl. No. 18/959,423.
Application 18/959,423 is a continuation of application No. 18/373,870, filed on Sep. 27, 2023, granted, now Pat. No. 12,189,688.
Claims priority of provisional application 63/436,845, filed on Jan. 3, 2023.
Claims priority of provisional application 63/411,031, filed on Sep. 28, 2022.
Prior Publication US 2025/0165532 A1, May 22, 2025
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/901 (2019.01); G06F 17/12 (2006.01)
CPC G06F 16/9024 (2019.01) [G06F 17/12 (2013.01)] 20 Claims
 
1. A method performed by one or more computers, wherein the method comprises:
obtaining experience data generated as a result of controlling an agent in an environment to perform a sequence of one or more actions from a set of possible actions, each action being performed in response to receiving a respective observation characterizing a respective state of the environment;
using the experience data to update a visitation count for each of one or more state-action pairs, wherein each state-action pair represents a respective state of the environment and a respective action performed by the agent; and
performing an environment exploration step to explore the environment, the performing comprising:
computing a utility measure for each of the one or more state-action pairs, wherein computing the utility measure comprises evaluating a closed form utility function using at least the updated visitation counts;
determining, based on the utility measures, a sequence of one or more planned actions that have an information gain that satisfies a threshold; and
controlling the agent to perform the sequence of one or more planned actions to cause the environment to transition from a state characterized by a last observation received after a last action in the experience data into a different state.
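
The method of claim 1 amounts to a count-based exploration loop: collect experience, update per-(state, action) visitation counts, score every state-action pair with a closed-form utility computed from those counts, plan an action sequence whose information gain satisfies a threshold, and execute it. The following is a minimal sketch of that loop, assuming a tabular environment with discrete states and actions, a Dirichlet-categorical transition model whose pseudo-counts serve as the visitation counts, expected information gain of the Dirichlet posterior as the closed-form utility, and exhaustive search over short action sequences under a maximum-a-posteriori rollout as the planner. None of these choices, nor any name below (CountBasedExplorer, expected_info_gain, and so on), comes from the patent; they are illustrative assumptions only.

import itertools

import numpy as np
from scipy.special import digamma, gammaln


def dirichlet_kl(beta, alpha):
    # Closed-form KL(Dir(beta) || Dir(alpha)).
    b0, a0 = beta.sum(), alpha.sum()
    return (gammaln(b0) - gammaln(a0)
            - np.sum(gammaln(beta) - gammaln(alpha))
            + np.sum((beta - alpha) * (digamma(beta) - digamma(b0))))


def expected_info_gain(alpha):
    # Closed-form expected information gain from observing one more
    # transition: KL(posterior || prior) averaged over the
    # posterior-predictive distribution of the next state.
    a0 = alpha.sum()
    gain = 0.0
    for k in range(len(alpha)):
        beta = alpha.copy()
        beta[k] += 1.0
        gain += (alpha[k] / a0) * dirichlet_kl(beta, alpha)
    return gain


class CountBasedExplorer:
    def __init__(self, n_states, n_actions, prior=1.0):
        # Dirichlet pseudo-counts over next states; the visitation count
        # for (s, a) is counts[s, a].sum() minus the prior mass.
        self.counts = np.full((n_states, n_actions, n_states), prior)
        self.n_actions = n_actions

    def update(self, experience):
        # Update visitation counts from (state, action, next_state) triples.
        for s, a, s2 in experience:
            self.counts[s, a, s2] += 1.0

    def utilities(self):
        # Evaluate the closed-form utility for every state-action pair.
        n_states, n_actions, _ = self.counts.shape
        return np.array([[expected_info_gain(self.counts[s, a])
                          for a in range(n_actions)]
                         for s in range(n_states)])

    def plan(self, state, horizon=3, threshold=1e-3):
        # Search all action sequences up to `horizon` steps and return the
        # one with the highest summed utility along a maximum-a-posteriori
        # rollout, provided its gain satisfies the threshold.
        util = self.utilities()
        map_next = self.counts.argmax(axis=2)  # most likely next state
        best_seq, best_gain = None, -np.inf
        for seq in itertools.product(range(self.n_actions), repeat=horizon):
            s, gain = state, 0.0
            for a in seq:
                gain += util[s, a]
                s = map_next[s, a]
            if gain > best_gain:
                best_seq, best_gain = list(seq), gain
        return best_seq if best_gain >= threshold else []


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_actions = 5, 2
    # Hidden dynamics the explorer must discover.
    true_T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    explorer = CountBasedExplorer(n_states, n_actions)
    state = 0
    for _ in range(20):
        actions = explorer.plan(state)
        if not actions:  # nothing informative left to learn
            break
        batch = []
        for a in actions:  # control the agent, recording experience
            next_state = rng.choice(n_states, p=true_T[state, a])
            batch.append((state, a, next_state))
            state = next_state
        explorer.update(batch)
    # Observed visitation counts per (s, a), prior mass removed.
    print(explorer.counts.sum(axis=2) - n_states)

The Dirichlet-categorical assumption is what makes the utility closed-form in this sketch: both the KL divergence between Dirichlet distributions and its posterior-predictive average reduce to gamma and digamma terms of the counts, so evaluating the utility requires no sampling or numerical integration.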