US 12,406,205 B2
Systems and methods for simulating a complex reinforcement learning environment
Tze Way Eugene Ie, Los Altos, CA (US); Sanmit Santosh Narvekar, Arcadia, CA (US); and Craig Edgar Boutilier, Palo Alto, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 17, 2022, as Appl. No. 17/967,595.
Application 17/967,595 is a continuation of application No. 16/288,279, filed on Feb. 28, 2019, granted, now 11,475,355.
Claims priority of provisional application 62/801,719, filed on Feb. 6, 2019.
Prior Publication US 2023/0117499 A1, Apr. 20, 2023
Int. Cl. G06N 20/00 (2019.01); G06N 5/043 (2023.01)
CPC G06N 20/00 (2019.01) [G06N 5/043 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computing system for training a machine-learned recommender system, the computing system comprising:
one or more processors;
one or more non-transitory computer-readable media that collectively store instructions that are executable by the one or more processors to cause the computing system to perform operations, the operations comprising:
providing, to the machine-learned recommender system, an entity profile for a simulated entity, wherein the entity profile comprises topics associated with a simulated human user;
obtaining, from the machine-learned recommender system, a recommendation for a resource for consumption by the simulated entity, wherein the topics are processed by the machine-learned recommender system to generate the recommendation;
inputting, to an entity model, data descriptive of the recommended resource and data descriptive of the entity profile, wherein the entity model is configured to receive the data descriptive of the recommended resource and the data descriptive of the entity profile and to output a simulated response value;
generating, by the entity model and based on the data descriptive of the recommended resource and the data descriptive of the entity profile, the simulated response value, which represents a simulated response of the simulated human user, wherein the simulated response value comprises at least one of:
a view status indicating that the simulated response of the simulated human user included viewing the resource,
an interaction status indicating that the simulated response of the simulated human user included interacting with the resource using a human-machine interface device, or
an interval indicating that the simulated response of the simulated human user included interacting with the resource for the interval; and
training one or more parameters of the machine-learned recommender system to increase a reward determined based on the simulated response value.
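For illustration only, the following Python sketch shows the kind of simulation loop recited in claim 1: an entity profile with topic interests is provided to a recommender, an entity model produces a simulated response value (view status, interaction status, and an interaction interval), and the recommender's parameters are updated to increase a reward derived from that value. All names here (EntityProfile, Recommender, entity_model, reward_from, and so on) are invented for this sketch and do not appear in the patent; the simple per-topic score update stands in for whatever machine-learned recommender the claim actually covers.

```python
# Hypothetical sketch of the simulation loop described in claim 1.
# Class and function names are invented for this illustration; they are
# not taken from the patent or from any Google library.

import random
from dataclasses import dataclass

N_TOPICS = 5


@dataclass
class EntityProfile:
    """Profile of a simulated entity: one interest weight per topic."""
    topic_interests: list  # weights in [0, 1], one per topic


@dataclass
class Resource:
    """A recommendable resource tagged with a single topic."""
    topic: int


@dataclass
class SimulatedResponse:
    """Simulated response value: view status, interaction status, interval."""
    viewed: bool
    interacted: bool
    interval: float  # simulated seconds of interaction


class Recommender:
    """Toy stand-in for the machine-learned recommender: a per-topic score vector."""

    def __init__(self):
        self.scores = [0.0] * N_TOPICS

    def recommend(self, profile: EntityProfile) -> Resource:
        # Blend learned scores with the topics in the provided entity profile.
        blended = [s + w for s, w in zip(self.scores, profile.topic_interests)]
        return Resource(topic=max(range(N_TOPICS), key=lambda t: blended[t]))

    def update(self, resource: Resource, reward: float, lr: float = 0.1):
        # Train the parameter for the recommended topic to increase the reward.
        self.scores[resource.topic] += lr * reward


def entity_model(resource: Resource, profile: EntityProfile) -> SimulatedResponse:
    """Maps the recommended resource and entity profile to a simulated response."""
    interest = profile.topic_interests[resource.topic]
    viewed = random.random() < interest
    interacted = viewed and random.random() < interest
    interval = interest * 60.0 if interacted else 0.0
    return SimulatedResponse(viewed, interacted, interval)


def reward_from(response: SimulatedResponse) -> float:
    """Reward determined based on the simulated response value."""
    return 0.1 * response.viewed + 0.5 * response.interacted + 0.01 * response.interval


def train(recommender: Recommender, steps: int = 1000):
    for _ in range(steps):
        profile = EntityProfile([random.random() for _ in range(N_TOPICS)])
        resource = recommender.recommend(profile)            # obtain recommendation
        response = entity_model(resource, profile)           # simulate the response
        recommender.update(resource, reward_from(response))  # increase the reward


if __name__ == "__main__":
    rec = Recommender()
    train(rec)
    print("learned topic scores:", [round(s, 2) for s in rec.scores])
```

The score-vector update above is deliberately minimal; the claim itself is agnostic to the recommender's architecture and covers training any of its parameters to increase a reward determined from the simulated response value.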