CPC G06Q 30/0202 (2013.01) [G06N 20/00 (2019.01); G06Q 10/067 (2013.01)] | 15 Claims |
1. A method for performing a simulation, the method being implemented by at least one processor in a market simulation and calibration device, the method comprising:
assigning, by the at least one processor to each respective computer agent from among a plurality of computer agents, a type value that relates to a state of the respective computer agent, such that the plurality of computer agents have differing type values, each of the differing type values indicating a different probability distribution of risk aversion and connectivity to external clients;
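By way of a non-limiting illustration, the following minimal Python sketch shows how a type value can index a probability distribution of risk aversion and connectivity to external clients; the class names and the Beta/Bernoulli parameterization are hypothetical choices, not recited in the claim.

```python
import random
from dataclasses import dataclass

@dataclass
class Supertype:
    """Hypothetical type value: parameters of the distributions an agent samples from."""
    risk_aversion_alpha: float    # Beta(alpha, beta) over risk aversion in [0, 1]
    risk_aversion_beta: float
    client_connectivity_p: float  # Bernoulli probability of a link to each external client

def sample_agent_params(supertype: Supertype, n_clients: int):
    """Draw one agent's concrete risk aversion and client connections from its type."""
    risk_aversion = random.betavariate(supertype.risk_aversion_alpha,
                                       supertype.risk_aversion_beta)
    connections = [random.random() < supertype.client_connectivity_p
                   for _ in range(n_clients)]
    return risk_aversion, connections

# Agents with differing type values draw from differing distributions.
conservative = Supertype(risk_aversion_alpha=8.0, risk_aversion_beta=2.0,
                         client_connectivity_p=0.2)
aggressive = Supertype(risk_aversion_alpha=2.0, risk_aversion_beta=8.0,
                       client_connectivity_p=0.8)
print(sample_agent_params(conservative, n_clients=5))
print(sample_agent_params(aggressive, n_clients=5))
```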
receiving, by the at least one processor and from a plurality of servers over a network, computer agent-specific data for the plurality of computer agents, the plurality of computer agents including computer agents of different types that behave differently, wherein the computer agent-specific data include real-world data that include a market-based observation, a market-based action and a market-based reward;
acquiring, by the at least one processor and from a network database over the network, simulator parameters;
generating, by a simulation processor of the market simulation and calibration device and providing on a display of the market simulation and calibration device, a simulation based on the assigned type values, the acquired simulator parameters, and the received computer agent-specific data, with each of the plurality of computer agents being in a different state, and by using a shared policy that is shared by all of the plurality of computer agents, wherein the shared policy indicates a probability of a respective individual computer agent action for a corresponding state of the respective individual computer agent, wherein each of the plurality of computer agents uses the same shared policy, and wherein each of the plurality of computer agents is restricted to observing only its own state and action to achieve partial observability;
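A minimal sketch of such a shared policy follows; the linear-softmax parameterization and all identifiers are illustrative assumptions. Each agent queries the same parameter set but conditions only on its own state and type value, which yields the partial observability recited above.

```python
import numpy as np

class SharedPolicy:
    """One parameter set shared by all agents; each agent conditions only on
    its own (state, type) pair, giving partial observability."""
    def __init__(self, n_state_features: int, n_actions: int,
                 n_type_features: int, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        # Linear logits over [own state features, own type features].
        self.weights = self.rng.normal(
            scale=0.1, size=(n_state_features + n_type_features, n_actions))

    def action_probs(self, own_state: np.ndarray, own_type: np.ndarray) -> np.ndarray:
        """Probability of each action given this agent's own state and type."""
        logits = np.concatenate([own_state, own_type]) @ self.weights
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    def sample_action(self, own_state: np.ndarray, own_type: np.ndarray) -> int:
        probs = self.action_probs(own_state, own_type)
        return int(self.rng.choice(len(probs), p=probs))

# Every agent calls the *same* policy object; differing type vectors make
# differing agents behave differently under one shared parameter set.
policy = SharedPolicy(n_state_features=4, n_actions=3, n_type_features=2)
a_i = policy.sample_action(np.zeros(4), np.array([0.9, 0.2]))  # agent i
a_j = policy.sample_action(np.ones(4), np.array([0.1, 0.8]))   # agent j
```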
acquiring actual real-world agent-specific data corresponding to a target locality;
performing first reinforcement learning calibration on the simulation processor of the market simulation and calibration device for performing the simulation,
wherein the first reinforcement learning calibration is performed using the actual real-world data to first constrain shared equilibria to match a specific real-world target value,
wherein the first reinforcement learning calibration using the actual real-world data specifies a distribution of different computer agent types to correspond to the first constrained shared equilibria,
wherein the distribution of the different computer agent types is reflected on the market simulation and calibration device,
wherein the specific real-world target value is different for each of the plurality of computer agents, such that the plurality of computer agents collectively satisfy certain constraints,
wherein the first reinforcement learning calibration includes modifying at least one of the type values assigned to the plurality of computer agents based on a result of the simulation until a calibration target is reached, and
wherein the first reinforcement learning calibration is performed by:
inputting learning rates (β_m^cal), (β_m^shared) satisfying a target condition, initial calibrator and shared policies π_0^Λ, π_0, and an initial supertype profile Λ_0^b = Λ_0 across episodes b ∈ [1, B],
while π_m^Λ, π_m not converged do, wherein π_m^Λ and π_m are the calibrator and shared policies for stage m,
for each episode b ∈ [1, B] do,
sample a supertype increment δΛ^b ∼ π_m^Λ(·|Λ_{m-1}^b) and set Λ_m^b := Λ_{m-1}^b + δΛ^b,
sample a multi-agent episode with supertype profile Λ_m^b and shared policy π_m, with λ_i ∼ p_{Λ_m^b} and a_t^(i) ∼ π_m(·|·, λ_i), i ∈ [1, n],
update π_m with learning rate β_m^shared based on a gradient of a first equation over the episodes b ∈ [1, B],
update π_m^Λ with learning rate β_m^cal based on a gradient of a second equation over the episodes b ∈ [1, B],
the target condition specifies that the learning rates (β_m^cal), (β_m^shared) satisfy
β_m^cal / β_m^shared → 0 as m → ∞,
as well as the Robbins-Monro conditions, that is, their respective sums are infinite and the sums of their squares are finite,
the first equation specifies:
(1/B) Σ_{b=1}^{B} ∇_{θ_1} V_{Λ_m^b}(π_{θ_1}, π_{θ_2}) |_{θ_1 = θ_2 = θ_m},
wherein the first equation indicates that the shared policy is a Nash equilibrium of the 2-player symmetric game with payoff V, wherein a first player receives V(π_1, π_2) while the other receives V(π_2, π_1), and wherein ∇_{θ_1} V(π_θ, π_θ) corresponds to trying to improve the utility of the first player while keeping the second player fixed, starting from the symmetric point (π_θ, π_θ), and
the second equation specifies:
(1/B) Σ_{b=1}^{B} R^cal(Λ_m^b) ∇ ln π_m^Λ(δΛ^b | Λ_{m-1}^b),
wherein R^cal(Λ_m^b) denotes a calibration reward of episode b, and wherein the second equation optimizes an objective of the stage m via the calibrator's policy π^Λ;
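Read as pseudocode, the calibration loop above admits a compact two-timescale sketch. The Python below is a toy stand-in, not the claimed implementation: the one-dimensional simulator, the Gaussian calibrator policy, and the reward shapes are assumptions, while the learning-rate schedules are one concrete choice satisfying both the Robbins-Monro conditions and β_m^cal/β_m^shared → 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D stand-ins: theta is the shared-policy parameter, mu is the
# calibrator-policy mean, Lam[b] is the supertype profile for episode b.
TARGET, B = 3.0, 16
theta, mu = 0.0, 0.0
Lam = np.zeros(B)                                # Lambda_0^b = Lambda_0

shared_lr = lambda m: 0.2 / m ** 0.6             # Robbins-Monro: sum diverges,
cal_lr = lambda m: 0.05 / m ** 0.9               # squared sum converges; cal/shared -> 0
sigma = lambda m: 0.3 / m ** 0.6                 # decaying calibrator exploration noise

for m in range(1, 2001):
    # Sample supertype increments dLam^b ~ pi^Lambda(.|Lam_{m-1}^b), set Lam_m^b.
    dLam = rng.normal(mu, sigma(m), size=B)
    Lam = Lam + dLam
    # Toy "multi-agent episode": utility peaks when theta matches the supertype,
    # and the calibrated observable is Lam + theta.
    utility_grad = -2.0 * (theta - Lam)          # stand-in for grad_theta1 V(pi, pi)
    r_cal = -(Lam + theta - TARGET) ** 2         # reward for matching the target value
    # Two-timescale updates: shared policy on the fast schedule (first equation),
    # calibrator via a REINFORCE-style gradient on the slow schedule (second equation).
    theta += shared_lr(m) * utility_grad.mean()
    score = (dLam - mu) / sigma(m) ** 2          # grad_mu log N(dLam; mu, sigma^2)
    mu += cal_lr(m) * ((r_cal - r_cal.mean()) * score).mean()

print(f"theta={theta:.2f}  mean Lambda={Lam.mean():.2f}  "
      f"mean observable={(Lam + theta).mean():.2f}  target={TARGET}")
```

Because cal_lr decays faster than shared_lr, the shared policy effectively equilibrates against a slowly moving supertype profile, which is the point of the two-timescale target condition.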
updating the shared policy for the generating of the simulation by the simulation processor of the market simulation and calibration device based on a result of the first reinforcement learning calibration including the modifying of the at least one of the type values assigned to the plurality of computer agents, wherein the updated shared policy modifies the probability of the respective individual computer agent action for the corresponding state of the respective individual computer agent based on the distribution of the different computer agent types corresponding to the first constrained shared equilibria for at least one of the plurality of computer agents;
regenerating, by the simulation processor of the market simulation and calibration device, the simulation using the updated shared policy for obtaining a different output; and
performing second reinforcement learning calibration on the simulation processor of the market simulation and calibration device based on the first reinforcement learning calibration to modify the distribution of different computer agent types differently from the distribution of different computer agent types corresponding to the first reinforcement learning calibration, and to second constrain the shared equilibria more accurately than the first constraining of the shared equilibria in order to more closely match the specific real-world target value corresponding to the target locality, wherein the modified distribution of the different computer agent types is updated on the market simulation and calibration device.
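Under the same toy assumptions as the sketch above, the second reinforcement learning calibration can be illustrated as a further pass that starts from the supertype distribution and shared policy produced by the first pass; the function below merely repackages the earlier loop and is likewise hypothetical, not the claimed implementation.

```python
import numpy as np

def calibration_pass(theta, mu, Lam, target, n_stages, rng):
    """One reinforcement-learning calibration pass (same toy updates as above)."""
    for m in range(1, n_stages + 1):
        sig = 0.3 / m ** 0.6
        dLam = rng.normal(mu, sig, size=Lam.size)
        Lam = Lam + dLam
        utility_grad = -2.0 * (theta - Lam)
        r_cal = -(Lam + theta - target) ** 2
        theta += (0.2 / m ** 0.6) * utility_grad.mean()
        score = (dLam - mu) / sig ** 2
        mu += (0.05 / m ** 0.9) * ((r_cal - r_cal.mean()) * score).mean()
    return theta, mu, Lam

rng = np.random.default_rng(1)
# First calibration constrains the shared equilibria toward the target...
theta, mu, Lam = calibration_pass(0.0, 0.0, np.zeros(16), target=3.0,
                                  n_stages=500, rng=rng)
# ...and the second calibration starts from the first pass's type distribution
# and shared policy, tightening the match to the target locality's data.
theta, mu, Lam = calibration_pass(theta, mu, Lam, target=3.0,
                                  n_stages=2000, rng=rng)
```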