US 12,437,230 B2
Systems and methods for risk-sensitive reinforcement learning
Nelson Vadori, New York, NY (US); Sumitra Ganesh, Short Hills, NJ (US); and Maria Manuela Veloso, Pittsburgh, PA (US)
Assigned to JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed by JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed on Jan. 21, 2021, as Appl. No. 17/154,825.
Claims priority of provisional application 62/965,428, filed on Jan. 24, 2020.
Prior Publication US 2021/0232970 A1, Jul. 29, 2021
Int. Cl. G06N 20/00 (2019.01); G06Q 40/06 (2012.01)
CPC G06N 20/00 (2019.01) [G06Q 40/06 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A method, comprising:
a risk-sensitive learning engine comprising at least one computer processor receiving, from a data source, a plurality of sets of training data for a plurality of time steps, each set of training data comprising an initial state, an action comprising a trade, a reward, and a state at the next time step;
the risk-sensitive learning engine generating a correction factor for a risk-sensitive policy using a Q-learning process by:
initializing a Q table;
receiving a training budget comprising a plurality of episodes, a risk aversion coefficient, and an end state; and
for each episode in the training budget:
setting the end state for the episode;
setting a time to zero and a state to an initial state;
executing the action for time t and monitoring a reward at time t+1 and a state at time t+1, wherein the reward at time t+1 and the state at time t+1 are results of the action at time t;
the risk-sensitive learning engine increasing time t by one;
the risk-sensitive learning engine executing the action for time t and monitoring a reward at time t+1 and a state at time t+1, wherein the reward at time t+1 and the state at time t+1 are results of the action at time t;
the risk-sensitive learning engine calculating an average reward over time t;
the risk-sensitive learning engine calculating the correction factor based on the reward at time t+1, the average reward over time, and the risk aversion coefficient, wherein the correction factor minimizes stochasticity for the reward based on the risk aversion coefficient; and
repeating the steps of increasing time t by one, executing the action for time t and monitoring the reward at time t+1 and the state at time t+1, calculating the average reward over time t, and calculating the correction factor based on the reward at time t+1, the average reward over time, and the risk aversion coefficient until the end state is met;
the risk-sensitive learning engine outputting a trained risk-sensitive policy function with the correction factor to a risk engine;
the risk engine receiving real-time data from the data source;
the risk engine applying the trained risk-sensitive policy function with the correction factor to the real-time data;
the risk engine executing an action based on an output of the applying.
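
The training loop recited in claim 1 maps onto standard tabular Q-learning with a modified per-step reward. Below is a minimal Python sketch, assuming a quadratic correction (the claim does not fix the functional form of the correction factor): each reward is penalized by the risk aversion coefficient times the squared deviation of that reward from its running average, which discourages high-variance (stochastic) rewards. The gym-style env interface, the hyperparameters alpha, gamma, and epsilon, and the epsilon-greedy exploration rule are illustrative assumptions, not part of the claim.

    import numpy as np

    def train_risk_sensitive_q(env, n_episodes, risk_aversion,
                               alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-learning with a risk-sensitive reward correction.

        Hypothetical env follows a gym-style API: reset() returns an
        integer state index; step(action) returns (next_state, reward,
        done), where done signals that the end state is met.
        """
        Q = np.zeros((env.n_states, env.n_actions))   # initialize the Q table
        for _ in range(n_episodes):                   # the training budget
            state = env.reset()                       # initial state, t = 0
            t, reward_sum, done = 0, 0.0, False
            while not done:                           # until the end state is met
                # epsilon-greedy choice of the action (e.g., a trade)
                if np.random.rand() < epsilon:
                    action = np.random.randint(env.n_actions)
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, done = env.step(action)  # reward/state at t+1
                t += 1
                reward_sum += reward
                avg_reward = reward_sum / t           # average reward over time t
                # Correction factor (assumed quadratic form): penalize the
                # deviation of the reward from its running mean, scaled by
                # the risk aversion coefficient.
                corrected = reward - risk_aversion * (reward - avg_reward) ** 2
                target = corrected + gamma * np.max(Q[next_state]) * (not done)
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q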
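Once trained, the risk-sensitive policy function (here, the greedy policy induced by the Q table, with the correction already reflected in the learned values) can be applied to live data as the claim's risk engine does. A sketch, where encode_state, market_stream, and execute are hypothetical stand-ins for the risk engine's data-source and order-execution interfaces:

    import numpy as np

    def apply_risk_sensitive_policy(Q, market_stream, encode_state, execute):
        """Apply the trained risk-sensitive policy to real-time data."""
        for tick in market_stream:             # real-time data from the data source
            state = encode_state(tick)         # map market data to a Q-table index
            action = int(np.argmax(Q[state]))  # output of applying the policy
            execute(action)                    # execute the action, e.g., a trade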