| CPC G06N 3/092 (2023.01) | 9 Claims |

|
1. A reinforcement learning device using a conditional episode configuration, the reinforcement learning device comprising:
a conditional episode configuration unit (100) configured to
extract N (≤ W) states by sampling from an arbitrary data set in which W states exist,
configure a condition under which an episode ends for arbitrary T (≤ N) states among the extracted states,
define an episode under the condition so that a currently valued range is determined and the episode is flexibly changed when rewards are calculated,
configure, based on the defined episode, a temporary episode of T steps for which the condition for terminating the episode is configured, and provide the configured temporary episode to a reinforcement learning agent (200), and
automatically define and reconfigure the episode so that, when the episode ends during training of the reinforcement learning agent (200) because the condition for the state, action, and reward is not satisfied at one of the T steps of the temporary episode, the sum of the rewards can be maximized based on the steps completed so far, in which training was performed well by satisfying the condition; and
the reinforcement learning agent (200) configured to determine an action so that the sum of rewards obtained from the T steps is maximized based on the episode input by the conditional episode configuration unit (100).
|
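The flow recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the names `build_temporary_episode`, `run_episode`, and `greedy_action` are hypothetical, states are modeled as plain floats, and a greedy one-step action choice stands in for the reinforcement learning agent (200), since the claim does not specify a learning algorithm. The sketch samples N of W states, keeps T of them as a temporary episode, and cuts the episode at the first step where the condition on (state, action, reward) fails, so that only the rewards of the steps completed so far are summed.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = float   # assumption: a state is a single number
Action = int    # assumption: actions are small integers


def build_temporary_episode(dataset: Sequence[State], n: int, t: int,
                            rng: random.Random) -> List[State]:
    """Sample N (<= W) states without replacement, then keep T (<= N) of them."""
    w = len(dataset)
    if not (0 < t <= n <= w):
        raise ValueError("require 0 < T <= N <= W")
    sampled = rng.sample(list(dataset), n)
    return sampled[:t]


def greedy_action(state: State, actions: Sequence[Action],
                  reward_fn: Callable[[State, Action], float]) -> Action:
    """Stand-in for the agent: pick the action with the highest immediate reward."""
    return max(actions, key=lambda a: reward_fn(state, a))


def run_episode(episode: Sequence[State], actions: Sequence[Action],
                reward_fn: Callable[[State, Action], float],
                condition: Callable[[State, Action, float], bool],
                ) -> Tuple[List[State], float]:
    """Step through the temporary episode and end it at the first step where
    the condition fails. Returns the reconfigured (shortened) episode and the
    sum of rewards over the steps that satisfied the condition."""
    kept: List[State] = []
    total = 0.0
    for state in episode:
        action = greedy_action(state, actions, reward_fn)
        reward = reward_fn(state, action)
        if not condition(state, action, reward):
            break  # episode ends here; later steps are not counted
        kept.append(state)
        total += reward
    return kept, total


# Hypothetical reward and termination condition, chosen only for illustration.
def example_reward(state: State, action: Action) -> float:
    return action - 0.1 * state


def example_condition(state: State, action: Action, reward: float) -> bool:
    return reward >= 0.5


rng = random.Random(0)
dataset = [float(i) for i in range(10)]                        # W = 10
episode = build_temporary_episode(dataset, n=6, t=4, rng=rng)  # N = 6, T = 4
kept, total = run_episode(episode, [0, 1], example_reward, example_condition)
```

Here `kept` is always a prefix of `episode`, which is one simple reading of "the step so far where training is performed well"; a real system would feed the shortened episode back into training rather than stopping after one pass.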