US 12,443,854 B2
Reinforcement learning device and method using conditional episode configuration
Cheol-Kyun Rho, Seoul (KR); Seong-Ryeong Lee, Seoul (KR); Ye-Rin Min, Namyangju-si (KR); and Pham-Tuyen Le, Suwon-si (KR)
Assigned to AGILESODA INC., Seoul (KR)
Appl. No. 17/926,277
Filed by AGILESODA INC., Seoul (KR)
PCT Filed Aug. 21, 2020, PCT No. PCT/KR2020/011169
§ 371(c)(1), (2) Date Nov. 18, 2022,
PCT Pub. No. WO2021/235603, PCT Pub. Date Nov. 25, 2021.
Claims priority of application No. 10-2020-0061890 (KR), filed on May 22, 2020.
Prior Publication US 2023/0206079 A1, Jun. 29, 2023
Int. Cl. G06N 3/092 (2023.01)
CPC G06N 3/092 (2023.01)
9 Claims
 
1. A reinforcement learning device using a conditional episode configuration, the reinforcement learning device comprising:
a conditional episode configuration unit (100) configured to
extract a plurality of N(≤W) states through sampling from an arbitrary data set in which W units of state exist,
configure a condition in which an episode ends for arbitrary T(≤N) states among the extracted states,
define an episode in the condition so that a currently valued range is determined and the episode is flexibly changed when rewards are calculated,
configure a temporary episode based on the episode defined by T steps in which a condition for terminating the episode is configured, and provide the configured temporary episode to a reinforcement learning agent (200), and
automatically define and reconfigure the episode when, during training of the reinforcement learning agent (200) on the steps of the temporary episode, the episode ends because the condition for the state, action, and reward is not satisfied within the T steps, so that the sum of the rewards can be maximized based on the steps so far in which training was performed well by satisfying the condition; and
the reinforcement learning agent (200) configured to determine an action so that the sum of rewards obtained from the T steps is maximized based on the episode input by the conditional episode configuration unit (100).
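
By way of non-limiting illustration only, the flow recited in claim 1 might be sketched as follows: N (≤ W) states are sampled from a data set, up to T (≤ N) of them are walked as a temporary episode, and the episode is cut at the first step whose state, action, and reward fail the termination condition. The names `sample_states`, `run_conditional_episode`, `episode_return`, and `Step`, as well as the use of plain callables for the policy, reward, and condition, are hypothetical and do not appear in the patent; this is a sketch under those assumptions, not the claimed implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    """One (state, action, reward) step of an episode (hypothetical type)."""
    state: float
    action: int
    reward: float


def sample_states(dataset: Sequence[float], n: int, rng: random.Random) -> List[float]:
    """Draw N (<= W) states without replacement from a data set of W states."""
    if n > len(dataset):
        raise ValueError("N must not exceed W, the number of states in the data set")
    return rng.sample(list(dataset), n)


def run_conditional_episode(
    states: Sequence[float],
    t: int,
    policy: Callable[[float], int],
    reward_fn: Callable[[float, int], float],
    condition: Callable[[float, int, float], bool],
) -> List[Step]:
    """Walk at most T (<= N) states; end the episode early when the condition fails.

    Only the steps that satisfied the condition are kept, so the returned
    episode is the prefix on which rewards are later summed.
    """
    if t > len(states):
        raise ValueError("T must not exceed N, the number of sampled states")
    steps: List[Step] = []
    for state in states[:t]:
        action = policy(state)
        reward = reward_fn(state, action)
        if not condition(state, action, reward):
            break  # termination condition met: the episode ends here
        steps.append(Step(state, action, reward))
    return steps


def episode_return(steps: Sequence[Step]) -> float:
    """Sum of rewards over the steps kept in the episode."""
    return sum(step.reward for step in steps)
```

In this sketch the termination condition is an ordinary predicate, so the episode length is not fixed in advance: a training loop could call `run_conditional_episode` again with the surviving steps, or with a different T, to reconfigure the episode. How the agent (200) updates its policy from the resulting return is left out, since the claim does not tie it to a particular learning algorithm.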