| CPC G06N 20/00 (2019.01) [G06N 5/04 (2013.01)] | 15 Claims |

|
1. A computer-implemented method for improving a machine operation, the method comprising:
receiving a plurality of domain specific heuristics, a set of states, and a set of actions, where an immediate cost and/or reward is associated with a pair of state and action, the domain specific heuristics including at least one of cruncher heuristics and feasibility heuristics;
generating at least one of:
a graph of state transitions for the actions, and
a transition probability matrix;
executing a Markov Decision Process (MDP) model for solving an MDP problem, and outputting an MDP optimal policy of an optimal mapping of a given state to an action, wherein the executing of the MDP includes executing an objective function for a plurality of MDP execution iterations until meeting an MDP stopping condition that includes no change in the results of input parameters of the domain specific heuristics;
using the MDP optimal policy as input to pre-train a reinforcement learning (RL) model and/or deep RL (DRL) model, wherein the MDP optimal policy reduces an amount of time required to pre-train the RL model and/or DRL model;
executing the pre-trained RL model and/or DRL model to obtain a recommended policy;
selecting one of the plurality of domain specific heuristics and heuristic input parameters of the recommended policy;
controlling the machine for solving a predefined optimization problem in a plurality of execution iterations, the predefined optimization problem selected from the group consisting of a mixed integer linear programming problem and a mixed integer programming problem, wherein each execution iteration includes:
using an outputted RL and/or DRL recommended action for a current state which includes the selected domain specific heuristic and its input parameters;
receiving a result for the optimization problem and calculating a next state;
upon the RL and/or DRL model determining that a predefined stopping condition is met, stopping the execution iterations of the machine for solving the predefined optimization problem; and
upon the RL and/or DRL model determining that the predefined stopping condition has not been met, inputting the next state to the RL and/or DRL model to receive an optimal action for the next state for a next iteration of the machine for solving the predefined optimization problem.
|
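
The flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a toy three-state problem, a tabular Q-learning model standing in for the RL and/or DRL model, and a simulated transition standing in for the machine that solves the mixed integer program. The names `solve_mdp`, `run_control_loop`, `P`, and `R` are hypothetical. The two actions stand for two domain specific heuristics, and state 2 stands for "predefined stopping condition met". The MDP stopping test used here (no further change in the policy or values) is only a stand-in for the claimed condition on the heuristic input parameters.

```python
import numpy as np


def solve_mdp(P, R, gamma=0.9, max_iters=1000):
    """Value iteration. P has shape (actions, states, states); R has shape (states, actions)."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    policy = np.full(n_states, -1)
    Q = np.zeros_like(R)
    for _ in range(max_iters):
        # Q[s, a] = R[s, a] + gamma * sum over s' of P[a, s, s'] * V[s']
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        new_policy = Q.argmax(axis=1)
        new_V = Q.max(axis=1)
        # Assumed stopping condition: policy and values no longer change.
        converged = np.array_equal(new_policy, policy) and np.allclose(new_V, V, atol=1e-8)
        policy, V = new_policy, new_V
        if converged:
            break
    return policy, Q


def run_control_loop(Q_init, P, R, start=0, terminal=2,
                     alpha=0.1, gamma=0.9, max_steps=100, seed=0):
    """Warm-start a Q-table from the MDP solution, then act until the stop state."""
    rng = np.random.default_rng(seed)
    Q = Q_init.copy()  # "pre-training": the MDP result seeds the RL model
    state, steps = start, 0
    while state != terminal and steps < max_steps:
        action = int(Q[state].argmax())  # recommended action (heuristic choice)
        # Stand-in for running the solver and computing the next state.
        next_state = int(rng.choice(P.shape[1], p=P[action, state]))
        td_target = R[state, action] + gamma * Q[next_state].max()
        Q[state, action] += alpha * (td_target - Q[state, action])
        state, steps = next_state, steps + 1
    return state, steps, Q


# Toy transition probability matrix: heuristic 0 advances with p=0.8, heuristic 1 with p=0.5.
P = np.zeros((2, 3, 3))
for a, p_move in enumerate([0.8, 0.5]):
    for s in range(2):
        P[a, s, s + 1] = p_move
        P[a, s, s] = 1.0 - p_move
    P[a, 2, 2] = 1.0  # state 2 is absorbing

# Immediate cost per (state, action) pair, expressed as negative reward.
R = np.array([[-1.0, -2.0], [-1.0, -2.0], [0.0, 0.0]])

policy, Q_mdp = solve_mdp(P, R)
final_state, steps, Q_rl = run_control_loop(Q_mdp, P, R)
print("MDP policy:", policy.tolist())
print("stopped in state", final_state, "after", steps, "iterations")
```

Seeding the Q-table with the MDP solution means the first recommended actions already follow the MDP optimal policy, which is the intuition behind the claimed reduction in pre-training time. A deep RL model would replace the table with a network fitted to the same state and action pairs.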