US 12,346,777 B2
Optimizing a machine for solving offline optimization problems
Alexander Zadorojniy, Haifa (IL); and Vladimir Lipets, Haifa (IL)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Sep. 25, 2020, as Appl. No. 17/032,142.
Prior Publication US 2022/0101177 A1, Mar. 31, 2022
Int. Cl. G06N 20/00 (2019.01); G06N 5/01 (2023.01); G06N 5/04 (2023.01); G06N 7/01 (2023.01)

CPC G06N 20/00 (2019.01) [G06N 5/04 (2013.01)]

15 Claims

1. A computer-implemented method for improving a machine operation, the method comprising:

receiving a plurality of domain specific heuristics, a set of states, and a set of actions, where an immediate cost and/or reward is associated with a pair of state and action, the domain specific heuristics including at least one of cruncher heuristics and feasibility heuristics;

generating at least one of:

a graph of state transitions for the actions, and

a transition probability matrix;

executing a Markov Decision Process (MDP) model for solving an MDP problem, and outputting an MDP optimal policy of an optimal mapping of a given state to an action, wherein the executing of the MDP includes executing an objective function for a plurality of MDP execution iterations until meeting an MDP stopping condition that includes no change in the results of input parameters of the domain specific heuristics;

using the MDP optimal policy as input to pre-train a reinforcement learning (RL) model and/or deep RL (DRL) model, wherein the MDP optimal policy reduces an amount of time required to pre-train the RL model and/or DRL model;

executing the pre-trained RL model and/or DRL model to obtain a recommended policy;

selecting one of the plurality of domain specific heuristics and heuristic input parameters of the recommended policy;

controlling the machine for solving a predefined optimization problem in a plurality of execution iterations, the predefined optimization problem selected from the group consisting of a mixed integer linear programming problem and a mixed integer programming problem, wherein each execution iteration includes:

using an outputted RL and/or DRL recommended action for a current state which includes the selected domain specific heuristic and its input parameters;

receiving a result for the optimization problem and calculating a next state;

upon the RL and/or DRL model determining that a predefined stopping condition is met, stopping the execution iterations of the machine for solving the predefined optimization problem;

upon the RL and/or DRL model determining that the predefined stopping condition has not been met, inputting the next state to the RL and/or DRL model to receive an optimal action for the next state for a next iteration of the machine for solving the predefined optimization problem.
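
For illustration only, the flow recited in claim 1 (solve an MDP, use the resulting policy to warm-start an RL model, then let that model pick a heuristic at each solver iteration until a stopping condition is met) can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The three-state toy problem, the transition probabilities, the unit costs, the names `value_iteration` and `run_control_loop`, and the tabular Q-learning update are all hypothetical. The MDP is solved here by ordinary value iteration with a numerical tolerance, which stands in for the claim's stopping condition on heuristic input parameters. The solver is simulated by sampling from the transition matrix rather than by running a real MILP/MIP solver.

```python
import numpy as np

def value_iteration(P, C, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Solve a cost-minimizing MDP by value iteration (an assumed stand-in solver).

    P[a, s, t] is the probability of moving from state s to state t under action a.
    C[s, a] is the immediate cost of taking action a in state s.
    Returns the state-action cost table Q and the greedy policy.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = C[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = C + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.min(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = C + gamma * np.einsum("ast,t->sa", P, V)
    return Q, Q.argmin(axis=1)

def run_control_loop(P, C, Q_init, start=0, goal=2, gamma=0.9,
                     alpha=0.1, max_iters=1000, seed=0):
    """Drive a simulated solver with a Q-table warm-started from the MDP solution."""
    rng = np.random.default_rng(seed)
    Q = Q_init.copy()                     # "pre-training": start from the MDP result
    state, trajectory = start, []
    for _ in range(max_iters):
        if state == goal:                 # predefined stopping condition (hypothetical)
            break
        action = int(Q[state].argmin())   # recommended heuristic for the current state
        # Simulated solver step: sample the next state from the transition matrix.
        next_state = int(rng.choice(P.shape[1], p=P[action, state]))
        # Tabular Q-learning update on costs (assumed RL model, not from the patent).
        td_target = C[state, action] + gamma * Q[next_state].min()
        Q[state, action] += alpha * (td_target - Q[state, action])
        trajectory.append((state, action))
        state = next_state
    return state, trajectory

# Hypothetical toy problem. States: 0 = large gap, 1 = small gap, 2 = solved.
# Actions: 0 = a "cruncher" heuristic, 1 = a "feasibility" heuristic.
P = np.array([
    [[0.2, 0.8, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]],  # action 0
    [[0.7, 0.3, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
C = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])        # one unit of cost per step

Q_mdp, policy = value_iteration(P, C)
final_state, trajectory = run_control_loop(P, C, Q_mdp)
print("MDP policy:", policy)
print("visited (state, action) pairs:", trajectory)
```

In this sketch the warm start means the first recommended action already follows the MDP policy, so the Q-table does not have to discover it by exploration. That is the intuition behind the claim's statement that the MDP optimal policy reduces pre-training time.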