US 12,468,920 B2
Feedback driven decision support in partially observable settings
Sohini Upadhyay, Cambridge, MA (US); Yasaman Khazaeni, Needham, MA (US); Djallel Bouneffouf, Wappingers Falls, NY (US); and Mayank Agarwal, Cambridge, MA (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Jul. 29, 2019, as Appl. No. 16/524,291.
Prior Publication US 2021/0034926 A1, Feb. 4, 2021
Int. Cl. G06N 3/006 (2023.01); G06N 5/043 (2023.01); G06N 7/00 (2023.01)
CPC G06N 3/006 (2013.01) [G06N 5/043 (2013.01); G06N 7/00 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
a) receiving, by a reinforcement learning agent that is a machine learning system of a computer, initially observable feature values corresponding to a list of initially observable features, a list of initially unobservable features, and a maximum number of features that are observable per iteration of a first cycle, whereby the reinforcement learning agent comprises a reinforcement learning policy that governs selection performed by the reinforcement learning agent, and whereby the reinforcement learning policy comprises a first reinforcement learning policy for feature selection and a second reinforcement learning policy for action selection;
b) using, by the reinforcement learning agent implementing the first reinforcement learning policy, the initially observable feature values and the maximum number of features that are observable per the iteration of the first cycle to select a group of some of the initially unobservable features to explore, wherein the first reinforcement learning policy comprises a contextual combinatorial bandit algorithm and the using comprises applying the received initially observable feature values and the maximum number to the contextual combinatorial bandit algorithm, and the contextual combinatorial bandit algorithm, in response, selects the group of some of the initially unobservable features to explore;
c) receiving, by the reinforcement learning agent, values for the selected group of the initially unobservable features;
d) using, by the reinforcement learning agent implementing the second reinforcement learning policy, the received initially observable feature values along with the received values for the selected group of the initially unobservable features to select a next action from a set of actions, wherein the second reinforcement learning policy comprises a contextual bandit algorithm and the using comprises applying to the contextual bandit algorithm the received initially observable feature values and the received values of the selected group of the initially unobservable features, and the contextual bandit algorithm, in response, selects the next action from the set of actions;
e) receiving, via the reinforcement learning agent, feedback regarding performance of the selected next action;
f) updating, by the reinforcement learning agent and based on the received feedback, one or more parameters of the contextual combinatorial bandit algorithm of the first reinforcement learning policy for feature selection and one or more parameters of the contextual bandit algorithm of the second reinforcement learning policy for action selection; and
g) repeating step b) through step f) over a settable number of iterations, whereby the first cycle comprises step b) through step f).
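
To illustrate the cycle recited in claim 1, the following is a minimal Python sketch of one way the two-policy loop could be realized. It is not the patent's implementation: it assumes LinUCB-style linear contextual bandits for both policies, uses a greedy top-k pick over per-feature arm scores as a simple stand-in for the claimed contextual combinatorial bandit, and relies on a hypothetical `env` object exposing `observe`, `reveal`, and `act` methods. All class, function, and variable names are illustrative.

```python
# Sketch of the claimed two-policy cycle (steps a) through g)).
# Assumptions: LinUCB-style linear bandits; greedy top-k feature pick as a
# stand-in for a contextual combinatorial bandit; hypothetical `env` interface.
import numpy as np


class LinUCBArm:
    """One arm of a LinUCB-style linear contextual bandit."""

    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)       # ridge-regression design matrix
        self.b = np.zeros(dim)     # reward-weighted context accumulator
        self.alpha = alpha         # exploration weight

    def score(self, x):
        # Upper-confidence estimate of this arm's reward for context x.
        theta = np.linalg.solve(self.A, self.b)
        return theta @ x + self.alpha * np.sqrt(x @ np.linalg.solve(self.A, x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x


def run_cycle(env, observable_idx, unobservable_idx, k_max, actions, n_iters):
    obs_dim = len(observable_idx)
    # First policy (feature selection): one arm per initially unobservable
    # feature, scored on the observable values (step b).
    feature_arms = {f: LinUCBArm(obs_dim) for f in unobservable_idx}
    # Second policy (action selection): one arm per candidate action, scored
    # on the observable values concatenated with the explored values (step d).
    action_arms = {a: LinUCBArm(obs_dim + k_max) for a in actions}

    for _ in range(n_iters):                          # step g: repeat the cycle
        x_obs = env.observe(observable_idx)           # step a: observable values
        # Step b: select up to k_max initially unobservable features to explore.
        ranked = sorted(unobservable_idx,
                        key=lambda f: feature_arms[f].score(x_obs),
                        reverse=True)
        chosen = ranked[:k_max]
        x_hidden = env.reveal(chosen)                 # step c: revealed values
        context = np.concatenate([x_obs, x_hidden])
        # Step d: select the next action from the combined context.
        action = max(actions, key=lambda a: action_arms[a].score(context))
        reward = env.act(action)                      # step e: feedback
        # Step f: update the parameters of both policies from the feedback.
        for f in chosen:
            feature_arms[f].update(x_obs, reward)
        action_arms[action].update(context, reward)
```

A fuller implementation would typically replace the greedy top-k stand-in with a true contextual combinatorial bandit policy over feature subsets and a richer feedback model, but the data flow of steps b) through f) and the shared use of the reward to update both policies are the same.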