| CPC G06Q 30/0631 (2013.01) [G06F 9/505 (2013.01); G06Q 30/0643 (2013.01)] | 20 Claims |

|
1. A system comprising:
one or more processors; and
one or more non-transitory computer-readable media storing computing instructions that, when executed on the one or more processors, cause the one or more processors to perform:
receiving a user request that is input to a web application included in an online component via a graphical user interface, the user request corresponding to a user search query from a user for a product;
determining, by a request distributor included in the online component, whether a system bandit service is operating in a first processing mode or a second processing mode, wherein the request distributor maintains computing resource availability and response time agreements, wherein the system bandit service operates as middleware between the online component and an offline component, wherein the system bandit service comprises a first distributed system with a first queue, a second distributed system with a second queue, an explore module, and an exploit module, wherein the request distributor determines which of the first queue or the second queue is a shorter queue with more memory, and wherein when the request distributor determines that the first queue is the shorter queue, the request distributor transmits the user request to the first queue of the first distributed system;
transmitting the user request to: (1) the exploit module, via the second distributed system, when the first processing mode is a high processing mode and the second processing mode is a low processing mode, or (2) the explore module, via the first distributed system, when the first processing mode is the low processing mode and the second processing mode is the high processing mode;
using a process of a reward model that comprises stage-wise exploration and exploitation with an optimal design, wherein the optimal design leverages a D-optimal design to increase information gain during exploration compared to not using the D-optimal design, by determining a randomized strategy for a plurality of candidate recommendation systems based on a ratio of a number of the plurality of candidate recommendation systems, the randomized strategy to be stored in a collected history data;
when the system bandit service is determined to be operating in the low processing mode, analyzing the user request, via the offline component, using the plurality of candidate recommendation systems and generating respective predicted reward values and corresponding pre-computations associated with the reward model, to be stored in the collected history data;
when the system bandit service is determined to be operating in the high processing mode, determining, via the system bandit service, the candidate recommendation system from the plurality of candidate recommendation systems as a recommendation system with a maximum reward value for the reward model based on the user request and the collected history data;
processing, via the system bandit service, the user request with one or more of the recommendation system or the one or more candidate recommendation systems to identify recommended products to display to the user; and
transmitting instructions, from the system bandit service to an algorithm service included in the online component, to modify the graphical user interface to display the recommended products to the user and determining a reward value for the one or more of the recommendation system or the one or more candidate recommendation systems, to be stored in the collected history data.
|
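
For illustration only, the routing recited in claim 1 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `DistributedSystem`, `pick_queue`, `explore`, `exploit`, and `handle` are hypothetical, the tie-break on free memory is one possible reading of "a shorter queue with more memory", and a plain mean of logged rewards stands in for the reward model (the stage-wise D-optimal design recited in the claim is not modeled here).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DistributedSystem:
    """Hypothetical stand-in for one distributed system and its queue."""
    name: str
    free_memory_mb: int
    queue: deque = field(default_factory=deque)


def pick_queue(first, second):
    """Pick the system with the shorter queue; ties go to more free memory (assumption)."""
    return min((first, second), key=lambda s: (len(s.queue), -s.free_memory_mb))


def explore(request, candidates, history):
    """Low processing mode: score the request with every candidate and log the result."""
    for name, predict_reward in candidates.items():
        history.setdefault(name, []).append(predict_reward(request))


def exploit(candidates, history):
    """High processing mode: choose the candidate with the highest mean logged reward."""
    def mean_reward(name):
        rewards = history.get(name, [])
        # Candidates with no logged history rank last.
        return sum(rewards) / len(rewards) if rewards else float("-inf")
    return max(candidates, key=mean_reward)


def handle(request, mode, candidates, history):
    """Route by processing mode: 'low' explores, anything else exploits."""
    if mode == "low":
        explore(request, candidates, history)
        return None
    return exploit(candidates, history)


# Usage with two hypothetical candidate recommendation systems.
first = DistributedSystem("first", free_memory_mb=512)
second = DistributedSystem("second", free_memory_mb=256)
second.queue.append("earlier request")
target = pick_queue(first, second)
target.queue.append("running shoes")

candidates = {
    "popularity": lambda query: 0.2,      # fixed predicted rewards, for illustration
    "collaborative": lambda query: 0.7,
}
history = {}
handle("running shoes", "low", candidates, history)          # explore pass fills history
best = handle("running shoes", "high", candidates, history)  # exploit pass picks a system
```

In this sketch the explore pass only records predicted rewards in `history`, and the exploit pass reads them back to select a single candidate, mirroring the split between the low and high processing modes described in the claim.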