US 11,941,668 B2
Ad exchange bid optimization with reinforcement learning
Danny Portman, Atlanta, GA (US); Zachary D. Jones, Atlanta, GA (US); and David Rose, Atlanta, GA (US)
Assigned to Zeta Global Corp., New York, NY (US)
Filed by Zeta Global Corp., New York, NY (US)
Filed on Feb. 28, 2023, as Appl. No. 18/115,380.
Application 18/115,380 is a continuation of application No. 17/379,959, filed on Jul. 19, 2021, granted, now 11,645,679.
Claims priority of provisional application 63/053,705, filed on Jul. 19, 2020.
Prior Publication US 2023/0206285 A1, Jun. 29, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06Q 30/0273 (2023.01); G06N 3/045 (2023.01); G06N 3/084 (2023.01); G06Q 30/0204 (2023.01); G06Q 30/0241 (2023.01); G06Q 30/0242 (2023.01); G06Q 30/0251 (2023.01)
CPC G06Q 30/0275 (2013.01) [G06N 3/045 (2023.01); G06N 3/084 (2013.01); G06Q 30/0246 (2013.01); G06Q 30/0205 (2013.01); G06Q 30/0249 (2013.01); G06Q 30/0256 (2013.01); G06Q 30/0276 (2013.01)]
20 Claims
 
1. A machine learning system comprising:
one or more processors; and
a memory storing instructions that, when executed by at least one processor in the one or more processors, cause the at least one processor to perform operations comprising at least:
receive a request to deploy a piece of content to available inventory;
receive a bid result for a bid on the available inventory, the bid submitted to a real time bid server in response to the request;
store state data including a sequence of bids sent to the real time bid server, the bid result, and a response rate for the available inventory;
train a bidding model by:
using a first neural network, determining a plurality of action Q-values based on the state data, the plurality of action Q-values including at least one Q-value for each possible action at a current state of the bid server;
selecting an action based on a maximum action Q-value;
using a second machine learning model, determining a target Q-value for the selected action based on the state data and experience data, the experience data including the selected action and a reward earned for the selected action;
training the first neural network to update the plurality of action Q-values based on the target Q-value, the training using a stochastic gradient descent;
determine a bid action using the trained bidding model; and
submit a new bid on the available inventory to the real time bid server based on the bid action.
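
The training steps recited in claim 1 resemble a deep Q-learning update with a separate target model. The following is a minimal sketch of that pattern in Python with NumPy, not the patented implementation. Everything specific in it is an assumption made for illustration: the four-feature state encoding, the three-action set, the discount factor, the learning rate, the use of a frozen copy of the first network as the "second machine learning model", and the bootstrapped target that uses a next state (the claim only says the target is based on state data and experience data).

```python
import numpy as np

N_FEATURES = 4        # hypothetical state encoding (recent bids, bid result, response rate)
N_ACTIONS = 3         # hypothetical action set (lower, hold, or raise the bid)
GAMMA = 0.9           # assumed discount factor
LEARNING_RATE = 0.05  # assumed SGD step size


def init_network(rng, n_hidden=8):
    """Create a small two-layer network as a dict of weight arrays."""
    return {
        "W1": rng.normal(0.0, 0.1, (N_FEATURES, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, N_ACTIONS)),
        "b2": np.zeros(N_ACTIONS),
    }


def forward(net, state):
    """Return one Q-value per possible action, plus the hidden activations."""
    hidden = np.maximum(0.0, state @ net["W1"] + net["b1"])  # ReLU layer
    return hidden @ net["W2"] + net["b2"], hidden


def select_action(net, state):
    """Pick the action with the maximum Q-value."""
    q, _ = forward(net, state)
    return int(np.argmax(q))


def target_q_value(target_net, reward, next_state):
    """Target for the selected action: reward plus discounted best next Q (assumed form)."""
    next_q, _ = forward(target_net, next_state)
    return reward + GAMMA * float(np.max(next_q))


def sgd_update(net, state, action, target_q):
    """One stochastic gradient descent step on 0.5 * (Q(s, a) - target)^2."""
    q, hidden = forward(net, state)
    td_error = q[action] - target_q
    grad_q = np.zeros(N_ACTIONS)
    grad_q[action] = td_error  # only the selected action's output has a gradient
    grad_hidden = (net["W2"] @ grad_q) * (hidden > 0)  # backpropagate through ReLU
    net["W2"] -= LEARNING_RATE * np.outer(hidden, grad_q)
    net["b2"] -= LEARNING_RATE * grad_q
    net["W1"] -= LEARNING_RATE * np.outer(state, grad_hidden)
    net["b1"] -= LEARNING_RATE * grad_hidden
    return 0.5 * td_error ** 2


# Illustrative run on a single made-up experience tuple.
rng = np.random.default_rng(0)
online = init_network(rng)
target = {name: w.copy() for name, w in online.items()}  # frozen copy as second model
state = np.array([0.5, 0.4, 1.0, 0.02])
next_state = np.array([0.4, 0.45, 0.0, 0.02])
action = select_action(online, state)
y = target_q_value(target, reward=1.0, next_state=next_state)
losses = [sgd_update(online, state, action, y) for _ in range(200)]
```

In this sketch the target model is held fixed while the first network is updated, so the regression target does not move during the gradient steps. A fuller system would sample many experience tuples and periodically refresh the target model; those details are outside what the claim text specifies.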