US 12,271,703 B2
Reinforcement learning techniques for dialogue management
Rajesh Virupaksha Munavalli, San Jose, CA (US)
Assigned to PayPal, Inc., San Jose, CA (US)
Filed by PayPal, Inc., San Jose, CA (US)
Filed on Sep. 29, 2021, as Appl. No. 17/489,356.
Claims priority of provisional application 63/086,715, filed on Oct. 2, 2020.
Prior Publication US 2022/0108080 A1, Apr. 7, 2022
Int. Cl. G06F 40/35 (2020.01); G06N 3/08 (2023.01); G06F 16/3329 (2025.01); G10L 15/06 (2013.01); G10L 15/22 (2006.01)
CPC G06F 40/35 (2020.01) [G06N 3/08 (2013.01); G06F 16/3329 (2019.01); G10L 15/063 (2013.01); G10L 15/22 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method, comprising:
performing, by a computer system, an iterative training operation to train a deep Q-learning network (“DQN”) based on conversation log information corresponding to a plurality of prior conversations, wherein the DQN includes:
an input layer to receive an input value indicative of a current state of a given conversation;
one or more hidden layers; and
an output layer that includes a plurality of output nodes corresponding to a plurality of available responses;
wherein, for a first conversation log corresponding to a first one of the plurality of prior conversations, the iterative training operation includes:
determining a current state of the first prior conversation based on a first user utterance, wherein determining the current state of the first prior conversation includes identifying, for the first user utterance, a first cluster of user utterances from a plurality of clusters of user utterances, wherein user utterances associated with the plurality of prior conversations are partitioned into the plurality of clusters;
generating a first input value to the DQN based on the current state of the first prior conversation;
applying the first input value to the DQN to identify a first response, from the plurality of available responses, to provide to the first user utterance; and
updating the DQN based on a first reward value provided based on the first response; and
repeating, by the computer system, the iterative training operation using a second conversation log corresponding to a second one of the plurality of prior conversations.
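
The claim recites a Q-network whose input encodes the current conversation state (derived by assigning the current user utterance to one of several utterance clusters) and whose output layer has one node per available response. The sketch below illustrates a network and state encoding of that general shape only; the choice of PyTorch and scikit-learn, the one-hot cluster encoding, and all class names, layer sizes, and parameters are illustrative assumptions and are not taken from the patent.

```python
# Illustrative sketch only: a small feed-forward Q-network whose input encodes
# the conversation state and whose output layer has one node per candidate
# response. Library choices, names, and sizes are assumptions for illustration.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


class DialogueDQN(nn.Module):
    """Input layer -> hidden layers -> one output node per available response."""

    def __init__(self, state_dim: int, hidden_dim: int, num_responses: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),      # input layer
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),     # hidden layer
            nn.ReLU(),
            nn.Linear(hidden_dim, num_responses),  # one Q-value per available response
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def cluster_utterances(utterance_embeddings, num_clusters: int) -> KMeans:
    """Partition embeddings of user utterances from prior conversations into clusters."""
    return KMeans(n_clusters=num_clusters, n_init=10).fit(utterance_embeddings)


def encode_state(kmeans: KMeans, utterance_embedding, num_clusters: int) -> torch.Tensor:
    """One-hot encoding of the cluster the current utterance falls into
    (a deliberately simple stand-in for the claimed state determination)."""
    cluster_id = int(kmeans.predict(utterance_embedding.reshape(1, -1))[0])
    state = torch.zeros(num_clusters)
    state[cluster_id] = 1.0
    return state
```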
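The iterative training operation then replays logged turns: the state is applied to the DQN, a response is identified from the output nodes, and the network is updated based on a reward value provided for that response. A hedged sketch of one such update, using a standard temporal-difference DQN target, follows; the discount factor, loss, optimizer, and helper names are again assumptions for illustration, not claim language.

```python
# Illustrative training step for a single logged turn: score all responses,
# take the Q-value of the identified response, and regress it toward a
# reward-plus-discounted-next-state target (standard DQN update).
import torch
import torch.nn.functional as F


def train_on_turn(dqn, optimizer, state, next_state, reward, chosen_response, gamma=0.95):
    """One iterative-training update from a (state, response, reward) turn in a conversation log."""
    q_values = dqn(state.unsqueeze(0))           # Q(s, ·) over all available responses
    q_taken = q_values[0, chosen_response]       # Q(s, a) for the identified response

    with torch.no_grad():                        # bootstrapped target from the next state
        target = reward + gamma * dqn(next_state.unsqueeze(0)).max()

    loss = F.mse_loss(q_taken, target)           # temporal-difference error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Per the claim, the reward value is "provided based on the first response"; how it is computed is not specified in this entry, and the same operation is repeated for the second conversation log and its turns.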