US 12,292,905 B2
Multi-turn dialogue system and method based on retrieval
Haifeng Sun, Beijing (CN); Zirui Zhuang, Beijing (CN); Bing Ma, Beijing (CN); Jingyu Wang, Beijing (CN); Cheng Zhang, Beijing (CN); Tong Xu, Beijing (CN); and Jing Wang, Beijing (CN)
Assigned to BEIJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS, Beijing (CN)
Filed by BEIJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS, Beijing (CN)
Filed on Jan. 10, 2023, as Appl. No. 18/095,196.
Claims priority of application No. 202210649202.3 (CN), filed on Jun. 9, 2022.
Prior Publication US 2023/0401243 A1, Dec. 14, 2023
Int. Cl. G06F 17/00 (2019.01); G06F 16/3329 (2025.01); G06F 16/334 (2025.01); G06N 3/08 (2023.01)
CPC G06F 16/3329 (2019.01) [G06F 16/3344 (2019.01); G06F 16/3347 (2019.01); G06N 3/08 (2013.01)] 1 Claim
OG exemplary drawing
 
1. A computer-implemented multi-turn dialogue method based on retrieval, comprising:
(1) converting each turn of dialogue into a cascade vector Eu of the dialogue, and converting a candidate answer r into a cascade vector Er of the candidate answer; the cascade vector Eu of the dialogue is obtained by cascading a word level vector and a character level vector in the dialogue; the cascade vector Er of the candidate answer is obtained by cascading a word level vector and a character level vector in the candidate answer; the word level vector is obtained by a tool Word2vec; the character level vector is obtained by encoding character information through a convolutional neural network;
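Step (1) lends itself to a short illustration. Below is a minimal sketch, assuming PyTorch, of cascading a pretrained Word2vec word-level vector with a character-level vector produced by a convolutional network; the class name, dimensions, and CNN configuration are illustrative assumptions, not the patent's implementation.

```python
# Minimal sketch of step (1): cascade (concatenate) word-level and
# character-level vectors for each token.  Names, dimensions and the
# char-CNN configuration are illustrative assumptions.
import torch
import torch.nn as nn


class CascadeEmbedder(nn.Module):
    def __init__(self, word_vectors, char_vocab_size,
                 char_dim=16, char_channels=50, kernel_size=3):
        super().__init__()
        # Word-level vectors, e.g. pretrained with Word2vec and frozen.
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        # Character-level path: embed characters, run a 1-D CNN, max-pool.
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_channels, kernel_size,
                                  padding=kernel_size // 2)

    def forward(self, word_ids, char_ids):
        # word_ids: (num_words,); char_ids: (num_words, max_chars)
        w = self.word_emb(word_ids)                          # word-level vectors
        c = self.char_emb(char_ids).transpose(1, 2)          # (num_words, char_dim, max_chars)
        c = torch.relu(self.char_cnn(c)).max(dim=-1).values  # char-level vectors
        return torch.cat([w, c], dim=-1)                     # cascade vector E_u or E_r
```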
(2) taking the cascade vector of the dialogue and the cascade vector of the candidate answer as an input, dynamically absorbing context information based on a global attention mechanism, and recursively calculating a k-th layer self-attention dialogue representation Û^k, a k-th layer self-attention candidate answer representation R̂^k, a k-th layer mutual attention dialogue representation Ū^k, a k-th layer mutual attention candidate answer representation R̄^k, a k-th layer dialogue synthesis representation U^k, and a k-th layer candidate answer synthesis representation R^k, by the following formulas, to obtain a matching vector (v1, . . . , vl):
Û^k = f_catt(U^{k-1}, U^{k-1}, C)
R̂^k = f_catt(R^{k-1}, R^{k-1}, C)
Ū^k = f_catt(U^{k-1}, R^{k-1}, C)
R̄^k = f_catt(R^{k-1}, U^{k-1}, C)
Ũ^k = [U^{k-1}, Û^k, Ū^k, U^{k-1} ⊙ Ū^k]
R̃^k = [R^{k-1}, R̂^k, R̄^k, R^{k-1} ⊙ R̄^k]
U^k = max(0, W_hŨ^k + b_h)
R^k = max(0, W_hR̃^k + b_h) + R^{k-1}
in the formulas, U^{k-1} ∈ ℝ^{m×d} and R^{k-1} ∈ ℝ^{n×d} represent inputs of a k-th global interaction layer, wherein m and n represent the number of words contained in a current turn of dialogue and the number of words contained in the candidate answer, respectively, and inputs of a first global interaction layer are U^0 = E_u, R^0 = E_r; W_h ∈ ℝ^{4d×d} and b_h are training parameters; the operator ⊙ represents element-wise multiplication; d represents a dimension of a vector;
C ∈ ℝ^{l_c×d} represents the context obtained by cascading the contents of all l turns of dialogue; all l turns of dialogue together contain l_c words, and C can be obtained by cascading the word level vectors of the l_c words;
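A minimal sketch of one global interaction layer implementing the recursion above, assuming PyTorch; f_catt is the global attention mechanism defined next in the claim and is taken here as a callable, and the layer and parameter names are illustrative.

```python
# Minimal sketch of one k-th global interaction layer (step (2) recursion).
# `f_catt` is the global attention mechanism defined later in the claim;
# everything else here is illustrative.
import torch
import torch.nn as nn


class GlobalInteractionLayer(nn.Module):
    def __init__(self, d, f_catt):
        super().__init__()
        self.f_catt = f_catt                    # callable: (Q, K, C) -> (n_q, d)
        self.proj = nn.Linear(4 * d, d)         # plays the role of W_h, b_h

    def forward(self, U_prev, R_prev, C):
        # Self-attention and mutual-attention representations.
        U_hat = self.f_catt(U_prev, U_prev, C)  # self-attention dialogue repr.
        R_hat = self.f_catt(R_prev, R_prev, C)  # self-attention answer repr.
        U_bar = self.f_catt(U_prev, R_prev, C)  # mutual attention dialogue repr.
        R_bar = self.f_catt(R_prev, U_prev, C)  # mutual attention answer repr.
        # Cascade the four views; `*` is the element-wise product (⊙).
        U_tilde = torch.cat([U_prev, U_hat, U_bar, U_prev * U_bar], dim=-1)
        R_tilde = torch.cat([R_prev, R_hat, R_bar, R_prev * R_bar], dim=-1)
        # Synthesis representations U^k and R^k.
        U_k = torch.relu(self.proj(U_tilde))
        R_k = torch.relu(self.proj(R_tilde)) + R_prev   # residual term as in the claim
        return U_k, R_k
```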
in the formulas, f_catt( ) represents the global attention mechanism, which is specifically defined as follows:
f_catt(Q,K,C) = Q̄ + FNN(Q̄)
where, FNN(Q̄) = max(0, Q̄W_f + b_f)W_g + b_g, wherein W_{f,g} ∈ ℝ^{d×d} and b_{f,g} are trainable parameters; Q̂ and Q are mixed using a residual connection to obtain Q̄, wherein Q̂ is calculated according to the following formula:
Q̂ = S(Q,K,C)·K
where, Q ∈ ℝ^{n_q×d} represents a query sequence, K ∈ ℝ^{n_k×d} represents a key sequence, wherein n_q and n_k represent the number of words, and S(Q,K,C) ∈ ℝ^{n_q×n_k} represents a similarity of Q and K in the context C; S(Q,K,C) is calculated according to the following formula:

OG Complex Work Unit Math
where, W_{b,c,d,e} are trainable parameters, C_i^q represents an i-th row of C^q, and its physical meaning is the fused context information related to an i-th word in the query sequence Q; C_j^k represents a j-th row of C^k, and its physical meaning is the fused context information related to a j-th word of the key sequence K;
C^q ∈ ℝ^{n_q×d} and C^k ∈ ℝ^{n_k×d} represent a context information compression vector fusing the query vector Q and a context information compression vector fusing the key vector K, respectively, and are calculated according to the following formulas:
C^q = softmax(QW_aC^T)·C
C^k = softmax(KW_aC^T)·C
W_a ∈ ℝ^{d×d} is a training parameter; and
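The word-pair similarity S(Q,K,C) itself appears only in the formula reproduced above as an OG work unit, so the sketch below, assuming PyTorch, implements only the pieces the claim text spells out: the context compression vectors C^q and C^k, the attended query Q̂ = S(Q,K,C)·K, the residual mix Q̄, and the feed-forward term FNN(Q̄). The scaled dot-product over context-augmented inputs used for S is a stand-in assumption, not the patented formula.

```python
# Minimal sketch of the global attention mechanism f_catt(Q, K, C).
# The context-compression vectors and the residual + FNN structure follow the
# claim text; the similarity S(Q, K, C) below is a stand-in (an assumption).
import math
import torch
import torch.nn as nn


class GlobalAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W_a = nn.Linear(d, d, bias=False)   # plays the role of W_a
        self.ffn = nn.Sequential(                # FNN(x) = max(0, xW_f + b_f)W_g + b_g
            nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, Q, K, C):
        d = Q.size(-1)
        # Context-information compression vectors fused with Q and with K.
        C_q = torch.softmax(self.W_a(Q) @ C.t(), dim=-1) @ C   # (n_q, d)
        C_k = torch.softmax(self.W_a(K) @ C.t(), dim=-1) @ C   # (n_k, d)
        # Stand-in similarity over context-augmented Q and K; the patent's
        # S(Q, K, C) has its own form (see the formula above).
        S = torch.softmax((Q + C_q) @ (K + C_k).t() / math.sqrt(d), dim=-1)
        Q_hat = S @ K                     # attended query, Q̂ = S(Q,K,C)·K
        Q_bar = Q + Q_hat                 # residual mix of Q and Q̂
        return Q_bar + self.ffn(Q_bar)    # f_catt(Q,K,C) = Q̄ + FNN(Q̄)
```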
extracting a d-dimensional matching vector v_i from a matching image M_i of an i-th turn of dialogue by a convolutional neural network, and the matching vectors from the first to the l-th turn of dialogue are represented by (v1, . . . , vl); the matching image M_i of the i-th turn of dialogue is calculated according to the following formula:
M_i = M_{i,self} ⊕ M_{i,interaction} ⊕ M_{i,enhanced}
where, M_i ∈ ℝ^{m_i×n×3}, ⊕ is a cascading operation, m_i is the number of words contained in the i-th turn of dialogue u_i; M_{i,self}, M_{i,interaction} and M_{i,enhanced} are calculated according to the following formulas:

OG Complex Work Unit Math
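The three channels M_{i,self}, M_{i,interaction} and M_{i,enhanced} are given by the formulas referenced above (reproduced only as an OG work unit), so the sketch below, assuming PyTorch, takes them as ready-made m_i×n word-pair maps, cascades them into the m_i×n×3 matching image M_i, and extracts a d-dimensional v_i with a small 2-D CNN whose configuration is illustrative.

```python
# Minimal sketch of building the matching image M_i and extracting the
# d-dimensional matching vector v_i with a CNN.  The three channels are
# computed by the formulas the claim references and are taken here as
# ready-made inputs; the CNN configuration is an illustrative assumption.
import torch
import torch.nn as nn


class MatchingVectorExtractor(nn.Module):
    def __init__(self, d, conv_channels=32, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(3, conv_channels, kernel_size, padding=kernel_size // 2)
        self.out = nn.Linear(conv_channels, d)

    def forward(self, m_self, m_inter, m_enh):
        # Each input is an (m_i, n) word-pair map; cascade them into M_i.
        M_i = torch.stack([m_self, m_inter, m_enh], dim=0).unsqueeze(0)  # (1, 3, m_i, n)
        feat = torch.relu(self.conv(M_i))                                # (1, C, m_i, n)
        feat = feat.amax(dim=(2, 3))                                     # global max-pool
        return self.out(feat).squeeze(0)                                 # v_i, a d-dim vector
```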
(3) receiving the matching vector (v1, . . . , vl), processing the matching vector by an RNN network to obtain a short-term dependence information sequence (h1, . . . , hl), and processing the matching vector by a Transformer network to obtain a long-term dependence information sequence (g1, . . . , gl);
wherein a specific calculation process of the short-term dependence information sequence (h1, . . . , hl) is:
obtaining l hidden layer state vectors by processing the matching vector (v1, . . . , vl) through a GRU model, wherein an i-th hidden layer state is:
h_i = GRU(v_i, h_{i-1})
where, h_0 is initialized randomly;
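A minimal sketch of this short-term dependence path, assuming PyTorch's nn.GRU; all sizes are illustrative.

```python
# Short-term dependence path of step (3): feed the matching vectors
# (v_1, ..., v_l) through a GRU with a randomly initialized h_0.
import torch
import torch.nn as nn

d, l = 64, 5                             # illustrative dimension and number of turns
gru = nn.GRU(input_size=d, hidden_size=d, batch_first=True)

V = torch.randn(1, l, d)                 # matching vectors (v_1, ..., v_l)
h0 = torch.randn(1, 1, d)                # h_0 initialized randomly
H, _ = gru(V, h0)                        # H[0, i] is the (i+1)-th hidden state h_{i+1}
```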
a specific calculation process of the long-term dependence information sequence (g1, . . . , gl) is:
(g1, . . . , gl)=MultiHead(Q,K,V)
where,
Q = V_mW_Q, K = V_mW_K, V = V_mW_V,
where W_Q, W_K and W_V are training parameters; MultiHead( ) represents a multi-head attention function; V_m = (v1, . . . , vl);
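A minimal sketch of this long-term dependence path, assuming PyTorch's nn.MultiheadAttention, which applies the W_Q, W_K and W_V projections internally; all sizes are illustrative.

```python
# Long-term dependence path of step (3): multi-head self-attention over
# V_m = (v_1, ..., v_l) to obtain (g_1, ..., g_l).
import torch
import torch.nn as nn

d, l, num_heads = 64, 5, 4               # illustrative sizes
attn = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)

V_m = torch.randn(1, l, d)               # matching vectors as a sequence
G, _ = attn(V_m, V_m, V_m)               # G[0, i] is g_{i+1}
```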
(4) calculating a matching score of the context c and the candidate answer involved in matching according to the short-term dependence information sequence (h1, . . . , hl) and the long-term dependence information sequence (g1, . . . , gl), wherein the calculating includes:
calculating

OG Complex Work Unit Math
to obtain (ĝ1, . . . , ĝl), wherein ⊙ represents element-wise multiplication;
then inputting (ĝ1, . . . , ĝl) into a GRU model, to obtain:
g_i = GRU(ĝ_i, g_{i-1})
wherein g_0 is initialized randomly; a final hidden layer state of the GRU model is g_l;
calculating the matching score of the context c and the candidate answer r involved in matching based on g_l:
g(c,r) = σ(g_l·w_o + b_o)
where, σ(·) represents a sigmoid function, and w_o and b_o are training parameters;
(5) selecting a candidate answer with a highest matching score as a correct answer.
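Steps (4) and (5) can be illustrated together. The exact fusion formula appears only as an OG work unit above; since the claim states that element-wise multiplication is used to combine the two sequences, the sketch below, assuming PyTorch, takes ĝ_i = h_i ⊙ g_i as an assumed fusion, runs a GRU over it, scores each (context, candidate) pair with a sigmoid, and selects the candidate with the highest score.

```python
# Minimal sketch of steps (4)-(5): fuse the short- and long-term sequences,
# run a GRU, score each (context, candidate) pair, and pick the best one.
# The fusion ĝ_i = h_i ⊙ g_i is an assumption; the claim only states that
# element-wise multiplication is involved.
import torch
import torch.nn as nn

d, l, num_candidates = 64, 5, 10                 # illustrative sizes
gru = nn.GRU(input_size=d, hidden_size=d, batch_first=True)
score_layer = nn.Linear(d, 1)                    # plays the role of w_o, b_o

scores = []
for _ in range(num_candidates):
    H = torch.randn(1, l, d)                     # (h_1, ..., h_l) for this candidate
    G = torch.randn(1, l, d)                     # (g_1, ..., g_l) for this candidate
    G_hat = H * G                                # assumed fusion: element-wise product
    _, g_last = gru(G_hat, torch.randn(1, 1, d)) # final hidden layer state
    scores.append(torch.sigmoid(score_layer(g_last.squeeze(0))))  # g(c, r)

best = int(torch.argmax(torch.cat(scores)))      # step (5): highest matching score
```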