US 12,002,486 B2
Tag estimation device, tag estimation method, and program
Ryo Masumura, Tokyo (JP); and Tomohiro Tanaka, Tokyo (JP)
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
Appl. No. 17/279,009
Filed by NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
PCT Filed Sep. 13, 2019, PCT No. PCT/JP2019/036005
§ 371(c)(1), (2) Date Mar. 23, 2021
PCT Pub. No. WO2020/066673, PCT Pub. Date Apr. 2, 2020.
Claims priority of application No. 2018-180018 (JP), filed on Sep. 26, 2018.
Prior Publication US 2022/0036912 A1, Feb. 3, 2022
Int. Cl. G10L 25/30 (2013.01); G10L 15/02 (2006.01); G10L 15/10 (2006.01); G10L 25/48 (2013.01)
CPC G10L 25/30 (2013.01) [G10L 15/02 (2013.01); G10L 15/10 (2013.01); G10L 25/48 (2013.01)] 9 Claims
OG exemplary drawing
 
1. A tag estimation device comprising:
a hardware processor that:
retrieves, from memory, a t-th utterance of a sequence of utterances in a dialogue, wherein the sequence of utterances was stored in the memory following receipt of the sequence of utterances from a microphone or from a transmission of data over a communication line, wherein the t-th utterance is spoken by a speaker of a plurality of speakers in the dialogue, the t-th utterance includes a word, and the t is a natural number;
generates, based on the t-th utterance, a t-th speaker vector and a t-th utterance word feature vector;
retrieves from the memory a (t−1)-th utterance sequence information vector u_{t-1} that includes an utterance word feature vector that precedes the t-th utterance word feature vector and a speaker vector that precedes the t-th speaker vector;
generates, by a recurrent neural network (RNN), based on the t-th utterance word feature vector, the t-th speaker vector, and the (t−1)-th utterance sequence information vector u_{t-1}, a t-th utterance sequence information vector u_t,
wherein the t-th utterance sequence information vector u_t is based on recursively combining the utterance word feature vectors and speaker vectors from a first utterance to the t-th utterance of the sequence of utterances in the dialogue, and each speaker vector of an utterance in the sequence of utterances represents a speaker of the plurality of speakers speaking that utterance,
wherein the t-th utterance word feature vector is associated with at least the word in the t-th utterance spoken by the speaker, the t-th speaker vector is associated with the speaker, and the t-th utterance sequence information vector u_t represents a feature associated with the t-th utterance in the sequence of utterances in the dialogue, and wherein generating the t-th utterance sequence information vector u_t comprises operating according to the formulas:
c_t = [w_t^T, r_t^T]^T, and
u_t = RNN(c_t, u_{t-1}),
wherein the RNN represents a function having capabilities of a recurrent neural network, the T represents a transposition of a vector, the w_t represents the utterance word feature vector of the t-th utterance, and the r_t represents the speaker vector of the t-th utterance;
determines a tag l_t associated with the t-th utterance, wherein the tag l_t represents a result of analyzing the t-th utterance based on a predetermined model parameter and the t-th utterance sequence information vector u_t, and wherein the tag l_t specifies a scene in the dialogue;
stores in the memory the t-th speaker vector, the t-th utterance word feature vector, and the t-th utterance sequence information vector u_t for performing tag estimation of a subsequent utterance; and
transmits the tag l_t associated with the t-th utterance to program instructions configured to output the tag l_t.
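
The recursion recited in the claim lends itself to a compact illustration. The following is a minimal sketch in PyTorch, assuming a GRU cell as the "function having capabilities of a recurrent neural network" and a linear layer standing in for the unspecified "predetermined model parameter"; the names TagEstimator, word_dim, spk_dim, hidden_dim, and num_tags are illustrative and do not come from the patent.

import torch
import torch.nn as nn

class TagEstimator(nn.Module):
    def __init__(self, word_dim, spk_dim, hidden_dim, num_tags):
        super().__init__()
        # One recurrent step realizes u_t = RNN(c_t, u_{t-1}).
        # (A GRU cell is an assumption; the claim does not fix the RNN type.)
        self.rnn_cell = nn.GRUCell(word_dim + spk_dim, hidden_dim)
        # Stand-in for the "predetermined model parameter": maps u_t to tag scores.
        self.tagger = nn.Linear(hidden_dim, num_tags)

    def step(self, w_t, r_t, u_prev):
        # c_t = [w_t^T, r_t^T]^T: concatenate the utterance word feature
        # vector and the speaker vector of the t-th utterance.
        c_t = torch.cat([w_t, r_t], dim=-1)
        # Fold the t-th utterance into the dialogue state u_t.
        u_t = self.rnn_cell(c_t, u_prev)
        # l_t: the tag (e.g., a scene label) estimated from u_t.
        l_t = self.tagger(u_t).argmax(dim=-1)
        return u_t, l_t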
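A hypothetical usage over a three-utterance dialogue, carrying u_{t-1} forward across steps as the claim's storing step suggests; the feature extractors that would derive w_t and r_t from raw speech are out of scope here and replaced by random tensors:

model = TagEstimator(word_dim=128, spk_dim=16, hidden_dim=64, num_tags=5)
u_prev = torch.zeros(1, 64)      # u_0: state before the first utterance
for t in range(3):
    w_t = torch.randn(1, 128)    # stand-in for the t-th utterance word feature vector
    r_t = torch.randn(1, 16)     # stand-in for the t-th speaker vector
    u_prev, l_t = model.step(w_t, r_t, u_prev)  # u_{t-1} carried forward
    print(int(l_t))              # the estimated tag l_t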