US 12,136,414 B2
Integrating dialog history into end-to-end spoken language understanding systems
Samuel Thomas, White Plains, NY (US); Jatin Ganhotra, White Plains, NY (US); Hong-Kwang Kuo, Pleasantville, NY (US); Sachindra Joshi, Gurgaon (IN); George Andrei Saon, Stamford, CT (US); Zoltan Tueske, White Plains, NY (US); and Brian E. D. Kingsbury, Cortlandt Manor, NY (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Aug. 18, 2021, as Appl. No. 17/405,532.
Prior Publication US 2023/0056680 A1, Feb. 23, 2023
Int. Cl. G10L 15/16 (2006.01); G10L 15/065 (2013.01); G10L 15/18 (2013.01); G10L 15/183 (2013.01); G10L 15/08 (2006.01)

CPC G10L 15/16 (2013.01) [G10L 15/065 (2013.01); G10L 15/1815 (2013.01); G10L 15/183 (2013.01); G10L 2015/088 (2013.01)]

20 Claims

1. A system comprising:

at least one processor; and

at least one memory device coupled with the processor;

the at least one processor configured to:

receive audio signals representing a current utterance in a conversation and a dialog history including at least information associated with past utterances corresponding to the current utterance in the conversation;

encode the dialog history into an embedding, wherein a span of the dialog history is used in encoding the dialog history;

generate input features for a spoken language understanding neural network model by appending the embedding of the dialog history to acoustic features of the audio signals; and

train the spoken language understanding neural network model to perform a spoken language understanding task based on the input features, wherein an input layer of the spoken language understanding neural network model is expanded to receive both the acoustic features of the current utterance and embedding feature dimensions of the embedding representing the dialog history, wherein network parameters associated with the expanded part of the input layer are randomly initialized,

wherein the embedding feature dimensions include at least types of the past utterances classified into dialog action classification tasks, wherein the dialog action classification tasks are classified using a trained multi-label binary classification task model.
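
The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and everything named in it is an assumption made for illustration: the dialog-act label set, the function names, the per-frame appending of the embedding, the uniform initialization range, and the choice to keep the existing acoustic weights unchanged when the input layer is widened. The per-utterance probabilities stand in for the output of a trained multi-label binary classifier, and the training step itself is omitted.

```python
import random

# Hypothetical dialog-act labels; the claim does not enumerate them.
DIALOG_ACTS = ["greeting", "request", "inform", "confirm"]


def encode_history(past_act_probs, span):
    """Encode the last `span` past utterances as one fixed-size embedding.

    Each entry of `past_act_probs` is a list of independent probabilities,
    one per dialog act, as a multi-label binary classifier would emit.
    Turns missing from a short history are zero-padded at the front.
    """
    n = len(DIALOG_ACTS)
    recent = past_act_probs[-span:] if span > 0 else []
    padded = [[0.0] * n for _ in range(span - len(recent))] + recent
    return [p for turn in padded for p in turn]


def append_history(acoustic_frames, history_embedding):
    """Append the same history embedding to every acoustic frame
    (assumption: the claim does not say how the appending is done)."""
    return [frame + history_embedding for frame in acoustic_frames]


def expand_input_layer(weights, extra_dims, rng, scale=0.01):
    """Widen an input weight matrix (rows: hidden units, cols: input dims).

    Existing columns for the acoustic dimensions are kept as they are;
    the new columns for the history dimensions are randomly initialized.
    """
    return [
        row + [rng.uniform(-scale, scale) for _ in range(extra_dims)]
        for row in weights
    ]


rng = random.Random(0)
acoustic_dim, hidden_units = 3, 2

# Two toy acoustic frames for the current utterance.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
# Classifier outputs for two past utterances (toy values).
history = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.8, 0.7, 0.0]]

embedding = encode_history(history, span=2)
inputs = append_history(frames, embedding)

# Stand-in for an input layer that only accepted acoustic features.
acoustic_only = [
    [rng.gauss(0.0, 1.0) for _ in range(acoustic_dim)]
    for _ in range(hidden_units)
]
expanded = expand_input_layer(acoustic_only, len(embedding), rng)
```

In this sketch the `span` argument controls how many past utterances contribute to the embedding, so the width of the expanded input layer is the acoustic dimension plus `span` times the number of dialog-act labels.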