US 12,243,513 B2
Generation of optimized spoken language understanding model through joint training with integrated acoustic knowledge-speech module
Chenguang Zhu, Sammamish, WA (US); and Nanshan Zeng, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on May 18, 2021, as Appl. No. 17/323,847.
Claims priority of provisional application 63/205,646, filed on Jan. 20, 2021.
Prior Publication US 2022/0230629 A1, Jul. 21, 2022
Int. Cl. G10L 15/06 (2013.01); G06F 40/279 (2020.01); G10L 13/08 (2013.01); G10L 15/02 (2006.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01)
CPC G10L 15/063 (2013.01) [G06F 40/279 (2020.01); G10L 13/08 (2013.01); G10L 15/02 (2013.01); G10L 15/1815 (2013.01); G10L 15/22 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A computer-implemented method for jointly training a speech module with a knowledge module for natural language understanding, the method comprising:
obtaining a first knowledge graph comprising a set of entities and a set of relations between two or more entities included in the set of entities;
transforming the first knowledge graph into an acoustic knowledge graph, the acoustic knowledge graph comprising a set of acoustic data corresponding to each entity of the set of entities and each relation of the set of relations between entities;
extracting a first set of acoustic features from the acoustic knowledge graph;
training a knowledge module on the first set of acoustic features to generate knowledge-based entity representations for the acoustic knowledge graph;
pre-training a speech module on a training data set comprising unlabeled speech data to understand acoustic information from speech transcriptions;
pre-training a language module on a second training data set comprising unlabeled text-based data, the language module being configured to understand semantic information from speech transcriptions;
aligning the speech module and the language module, the speech module being configured to leverage acoustic information and language information in natural language processing tasks;
obtaining a third training data set comprising paired acoustic data and transcript data;
applying the third training data set to the speech module and language module;
obtaining acoustic output embeddings from the speech module;
obtaining language output embeddings from the language module;
after pre-training the language module, extracting a set of textual features from the first knowledge graph;
training a second knowledge module on the set of textual features;
integrating the second knowledge module with the language module before the language module and speech module are aligned;
aligning the acoustic output embeddings and the language output embeddings to a shared semantic space; and
generating an integrated knowledge-speech module to perform semantic analysis on audio data by integrating the knowledge module with the speech module, the integrated knowledge-speech module being generated by providing context information from the speech module to the knowledge module and knowledge information from the knowledge module to the speech module.
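The claim's first steps transform a symbolic knowledge graph into an acoustic one by attaching acoustic data to every entity and relation, then extract acoustic features from it. The sketch below illustrates that transformation under stated assumptions: a real pipeline would synthesize speech for each label (e.g., with a text-to-speech system) and extract filterbank or MFCC features, whereas here a deterministic pseudo-feature generator stands in for synthesis and feature extraction. All names (`pseudo_acoustic_features`, `to_acoustic_knowledge_graph`) and dimensions are illustrative, not drawn from the patent.

```python
import numpy as np

FEAT_DIM = 40  # e.g., 40-dim log-mel filterbank features per frame (assumed)

def pseudo_acoustic_features(label: str, n_frames: int = 50) -> np.ndarray:
    """Stand-in for TTS + acoustic feature extraction.

    Returns an (n_frames x FEAT_DIM) feature matrix derived
    deterministically (within one process) from the label text.
    """
    rng = np.random.default_rng(abs(hash(label)) % (2**32))
    return rng.standard_normal((n_frames, FEAT_DIM))

def to_acoustic_knowledge_graph(triples):
    """Map every entity and relation label in a (head, relation, tail)
    triple set to acoustic features, yielding the 'acoustic knowledge
    graph' of the claim: the same graph structure, with acoustic data
    attached to each node and edge label."""
    acoustic_kg = {}
    for head, rel, tail in triples:
        for label in (head, rel, tail):
            if label not in acoustic_kg:
                acoustic_kg[label] = pseudo_acoustic_features(label)
    return acoustic_kg

# Toy knowledge graph: a set of entities and relations between them.
triples = [("Seattle", "located_in", "Washington"),
           ("Microsoft", "headquartered_in", "Redmond")]
akg = to_acoustic_knowledge_graph(triples)
# Each entity/relation label now carries a (frames x features) matrix,
# ready for the knowledge module to consume as acoustic features.
```

A knowledge module would then be trained on these per-label feature matrices to produce knowledge-based entity representations, as recited in the claim.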
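The claim also recites aligning acoustic output embeddings and language output embeddings to a shared semantic space using paired acoustic/transcript data. One common way to realize such an alignment, sketched here as an assumption rather than the patent's own formulation, is a symmetric contrastive (InfoNCE-style) objective over cosine similarities, which pulls the embedding of each utterance toward the embedding of its own transcript and away from the other transcripts in the batch. Dimensions, the temperature value, and function names are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_alignment_loss(acoustic_emb, language_emb, temperature=0.07):
    """Symmetric cross-entropy over the (batch x batch) cosine-similarity
    matrix; row i's target is column i (the i-th audio clip matches the
    i-th transcript)."""
    a = l2_normalize(acoustic_emb)
    t = l2_normalize(language_emb)
    logits = a @ t.T / temperature

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        idx = np.arange(len(lg))
        return -logp[idx, idx].mean()                    # diagonal = matched pairs

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
shared = rng.standard_normal((8, 64))
# Well-aligned pair: language embeddings are a slightly noised copy of the
# acoustic embeddings, i.e., both already live in one semantic space.
loss_aligned = contrastive_alignment_loss(shared, shared + 0.05 * rng.standard_normal((8, 64)))
# Unaligned pair: independent random embeddings.
loss_random = contrastive_alignment_loss(shared, rng.standard_normal((8, 64)))
```

Minimizing such a loss over the paired third training data set drives the two modules' output embeddings into the shared semantic space the claim describes; matched pairs produce a markedly lower loss than unrelated pairs.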