CPC G10L 15/063 (2013.01) [G06F 40/279 (2020.01); G10L 13/08 (2013.01); G10L 15/02 (2013.01); G10L 15/1815 (2013.01); G10L 15/22 (2013.01)]

14 Claims
1. A computer-implemented method for jointly training a speech module with a knowledge module for natural language understanding, the method comprising:
obtaining a first knowledge graph comprising a set of entities and a set of relations between two or more entities included in the set of entities;
transforming the first knowledge graph into an acoustic knowledge graph, the acoustic knowledge graph comprising a set of acoustic data corresponding to each entity of the set of entities and each relation of the set of relations between entities;
extracting a first set of acoustic features from the acoustic knowledge graph;
training a knowledge module on the first set of acoustic features to generate knowledge-based entity representations for the acoustic knowledge graph;
pre-training a speech module on a first training data set comprising unlabeled speech data, the speech module being configured to understand acoustic information from the speech data;
pre-training a language module on a second training data set comprising unlabeled text-based data, the language module being configured to understand semantic information from speech transcriptions;
aligning the speech module and the language module, the speech module being configured to leverage acoustic information and language information in natural language processing tasks;
obtaining a third training data set comprising paired acoustic data and transcript data;
applying the third training data set to the speech module and the language module;
obtaining acoustic output embeddings from the speech module;
obtaining language output embeddings from the language module;
after pre-training the language module, extracting a set of textual features from the first knowledge graph;
training a second knowledge module on the set of textual features;
integrating the second knowledge module with the language module before the language module and the speech module are aligned;
aligning the acoustic output embeddings and the language output embeddings to a shared semantic space; and
generating an integrated knowledge-speech module to perform semantic analysis on audio data by integrating the knowledge module with the speech module, the integrated knowledge-speech module being generated by providing context information from the speech module to the knowledge module and knowledge information from the knowledge module to the speech module.
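The sketch below illustrates the embedding-alignment step recited in claim 1 (aligning the acoustic output embeddings and the language output embeddings to a shared semantic space). It is a minimal illustration in PyTorch, not the patented method: the projection layers, dimensions, 0.07 temperature, and the InfoNCE-style contrastive objective are all assumptions about one plausible way to align paired acoustic/transcript embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAligner(nn.Module):
    """Projects acoustic and language output embeddings into a shared
    semantic space and pulls paired rows together with a contrastive
    (InfoNCE-style) loss. All dimensions and the temperature are
    illustrative assumptions, not values from the claims."""

    def __init__(self, acoustic_dim=768, language_dim=768, shared_dim=256):
        super().__init__()
        self.acoustic_proj = nn.Linear(acoustic_dim, shared_dim)
        self.language_proj = nn.Linear(language_dim, shared_dim)

    def forward(self, acoustic_emb, language_emb):
        # Map both modalities into the shared semantic space.
        a = F.normalize(self.acoustic_proj(acoustic_emb), dim=-1)
        t = F.normalize(self.language_proj(language_emb), dim=-1)
        # Row i of each batch is a paired utterance/transcript (a
        # positive); every other row in the batch acts as a negative.
        logits = (a @ t.T) / 0.07
        targets = torch.arange(a.size(0), device=a.device)
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2

# Usage with one batch drawn from the third (paired) training data
# set; the batch size and embedding widths are assumptions.
aligner = EmbeddingAligner()
acoustic_out = torch.randn(8, 768)   # acoustic output embeddings
language_out = torch.randn(8, 768)   # language output embeddings
loss = aligner(acoustic_out, language_out)
loss.backward()
```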
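The final integrating step, in which context information flows from the speech module to the knowledge module and knowledge information flows back, could be realized in several ways; the sketch below shows one assumed realization using bidirectional cross-attention. The module name, dimensions, and head count are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

class KnowledgeSpeechIntegrator(nn.Module):
    """Bidirectional exchange: entity representations attend over the
    speech module's frame-level states (context information to the
    knowledge module), and the speech states then attend over the
    contextualized entities (knowledge information to the speech
    module). Cross-attention is one assumed realization of this
    exchange, not necessarily the patented one."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.context_to_knowledge = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.knowledge_to_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, speech_states, entity_reprs):
        # speech_states: (batch, frames, dim) from the speech module
        # entity_reprs:  (batch, entities, dim) knowledge-based entity
        #                representations from the knowledge module
        contextual_entities, _ = self.context_to_knowledge(
            query=entity_reprs, key=speech_states, value=speech_states)
        grounded_speech, _ = self.knowledge_to_speech(
            query=speech_states, key=contextual_entities,
            value=contextual_entities)
        return grounded_speech, contextual_entities

# Usage on dummy tensors (shapes are assumptions): 100 speech frames
# and 12 candidate entities per example.
integrator = KnowledgeSpeechIntegrator()
speech = torch.randn(2, 100, 256)
entities = torch.randn(2, 12, 256)
grounded_speech, contextual_entities = integrator(speech, entities)
```

Running both attention passes, rather than a single one, is what makes the exchange bidirectional: each module's output is conditioned on the other's before the integrated knowledge-speech module performs semantic analysis on the audio data.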