US 11,868,715 B2
Deep learning based automatic ontology extraction to detect new domain knowledge
Dnyanesh G. Rajpathak, Troy, MI (US); Ravi S. Sambangi, Rochester Hills, MI (US); and Xinli Wang, Troy, MI (US)
Assigned to GM GLOBAL TECHNOLOGY OPERATIONS LLC, Detroit, MI (US)
Filed by GM GLOBAL TECHNOLOGY OPERATIONS LLC, Detroit, MI (US)
Filed on Jan. 4, 2021, as Appl. No. 17/140,360.
Prior Publication US 2022/0215167 A1, Jul. 7, 2022
Int. Cl. G06F 40/20 (2020.01); G06F 17/18 (2006.01); G06N 3/047 (2023.01)
CPC G06F 40/20 (2020.01) [G06F 17/18 (2013.01); G06N 3/047 (2023.01)]
20 Claims
 
1. A system comprising:
a processor; and
a memory storing instructions which when executed by the processor configure the processor to:
process unstructured data to identify a plurality of subsets of text in a set of text in the unstructured data;
determine, for a subset from the plurality of subsets, probabilities based on a position of the subset in the set of text, a part of speech (POS) of each word in the subset, and POSs of one or more words on left and right hand sides of the subset, a number of the one or more words being selected based on a length of the set of text;
generate a feature vector for the subset, the feature vector including the probabilities and additional features of the subset;
encode the probabilities for the subset into the feature vector for the subset and include the encoded probabilities as a feature in the feature vector;
classify, using a classifier, the subset into one of a plurality of classes based on the feature vector for the subset, the plurality of classes representing an ontology of a domain of knowledge; and
perform natural language processing (NLP) of the unstructured data using the encoded probabilities and the classifier including a transfer learning based classifier that uses context information, position features, syntactic information, and a distributional probability model based on POSs to extract and classify concepts from the unstructured data;
wherein the processor is configured to:
train a model using a manually labeled first set of feature vectors generated from the unstructured data; and
automatically label a second set of feature vectors generated from the unstructured data using the trained model,
wherein the second set of feature vectors is larger than the first set of feature vectors by one or more orders of magnitude; and
wherein the processor is configured to:
train the transfer learning based classifier using the larger automatically labeled second set of feature vectors to train low level layers of the transfer learning based classifier and then utilizing the manually labeled first set of feature vectors having a higher quality but smaller quantity than the larger automatically labeled second set of vectors to retrain top level layers of the transfer learning based classifier to improve a score of the transfer learning based classifier with which to classify feature vectors representing additional unstructured data into the plurality of classes.
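
By way of non-limiting illustration, the feature-vector elements of claim 1 (a position of the subset, POSs of the words in the subset, and POSs of a length-dependent number of words on the left and right hand sides) might be sketched in Python as follows. The sketch assumes the text has already been POS-tagged. The names `window_size`, `context_key`, and `PosProbabilityModel`, the window thresholds, the add-one smoothing, and the class labels `part` and `symptom` are all hypothetical choices made for this example and are not taken from the claim.

```python
from collections import Counter

def window_size(n_tokens):
    # Hypothetical rule: longer sets of text get a wider context window.
    if n_tokens <= 6:
        return 1
    if n_tokens <= 15:
        return 2
    return 3

def context_key(pos_tags, start, end):
    # POS tags to the left of, inside, and to the right of the span [start, end).
    k = window_size(len(pos_tags))
    left = tuple(pos_tags[max(0, start - k):start])
    span = tuple(pos_tags[start:end])
    right = tuple(pos_tags[end:end + k])
    return (("left", left), ("span", span), ("right", right))

class PosProbabilityModel:
    """Relative-frequency estimate of P(class | POS pattern), add-one smoothed."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.counts = {}

    def observe(self, pos_tags, start, end, label):
        # Count one labeled example for each of the three POS patterns.
        for key in context_key(pos_tags, start, end):
            self.counts.setdefault(key, Counter())[label] += 1

    def probabilities(self, pos_tags, start, end):
        # One smoothed distribution over classes per POS pattern.
        probs = []
        for key in context_key(pos_tags, start, end):
            c = self.counts.get(key, Counter())
            total = sum(c.values()) + len(self.classes)
            probs.extend((c[label] + 1) / total for label in self.classes)
        return probs

    def feature_vector(self, pos_tags, start, end):
        # Position and length features, followed by the encoded probabilities.
        rel_pos = start / max(1, len(pos_tags) - 1)
        return [rel_pos, end - start] + self.probabilities(pos_tags, start, end)
```

In this sketch, the encoded probabilities occupy the tail of the feature vector, after one position feature and one length feature; a practical system would likely add further features of the subset.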
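
As a further non-limiting illustration, the training sequence recited in claim 1 (train a model on the small manually labeled set, automatically label a set larger by orders of magnitude, train the classifier on the larger set, then retrain only the top level layers on the manually labeled set) might be sketched as below. Everything concrete here is an assumption made for the example: the synthetic two-class data, the nearest-centroid labeling model, the two-layer NumPy network `TinyMLP`, and the hyperparameters are hypothetical and are not described in the claim.

```python
import numpy as np

def make_blobs(n, rng):
    # Synthetic stand-in for feature vectors: two well-separated classes.
    y = np.arange(n) % 2
    X = rng.normal(0.0, 1.0, (n, 4)) + np.where(y[:, None] == 1, 2.0, -2.0)
    return X, y

def nearest_centroid_labeler(X, y):
    # Hypothetical "trained model" used to label the large set automatically.
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    def label(X_new):
        d = ((X_new[:, None, :] - centroids[None]) ** 2).sum(axis=2)
        return classes[d.argmin(axis=1)]
    return label

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class TinyMLP:
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))   # low level layer
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))  # top level layer
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)
        return H, softmax(H @ self.W2 + self.b2)

    def predict(self, X):
        return self.forward(X)[1].argmax(axis=1)

    def fit(self, X, y, epochs=200, lr=0.1, freeze_lower=False):
        Y = np.eye(self.b2.size)[y]          # one-hot targets
        for _ in range(epochs):
            H, P = self.forward(X)
            dZ2 = (P - Y) / len(X)           # cross-entropy gradient at the logits
            dH = dZ2 @ self.W2.T             # back-propagate before updating W2
            self.W2 -= lr * (H.T @ dZ2)
            self.b2 -= lr * dZ2.sum(axis=0)
            if not freeze_lower:             # frozen when retraining the top layer only
                dZ1 = dH * (1.0 - H ** 2)
                self.W1 -= lr * (X.T @ dZ1)
                self.b1 -= lr * dZ1.sum(axis=0)

def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    X_manual, y_manual = make_blobs(20, rng)      # small, manually labeled set
    X_auto, _ = make_blobs(2000, rng)             # large set; true labels discarded
    X_test, y_test = make_blobs(500, rng)

    y_auto = nearest_centroid_labeler(X_manual, y_manual)(X_auto)

    net = TinyMLP(4, 8, 2, rng)
    net.fit(X_auto, y_auto)                       # train all layers on the large set
    W1_before = net.W1.copy()
    net.fit(X_manual, y_manual, freeze_lower=True)  # retrain the top layer only
    return {
        "lower_frozen": bool(np.array_equal(W1_before, net.W1)),
        "accuracy": float((net.predict(X_test) == y_test).mean()),
    }
```

Calling `run_demo()` returns a dictionary reporting whether the low level weights stayed fixed during the second pass and the accuracy on held-out synthetic data. The freeze is implemented by simply skipping the low level update, which is one of several ways such retraining could be arranged.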