CPC G06F 40/284 (2020.01) [G06F 40/30 (2020.01); G06N 20/00 (2019.01)] | 20 Claims |
1. A computer-implemented method for natural language processing, the method comprising:
receiving, by a tokenization module, a base sentence and one or more sentences comprising a semantic perturbation of the base sentence as an input, wherein the semantic perturbation of the base sentence comprises one or more linguistic deviations of the base sentence from a first version;
tokenizing, by the tokenization module, the input to generate a sequence of tokens;
embedding, by a machine learning engine, tokens of the semantic perturbation with tokens of the base sentence as token pairs to generate training data;
classifying, by a classifier, the semantic perturbation of the token pairs to capture relationships of the base sentence and the one or more sentences to generate a classification;
training, by the machine learning engine, a language model based at least in part on the training data and the classification; and
wherein at least one of the receiving, tokenizing, determining, embedding and training is performed by one or more computers.
|
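The steps recited in claim 1 can be illustrated with a minimal sketch, which is not the patented implementation. Everything named here is an assumption made for illustration: the regex tokenizer, the `<pad>` token, positional alignment of tokens, the three labels (`identical`, `substitution`, `length_change`), and the `Example` record. The pairing step stands in for the claimed embedding and produces no vectors, and the final training step is omitted.

```python
import re
from dataclasses import dataclass
from itertools import zip_longest

PAD = "<pad>"  # hypothetical filler token for sentences of unequal length


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into lowercase word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def pair_tokens(base: list[str], perturbed: list[str]) -> list[tuple[str, str]]:
    """Align tokens by position, padding the shorter sequence with PAD."""
    return list(zip_longest(base, perturbed, fillvalue=PAD))


def classify(pairs: list[tuple[str, str]]) -> str:
    """Assign an illustrative label describing how the pairs differ."""
    changed = [(b, p) for b, p in pairs if b != p]
    if not changed:
        return "identical"
    if any(PAD in pair for pair in changed):
        return "length_change"
    return "substitution"


@dataclass
class Example:
    """One training record: the token pairs plus their classification."""
    pairs: list[tuple[str, str]]
    label: str


def build_training_data(base: str, perturbations: list[str]) -> list[Example]:
    """Tokenize, pair, and classify each perturbation against the base sentence."""
    base_tokens = tokenize(base)
    examples = []
    for sentence in perturbations:
        pairs = pair_tokens(base_tokens, tokenize(sentence))
        examples.append(Example(pairs=pairs, label=classify(pairs)))
    return examples


if __name__ == "__main__":
    data = build_training_data(
        "The cat sat on the mat.",
        ["The cat sat on the rug.", "The cat sat."],
    )
    for example in data:
        print(example.label, example.pairs)
```

Positional alignment is the simplest possible pairing and mislabels insertions in the middle of a sentence as length changes; an edit-distance alignment would be a natural refinement, though the claim does not specify one.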