US 12,293,155 B2
Out-of-domain data augmentation for natural language processing
Elias Luqman Jalaluddin, Seattle, WA (US); Vishal Vishnoi, Redwood City, CA (US); Thanh Long Duong, Melbourne (AU); Mark Edward Johnson, Sydney (AU); Poorya Zaremoodi, Melbourne (AU); Gautam Singaraju, Dublin, CA (US); Ying Xu, Albion (AU); Vladislav Blinov, Melbourne (AU); and Yu-Heng Hong, Carlton (AU)
Assigned to Oracle International Corporation, Redwood Shores, CA (US)
Filed by Oracle International Corporation, Redwood Shores, CA (US)
Filed on Apr. 9, 2024, as Appl. No. 18/630,772.
Application 18/630,772 is a continuation of application No. 17/452,743, filed on Oct. 28, 2021, granted, now 12,026,468.
Claims priority of provisional application 63/119,526, filed on Nov. 30, 2020.
Prior Publication US 2024/0256777 A1, Aug. 1, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 40/289 (2020.01); G06F 40/30 (2020.01); G06N 3/08 (2023.01); H04L 51/02 (2022.01)
CPC G06F 40/289 (2020.01) [G06F 40/30 (2020.01); G06N 3/08 (2013.01); H04L 51/02 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving a training set of utterances comprising in-domain examples;
augmenting the training set of utterances with out-of-domain (OOD) examples to generate augmented batches of utterances for training a machine-learning model, wherein the augmenting comprises:
generating a data set of the OOD examples,
filtering out a plurality of OOD examples from the data set of the OOD examples, based on a determination that context of each of the plurality of OOD examples has a substantial similarity to context of one or more of the utterances of the training set of utterances, and
generating the augmented batches of utterances, each of the augmented batches of utterances comprising utterances from the training set of utterances and utterances from the filtered data set of the OOD examples; and
training the machine-learning model using the augmented batches of utterances, wherein the trained machine-learning model is configured to, based on one or more utterances provided as an input by a user, identify an intent from a set of predetermined intents,
wherein the substantial similarity between the context of OOD utterances of the data set of the OOD examples and the context of the utterances of the training set is determined based on a distance measure using a Multilingual Universal Sentence Encoder (MUSE) single embedding, and
wherein, if min(d_i) is less than a predetermined threshold, the context of an OOD utterance of the data set of the OOD examples and the context of an utterance of the training set of utterances are determined to be substantially similar,
where d_i is the Euclidean distance d(v_i, u),
v_i is a vector representation of an utterance x_i of the training set of utterances, v_i = muse(x_i) for i = 1, …, n, and
u is a vector representation of the OOD utterance of the data set of the OOD examples, u = muse(OOD utterance).
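The claimed filtering step can be sketched in code. This is an illustrative sketch only, not the patented implementation: a toy bag-of-words embedding stands in for the MUSE encoder (`embed` here is hypothetical), and the threshold value is arbitrary. An OOD candidate is discarded when its minimum Euclidean distance to any training-set utterance falls below the threshold, i.e., when its context is substantially similar to an in-domain utterance.

```python
import numpy as np

train = ["book a flight", "cancel my flight"]          # in-domain training utterances
ood = ["book a flight now", "the weather is sunny"]    # candidate OOD examples

# Toy vocabulary built from all utterances; in the patent, muse(.) would
# produce a Multilingual Universal Sentence Encoder embedding instead.
vocab = {w: i for i, w in enumerate(
    sorted({w for s in train + ood for w in s.lower().split()}))}

def embed(text):
    """Bag-of-words stand-in for muse(text); returns a fixed-size vector."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        v[vocab[w]] += 1.0
    return v

def filter_ood(train_utterances, ood_candidates, threshold):
    """Keep only OOD candidates NOT substantially similar to training data."""
    train_vecs = [embed(x) for x in train_utterances]   # v_i = muse(x_i), i = 1..n
    kept = []
    for cand in ood_candidates:
        u = embed(cand)                                 # u = muse(OOD utterance)
        # d_i = Euclidean distance between v_i and u; take min over i
        d_min = min(float(np.linalg.norm(v - u)) for v in train_vecs)
        if d_min >= threshold:  # min(d_i) < threshold => substantially similar => drop
            kept.append(cand)
    return kept

print(filter_ood(train, ood, threshold=1.5))
```

With the toy embedding, "book a flight now" lies one unit from "book a flight" and is filtered out, while "the weather is sunny" shares no words with the training set and survives; the surviving OOD examples would then be mixed with in-domain utterances to form the augmented training batches.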