US 12,073,299 B2
Systems and methods for using contrastive pre-training to generate text and code embeddings
Arvind Neelakantan, San Francisco, CA (US); and Tao Xu, San Francisco, CA (US)
Assigned to OpenAI OpCo, LLC, San Francisco, CA (US)
Filed by OpenAI OpCo, LLC, San Francisco, CA (US)
Filed on Jan. 23, 2023, as Appl. No. 18/158,166.
Prior Publication US 2024/0249186 A1, Jul. 25, 2024
Int. Cl. G06N 20/00 (2019.01)

CPC G06N 20/00 (2019.01)

20 Claims

1. A computer-implemented method for generating a semantic similarity based on a vector representation, the method comprising:

receiving a training data set extracted from unlabeled data, the training data set including a plurality of paired data samples corresponding to positive example pairs, each positive example pair including a first data unit and a second data unit, wherein the first data unit and the second data unit are located within a predetermined distance threshold of each other within the unlabeled data;

converting the paired data samples corresponding to the positive example pairs into at least one first vector of a vector representation;

accessing one or more negative example pairs within the training data set to contrast against the positive example pairs;

converting the one or more negative example pairs into one or more second vectors of the vector representation; and

training a machine learning model to generate additional vectors of the vector representation, wherein the training comprises:

initializing the machine learning model with one or more pre-trained models, the one or more pre-trained models comprising generative language models; and

training the machine learning model using contrastive training based on: the at least one first vector of the vector representation and the one or more second vectors of the vector representation;

receiving a query for semantic similarity, the query including a natural language input; and

generating, with the machine learning model and according to an embedding space, a semantic similarity result in response to the query.
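
As an illustration only, the steps recited in claim 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation. The helper names (`adjacent_pairs`, `in_batch_contrastive_loss`, `rank_by_similarity`), the temperature value, the use of cosine similarity, and the choice to treat the other pairs in a batch as negatives are all hypothetical. The claim itself only requires positive pairs located within a predetermined distance threshold, one or more negative pairs, and contrastive training. The vectors here are supplied by hand; in the claimed method they would come from a model initialized from pre-trained generative language models.

```python
import math

def adjacent_pairs(units, max_distance=1):
    """Build positive pairs from units at most max_distance positions apart.

    Assumption: "distance" is measured in positions within an ordered list.
    """
    pairs = []
    for i in range(len(units)):
        last = min(i + max_distance, len(units) - 1)
        for j in range(i + 1, last + 1):
            pairs.append((units[i], units[j]))
    return pairs

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def in_batch_contrastive_loss(first_vecs, second_vecs, temperature=0.1):
    """Mean cross-entropy over similarity logits.

    Row i's positive is second_vecs[i]; every other second vector in the
    batch is treated as a negative (an assumed negative-sampling scheme).
    """
    n = len(first_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(first_vecs[i], second_vecs[j]) / temperature
                  for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / n

def rank_by_similarity(query_vec, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    return sorted(candidates,
                  key=lambda name: cosine(query_vec, candidates[name]),
                  reverse=True)
```

In this sketch the loss is low when each first vector lies closest to its own paired second vector and high when it lies closer to another pair's vector, which is the behavior contrastive training is meant to encourage. A semantic similarity query then reduces to embedding the natural language input and ranking stored vectors with `rank_by_similarity`.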