US 12,493,795 B2
Systems and methods for unsupervised training in text retrieval tasks
Rui Meng, San Francisco, CA (US); Yingbo Zhou, Palo Alto, CA (US); Ye Liu, Fremont, CA (US); Semih Yavuz, Redwood City, CA (US); and Ning Yu, Palo Alto, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on Apr. 19, 2023, as Appl. No. 18/303,313.
Claims priority of provisional application 63/387,673, filed on Dec. 15, 2022.
Prior Publication US 2024/0202530 A1, Jun. 20, 2024
Int. Cl. G06N 3/084 (2023.01); G06F 40/20 (2020.01); G06F 40/40 (2020.01); G06N 3/0455 (2023.01); G06N 3/088 (2023.01)
CPC G06N 3/084 (2013.01) [G06F 40/20 (2020.01); G06F 40/40 (2020.01); G06N 3/0455 (2023.01); G06N 3/088 (2013.01)] 15 Claims
OG exemplary drawing
 
1. A method of training a text retrieval model, the method comprising:
receiving, via a data interface, a plurality of text documents;
generating, by a processor, a query corresponding to at least one text document from the plurality of text documents, wherein the generating includes at least one of:
(a) extracting a text span from the at least one text document as the query based on a relevance level between the extracted text span and the at least one text document; or
(b) generating, by a pre-trained language model, a text output as the query based on an input of the at least one text document conditioned with a pre-defined prompt;
selecting a negative sample document from the plurality of text documents;
computing a first loss objective based on the query, the at least one text document, and the negative sample document;
training the text retrieval model by updating parameters of the text retrieval model based on the computed first loss objective via backpropagation;
receiving, via the data interface, a second plurality of text documents;
computing a second loss objective using at least one text document of the second plurality of text documents as a finetuning dataset, wherein the second plurality of text documents are annotated with a plurality of queries, respectively, and wherein the second loss objective is a contrastive loss based on the plurality of queries and the second plurality of text documents; and
finetuning the trained text retrieval model by updating the parameters of the text retrieval model based on the computed second loss objective.
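The pretraining steps recited in claim 1 — extracting a query span from a document by relevance, pairing it with that document as a positive and another document as a negative, and computing a contrastive loss — can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the hash-based `embed` function is a toy stand-in for the text retrieval model's encoder, and the names `extract_query` and `contrastive_loss` are hypothetical.

```python
import math
import numpy as np

def embed(text, dim=8):
    # Toy deterministic "encoder": maps text to a unit vector.
    # A real text retrieval model would use a trained neural encoder.
    rng = np.random.default_rng(sum(ord(c) for c in text))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def extract_query(doc):
    # Option (a) of the claim: extract the text span most relevant
    # to the document (here, the sentence whose embedding has the
    # highest dot-product similarity with the document embedding).
    spans = [s for s in doc.split(". ") if s]
    doc_vec = embed(doc)
    return max(spans, key=lambda s: float(embed(s) @ doc_vec))

def contrastive_loss(q, pos, neg, tau=0.05):
    # InfoNCE-style contrastive loss over one positive (the source
    # document) and one sampled negative document, as in the claim's
    # first loss objective. tau is a temperature hyperparameter.
    sims = np.array([float(q @ pos), float(q @ neg)]) / tau
    sims -= sims.max()              # numerical stability
    exp = np.exp(sims)
    return -math.log(exp[0] / exp.sum())

# Unsupervised training example: no human-written queries needed.
docs = [
    "Transformers encode text into dense vectors. Retrieval ranks documents by vector similarity",
    "Gradient descent updates parameters to minimize a loss function",
]
query = extract_query(docs[0])                       # self-supervised query
loss = contrastive_loss(embed(query),                # query embedding
                        embed(docs[0]),              # positive document
                        embed(docs[1]))              # negative sample
```

In the patented method, this loss would then be backpropagated to update the retrieval model's parameters; the subsequent finetuning stage replaces the synthetic queries with annotated ones while keeping the same contrastive objective.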