US 12,450,428 B2
Systems and methods for semantic code search
Akhilesh Deepak Gotmare, Singapore (SG); Junnan Li, Singapore (SG); Shafiq Rayhan Joty, Singapore (SG); and Chu Hong Hoi, Singapore (SG)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce.com, Inc., San Francisco, CA (US)
Filed on Nov. 19, 2021, as Appl. No. 17/531,591.
Claims priority of provisional application 63/189,854, filed on May 18, 2021.
Prior Publication US 2022/0374595 A1, Nov. 24, 2022
Int. Cl. G06F 40/226 (2020.01); G06F 40/151 (2020.01); G06F 40/30 (2020.01); G06F 40/40 (2020.01)

CPC G06F 40/226 (2020.01) [G06F 40/151 (2020.01); G06F 40/30 (2020.01); G06F 40/40 (2020.01)]

20 Claims

1. A method for a natural language code search system, the method comprising:

receiving, via a communication interface, a training corpus of bimodal pairs, wherein at least one training pair from the training corpus includes a natural language description and a corresponding programming language snippet;

encoding, at a beginning of a first training epoch for training a code search neural network model using the training corpus, by an encoder associated with a first set of model parameters of the code search neural network model implemented on one or more processors, the training corpus of natural language descriptions and programming language snippet into a set of representations in a feature space, wherein the natural language description is encoded into a natural language representation and the corresponding programming language snippet is encoded into a programming language representation, wherein the natural language representation and the programming language representation form a positive pair;

determining, by the one or more processors, a set of nearest neighbor representations to at least one of the natural language representation or the programming language representation among the set of representations in the feature space of the training corpus;

forming a set of negative pairs, including forming each negative pair in the set of negative pairs to include the natural language representation or the programming language representation, and one nearest neighbor representation in the set of nearest neighbor representations;

training the code search neural network model including updating the first set of model parameters at an end of the first training epoch based at least in part on a contrastive learning loss comparing the positive pair and the set of negative pairs;

dynamically re-encoding, at a next training epoch for training the code search neural network model using the training corpus, by the encoder associated with the updated first set of model parameters, the training corpus of natural language descriptions and programming language snippet into an updated set of representations in the feature space;

training the code search neural network model at the next training epoch using the contrastive learning loss of the positive pair and the set of negative pairs selected based on the updated set of representations;

receiving, by the communication interface, a natural language search query for a programming language snippet; and

outputting, by the trained code search neural network model, the programming language snippet based on an input of the natural language search query.
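
The negative-pair selection and contrastive loss recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes cosine similarity as the distance in the feature space, an InfoNCE-style form for the contrastive learning loss, and neighbors drawn only from the programming language representations for each natural language anchor. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mine_hard_negatives(nl_vecs, pl_vecs, k):
    """For each natural language vector i, return the indices of the k most
    similar code vectors, excluding index i (its own positive pair).
    Assumption: neighbors are searched among code vectors only."""
    negatives = {}
    for i, anchor in enumerate(nl_vecs):
        scored = [(cosine(anchor, c), j) for j, c in enumerate(pl_vecs) if j != i]
        scored.sort(key=lambda t: (-t[0], t[1]))  # most similar first
        negatives[i] = [j for _, j in scored[:k]]
    return negatives

def contrastive_loss(nl_vecs, pl_vecs, i, negative_ids, temperature=0.1):
    """InfoNCE-style loss (an assumed form) for anchor i: the positive is
    pl_vecs[i], the negatives are pl_vecs[j] for j in negative_ids."""
    pos = math.exp(cosine(nl_vecs[i], pl_vecs[i]) / temperature)
    neg = sum(math.exp(cosine(nl_vecs[i], pl_vecs[j]) / temperature)
              for j in negative_ids)
    return -math.log(pos / (pos + neg))

# Toy representations standing in for encoder outputs at the start of an epoch.
nl_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pl_vecs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]

hard = mine_hard_negatives(nl_vecs, pl_vecs, k=1)
loss_hard = contrastive_loss(nl_vecs, pl_vecs, 0, hard[0])
loss_easy = contrastive_loss(nl_vecs, pl_vecs, 0, [1])
```

In a training loop following the claim, the vectors would be recomputed by the encoder with its updated parameters at the next epoch and `mine_hard_negatives` would be called again, so the negative set tracks the current feature space. On the toy data, the nearest-neighbor negative gives a larger loss than a dissimilar one, so it contributes a stronger training signal.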