US 12,248,767 B2
Code generation through reinforcement learning using code-quality rewards
Shao Kun Deng, New York City, NY (US); Neelakantan Sundaresan, Bellevue, WA (US); Alexey Svyatkovskiy, Bellevue, WA (US); and Michele Tufano, Bellevue, WA (US)
Assigned to Microsoft Technology Licensing, LLC., Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC., Redmond, WA (US)
Filed on Feb. 20, 2024, as Appl. No. 18/582,248.
Application 18/582,248 is a continuation of application No. 17/555,263, filed on Dec. 17, 2021, granted, now Pat. No. 11,941,373.
Prior Publication US 2024/0192927 A1, Jun. 13, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 8/33 (2018.01); G06F 18/21 (2023.01); G06N 3/04 (2023.01)
CPC G06F 8/33 (2013.01) [G06F 18/217 (2023.01); G06N 3/04 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A system comprising:
a processor; and
a memory that stores a program configured to be executed by the processor, the program comprising instructions to perform actions that:
access a first deep learning model previously trained to generate source code for a first source code task, wherein the first deep learning model comprises parameters learned through cross-entropy loss;
tune the parameters of the first deep learning model to train a second deep learning model to learn to generate source code for a second source code task, wherein to tune the parameters of the first deep learning model to train the second deep learning model, the program comprises instructions to perform actions that:
input a training sample to the first deep learning model and to the second deep learning model, wherein the first deep learning model predicts a first predicted source code snippet over T timesteps, wherein the second deep learning model predicts a second predicted source code snippet over T timesteps;
compute a code-quality reward for the second predicted source code snippet, wherein the code-quality reward is based on syntax correctness of the second predicted source code snippet, successful execution of the second predicted source code snippet, successful compilation of the second predicted source code snippet, and successful invocation of the second predicted source code snippet;
compute a reward for the second predicted source code snippet at each timestep t based on a divergence between an output distribution from the first deep learning model at each timestep t and an output distribution from the second deep learning model at each timestep t;
add the code-quality reward to the reward of the last timestep;
compute a policy loss based on the rewards of each timestep t; and
backpropagate the policy loss to the second deep learning model to adjust the parameters of the second deep learning model; and
deploy the second deep learning model in an inference system to generate source code for the second source code task.
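The following is a minimal, illustrative sketch of a code-quality reward of the kind recited in the claim, combining syntax correctness, compilation, execution, and invocation of a predicted snippet. It is not the patented implementation; the function name `code_quality_reward`, the equal 0.25 weights, the use of Python's `ast.parse`, `compile`, and a subprocess run, and the optional `invocation` string are all assumptions made for illustration.

```python
import ast
import subprocess
import tempfile


def code_quality_reward(snippet: str, invocation: str | None = None) -> float:
    """Hypothetical scalar reward in [0, 1] combining the four signals named
    in the claim: syntax correctness, compilation, execution, and invocation.
    The individual weights (0.25 each) are illustrative, not from the patent."""
    reward = 0.0

    # 1. Syntax correctness: does the snippet parse?
    try:
        ast.parse(snippet)
        reward += 0.25
    except SyntaxError:
        return reward  # later checks cannot succeed without valid syntax

    # 2. Compilation: does the snippet compile to bytecode?
    try:
        compile(snippet, "<generated>", "exec")
        reward += 0.25
    except Exception:
        return reward

    # 3. Execution: does the snippet run to completion in a subprocess?
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet)
        path = f.name
    try:
        if subprocess.run(["python", path], capture_output=True,
                          timeout=10).returncode == 0:
            reward += 0.25
    except subprocess.TimeoutExpired:
        return reward

    # 4. Invocation: does calling the generated code (e.g. from a test) succeed?
    if invocation is not None:
        try:
            result = subprocess.run(["python", "-c", snippet + "\n" + invocation],
                                    capture_output=True, timeout=10)
            if result.returncode == 0:
                reward += 0.25
        except subprocess.TimeoutExpired:
            pass

    return reward
```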
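And here is a hedged PyTorch-style sketch of one tuning step in the spirit of the claim: the frozen first (reference) model and the trainable second (policy) model both score the sampled tokens, the per-timestep reward penalizes divergence between their output distributions, the code-quality reward is added at the last timestep, and a policy loss computed from the per-timestep rewards is backpropagated into the second model. The names `policy_model`, `ref_model`, `decode`, the coefficient `beta`, the per-token log-ratio used as the divergence term, the 0.99 discount, and the REINFORCE-style loss are all assumptions standing in for the claim's unspecified details.

```python
import torch
import torch.nn.functional as F


def rl_finetune_step(policy_model, ref_model, input_ids, optimizer,
                     beta: float = 0.1, max_new_tokens: int = 64):
    """One illustrative tuning step: sample a snippet from the policy model,
    reward each timestep with a divergence penalty against the frozen
    reference model, add the code-quality reward at the last timestep, and
    backpropagate a policy-gradient loss (assumed HuggingFace-style models)."""
    policy_model.train()
    ref_model.eval()

    generated = input_ids
    log_probs, rewards = [], []

    for t in range(max_new_tokens):
        # Output distributions of both models at timestep t.
        policy_logits = policy_model(generated).logits[:, -1, :]
        with torch.no_grad():
            ref_logits = ref_model(generated).logits[:, -1, :]
        policy_log_dist = F.log_softmax(policy_logits, dim=-1)
        ref_log_dist = F.log_softmax(ref_logits, dim=-1)

        # Sample the next token from the second (policy) model.
        next_token = torch.multinomial(policy_log_dist.exp(), num_samples=1)
        log_probs.append(policy_log_dist.gather(-1, next_token).squeeze(-1))

        # Per-timestep reward: penalize divergence from the reference model
        # (per-token log-ratio, an assumed stand-in for the divergence term).
        kl_t = (policy_log_dist.gather(-1, next_token)
                - ref_log_dist.gather(-1, next_token)).squeeze(-1)
        rewards.append(-beta * kl_t)

        generated = torch.cat([generated, next_token], dim=-1)

    # Add the code-quality reward to the reward of the last timestep.
    snippet = decode(generated)  # hypothetical detokenizer
    rewards[-1] = rewards[-1] + code_quality_reward(snippet)

    # Discounted returns (0.99 is an assumed discount) and policy loss.
    returns, running = [], torch.zeros_like(rewards[-1])
    for r in reversed(rewards):
        running = r + 0.99 * running
        returns.insert(0, running)
    policy_loss = -torch.stack(
        [lp * ret.detach() for lp, ret in zip(log_probs, returns)]).mean()

    # Backpropagate the policy loss into the second model's parameters.
    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()
    return policy_loss.item()
```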