US 12,277,409 B1
Systems, methods, and graphical user interfaces for training a code generation model for low-resource languages
Samuel Paul Leeman-Munk, Durham, NC (US); Xiaozhuo Cheng, Cary, NC (US); and Xiaolong Li, Cary, NC (US)
Assigned to SAS INSTITUTE INC., Cary, NC (US)
Filed by SAS INSTITUTE INC., Cary, NC (US)
Filed on Sep. 24, 2024, as Appl. No. 18/895,119.
Claims priority of provisional application 63/609,236, filed on Dec. 12, 2023.
Claims priority of provisional application 63/601,984, filed on Nov. 22, 2023.
Int. Cl. G06F 11/36 (2006.01); G06F 8/35 (2018.01); G06F 11/3604 (2025.01)

CPC G06F 8/35 (2013.01) [G06F 11/3612 (2013.01)]

30 Claims

1. A computer-program product comprising a non-transitory machine-readable storage medium storing computer instructions that, when executed by one or more processors, perform operations comprising:

identifying a plurality of code synthesis items for a target programming language;

generating a code synthesis prompt based on a first sampling of the plurality of code synthesis items;

synthesizing, via a large language model, a plurality of raw code segments using the code synthesis prompt;

executing the plurality of raw code segments with a code interpreter associated with the target programming language;

determining one or more valid code segments of the plurality of raw code segments that the code interpreter successfully executed;

aggregating, via a second sampling, the one or more valid code segments into one or more validated code synthesis training samples, wherein a respective validated code synthesis training sample of the one or more validated code synthesis training samples at least includes:

a natural language description of a target coding task, and

one or more code segments that implement the target coding task; and

training a code generation model using the one or more validated code synthesis training samples, wherein:

the training the code generation model includes using supervised learning to train the code generation model; and

the supervised learning causes the code generation model to learn to map the natural language description of the target coding task to the one or more code segments that implement the target coding task.
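
The operations recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every name in it (`build_prompt`, `fake_llm`, `runs_cleanly`, `build_training_samples`, `to_supervised_pair`, `run_pipeline`) is hypothetical. It rests on three assumptions: Python stands in for the target programming language so the example can run anywhere, a stub stands in for the large language model, and a simple caller-supplied function produces the natural language description, since the claim does not say how that description is obtained.

```python
import random
import subprocess
import sys
import tempfile
from pathlib import Path


def build_prompt(items, k, rng):
    """First sampling: pick k code synthesis items and build a prompt from them."""
    chosen = rng.sample(items, k)
    return "Write a short program that uses: " + ", ".join(chosen)


def fake_llm(prompt, n):
    """Hypothetical stand-in for the large language model call.

    Returns n raw code segments; the last one is deliberately broken.
    """
    good = [f"print({i} + {i})" for i in range(n - 1)]
    return good + ["print(undefined_name)"]


def runs_cleanly(code, timeout=10):
    """Execute one segment with the interpreter; True if it exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "segment.py"
        path.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(path)],  # assumption: Python as target language
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


def build_training_samples(valid, per_sample, n_samples, rng, describe):
    """Second sampling: group valid segments into description/code samples."""
    samples = []
    for _ in range(n_samples):
        segments = rng.sample(valid, min(per_sample, len(valid)))
        samples.append({"description": describe(segments), "code": segments})
    return samples


def to_supervised_pair(sample):
    """Format one sample as an (input, target) pair for supervised training."""
    return sample["description"], "\n".join(sample["code"])


def run_pipeline(seed=0):
    rng = random.Random(seed)
    items = ["loops", "string formatting", "arithmetic", "conditionals"]
    prompt = build_prompt(items, 2, rng)
    raw_segments = fake_llm(prompt, 4)
    # Keep only the segments that the interpreter executed successfully.
    valid = [code for code in raw_segments if runs_cleanly(code)]
    samples = build_training_samples(
        valid, per_sample=2, n_samples=3, rng=rng,
        describe=lambda segs: f"Print the results of {len(segs)} additions.",
    )
    return valid, samples


if __name__ == "__main__":
    valid, samples = run_pipeline()
    for source, target in map(to_supervised_pair, samples):
        print(repr(source), "->", repr(target))
```

The resulting (input, target) pairs are the kind of data a supervised fine-tuning step would consume. The training loop itself is omitted because the claim does not fix a model architecture or training framework.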