| CPC G06F 16/243 (2019.01) [G06F 40/40 (2020.01)] | 20 Claims |

|
1. A method comprising:
generating an alignment score for a test language model from input data comprising a plurality of triplet data structures,
wherein each of the plurality of triplet data structures comprises a prompt for the test language model, a response generated by the test language model using the prompt, and an evaluation score quantifying an indication of pass or fail of the response with respect to an alignment of the response with a metric;
identifying, responsive to the alignment score failing to satisfy a score threshold, a fail triplet data structure in the plurality of triplet data structures for which the evaluation score comprises the indication of fail, wherein the fail triplet data structure comprises a fail prompt, a fail response, and a fail score;
executing a judge language model on the fail triplet data structure to output a type of misalignment that the test language model produced when the test language model executed on the fail prompt;
re-executing the judge language model on a combination of the fail triplet data structure and the type of misalignment to output a cause of the fail response;
generating an enhanced prompt by commanding the judge language model to modify, based on the cause, the fail prompt such that, when the test language model is executed on the enhanced prompt, a new response of the test language model is predicted to be an aligned response; and
returning the enhanced prompt.
|
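For illustration only, the steps recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `Triplet`, `alignment_score`, `enhance_prompt`, and `stub_judge` are hypothetical, the alignment score is assumed to be the fraction of passing triplets, the score threshold of 0.9 is arbitrary, and the judge language model is modeled as a plain text-in, text-out callable with a canned stub standing in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Triplet:
    """Hypothetical triplet data structure: prompt, response, pass/fail score."""
    prompt: str
    response: str
    passed: bool  # evaluation score reduced to a pass/fail indication


# Assumption: the judge language model is any callable mapping text to text.
Judge = Callable[[str], str]


def alignment_score(triplets: List[Triplet]) -> float:
    """Assumed aggregate: fraction of triplets whose evaluation score is 'pass'."""
    if not triplets:
        raise ValueError("at least one triplet is required")
    return sum(t.passed for t in triplets) / len(triplets)


def enhance_prompt(triplets: List[Triplet], judge: Judge,
                   threshold: float = 0.9) -> Optional[str]:
    """Return an enhanced prompt for the first failing triplet, or None."""
    # Proceed only when the alignment score fails to satisfy the threshold.
    if alignment_score(triplets) >= threshold:
        return None
    # Identify a fail triplet (here simply the first one found).
    fail = next((t for t in triplets if not t.passed), None)
    if fail is None:
        return None
    context = f"Prompt: {fail.prompt}\nResponse: {fail.response}\nScore: fail"
    # First judge execution: type of misalignment.
    misalignment = judge(f"{context}\nName the type of misalignment.")
    # Second judge execution: cause, given the triplet plus the type.
    cause = judge(f"{context}\nMisalignment type: {misalignment}\n"
                  "Explain the cause of the fail response.")
    # Third judge execution: modify the fail prompt based on the cause.
    return judge(f"Cause: {cause}\n"
                 "Rewrite this prompt so the response is likely to be aligned:\n"
                 f"{fail.prompt}")


def stub_judge(text: str) -> str:
    """Canned stand-in for a judge model, used only to exercise the flow."""
    if text.startswith("Cause:"):
        return "ENHANCED: " + text.splitlines()[-1]
    if "Misalignment type:" in text:
        return "prompt lacks a grounding instruction"
    return "hallucination"


if __name__ == "__main__":
    data = [
        Triplet("Summarize the policy.", "A faithful summary.", True),
        Triplet("What is the refund policy?", "An invented answer.", False),
    ]
    print(enhance_prompt(data, stub_judge))
```

In this sketch the three judge calls are kept separate so that each later call receives the earlier output, mirroring the order of the executing, re-executing, and generating steps. A real system would replace `stub_judge` with a call to an actual judge model and could select the fail triplet by any other criterion.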