| CPC G06F 40/40 (2020.01) | 20 Claims |

|
1. A computer-implemented method comprising:
receiving first input data that was previously processed by a large language model (LLM) to generate an undesired response, the first input data corresponding to a first natural language input;
receiving first output data representing a desired response to the first input data;
determining first model loss data corresponding to processing of the first input data by the LLM, the first model loss data being based on the first output data;
determining, using the first model loss data, a first plurality of gradients corresponding to a first layer of the LLM and a second plurality of gradients corresponding to a second layer of the LLM;
determining, by combining the first plurality of gradients, a first value corresponding to the first layer;
determining, by combining the second plurality of gradients, a second value corresponding to the second layer;
determining, based on a comparison of the first value and the second value, that processing by the first layer resulted in the undesired response;
in response to determining that processing by the first layer resulted in the undesired response, determining an updated first layer by modifying a first plurality of weights of the first layer, wherein the updated first layer is configured to cause generation of the first output data when processing data corresponding to the first natural language input;
determining an updated LLM based on the LLM and including the updated first layer instead of the first layer; and
using the updated LLM to process second input data.
|
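The steps recited in claim 1 can be illustrated with a minimal sketch on a toy two-layer model in plain Python. The claim does not specify how gradients are combined, how the values are compared, or how the weights are modified. The sketch therefore assumes a squared-error loss, an L2 norm as the combining function, a "largest value" comparison, and plain gradient descent restricted to the selected layer. All names (`forward`, `layer_gradients`, `combine`, `edit_model`) and all numeric values are hypothetical and are not taken from the patent.

```python
import math

def forward(model, x):
    """Toy two-layer model: layer1 scales each input, layer2 takes a weighted sum."""
    hidden = [a * xi for a, xi in zip(model["layer1"], x)]
    return sum(b * h for b, h in zip(model["layer2"], hidden))

def layer_gradients(model, x, desired):
    """Per-layer gradients of the assumed loss 0.5 * (output - desired) ** 2."""
    err = forward(model, x) - desired  # derivative of the loss w.r.t. the output
    a, b = model["layer1"], model["layer2"]
    return {
        "layer1": [err * bi * xi for bi, xi in zip(b, x)],
        "layer2": [err * ai * xi for ai, xi in zip(a, x)],
    }

def combine(grads):
    """Combine a layer's gradients into one value (assumption: L2 norm)."""
    return math.sqrt(sum(g * g for g in grads))

def edit_model(model, x, desired, lr=0.05, steps=100):
    """Pick the layer with the largest combined gradient and update only it."""
    grads = layer_gradients(model, x, desired)
    values = {name: combine(g) for name, g in grads.items()}
    culprit = max(values, key=values.get)  # assumed comparison rule
    # Copy the model so the original is left intact; only `culprit` is replaced.
    updated = {name: list(w) for name, w in model.items()}
    for _ in range(steps):
        g = layer_gradients(updated, x, desired)[culprit]
        updated[culprit] = [w - lr * gw for w, gw in zip(updated[culprit], g)]
    return updated, culprit, values

# Hypothetical data: the original model outputs 4.5 where 3.0 is desired.
model = {"layer1": [1.0, 1.0], "layer2": [0.5, 2.0]}
first_input, desired = [1.0, 2.0], 3.0
updated, culprit, values = edit_model(model, first_input, desired)

# The updated model is then used to process second input data.
second_output = forward(updated, [2.0, 1.0])
```

In this sketch the unselected layer is copied unchanged into the updated model, which mirrors the claim's "including the updated first layer instead of the first layer". A real LLM would require an automatic differentiation framework and per-layer parameter groups rather than hand-written gradients.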