US 12,235,955 B2
Method and system for detecting model manipulation through explanation poisoning
Allan Anzagira, Long Island City, NY (US); Freddy Lecue, Mamaroneck, NY (US); Daniele Magazzeni, London (GB); and Saumitra Mishra, London (GB)
Assigned to JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed by JPMorgan Chase Bank, N.A., New York, NY (US)
Filed on Jan. 13, 2023, as Appl. No. 18/096,873.
Prior Publication US 2024/0241952 A1, Jul. 18, 2024
Int. Cl. G06F 21/56 (2013.01)
CPC G06F 21/56 (2013.01) 20 Claims
OG exemplary drawing
 
1. A method for detecting attempted manipulation of a machine learning model via explanation poisoning, the method being implemented by at least one processor, the method comprising:
receiving, by the at least one processor, a set of raw data that is usable for training a first model;
training, by the at least one processor, the first model by using the set of raw data;
selecting, by the at least one processor, a set of target data based on the set of raw data;
computing, by the at least one processor, a first explanation based on an output of the first model with respect to a first data point included in the set of target data, the first explanation including first information that relates to at least one first feature that affects the output of the first model with respect to the first data point;
computing, by the at least one processor, a second explanation based on an output of the first model with respect to a second data point included in the set of target data, the second explanation including second information that relates to at least one second feature that affects the output of the first model with respect to the second data point;
assigning, by the at least one processor based on the at least one first feature, a first label to the first explanation, and assigning, by the at least one processor based on the at least one second feature, a second label to the second explanation;
generating, by the at least one processor based on the first label and the second label, an explanation ensemble that resides in an N-dimensional space, N being equal to a number of assigned labels plus one;
transforming, by the at least one processor, the set of raw data into data that resides in the N-dimensional space;
determining, by the at least one processor based on the transformed set of raw data, a region within the N-dimensional space for which a subsequent introduction of additional data from the set of target data causes a subsequent explanation that does not relate to at least one from among the at least one first feature and the at least one second feature; and
when the additional data is introduced to the determined region, generating, by the at least one processor, an alert message that includes information for notifying a user that a likelihood of adverse manipulation of the first model is high based on the additional data.
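The claimed steps can be sketched end to end in plain Python. This is an illustrative toy, not the patented implementation: the logistic-regression "first model," the linear per-feature attribution used as an explanation, the dominant-feature labeling rule, the residual-axis construction of the N-dimensional ensemble space, and the region test (explanation mass concentrated on the residual axis) are all assumptions chosen to make the pipeline concrete. Feature names (`amount`, `frequency`, `tenure`) and all data are hypothetical.

```python
import math

FEATURES = ["amount", "frequency", "tenure"]


def train_first_model(raw, epochs=200, lr=0.1):
    """Train a tiny logistic regression ('first model') on the raw data."""
    w = {f: 0.0 for f in FEATURES}
    b = 0.0
    for _ in range(epochs):
        for x, y in raw:
            z = b + sum(w[f] * x[f] for f in FEATURES)
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. z
            for f in FEATURES:
                w[f] -= lr * g * x[f]
            b -= lr * g
    return w, b


def explain(w, x):
    """Explanation = per-feature contributions w_i * x_i (linear attribution)."""
    return {f: w[f] * x[f] for f in FEATURES}


def label_explanation(expl):
    """Assign a label: the feature with the largest-magnitude contribution."""
    return max(expl, key=lambda f: abs(expl[f]))


def ensemble_coords(expl, labels):
    """Map an explanation into the N-dimensional ensemble space:
    one axis per assigned label plus one residual axis, so N = len(labels) + 1."""
    return [expl[l] for l in labels] + [
        sum(v for f, v in expl.items() if f not in labels)
    ]


def in_suspicious_region(coords):
    """The determined region: the explanation's mass sits on the residual axis,
    i.e. it does not relate to any previously labeled feature."""
    return abs(coords[-1]) > max(abs(c) for c in coords[:-1])


# Raw training data (hypothetical): class 1 when amount/frequency are high
# and tenure is low, so the trained weights separate the features cleanly.
raw = [
    ({"amount": 1.0, "frequency": 0.8, "tenure": 0.1}, 1),
    ({"amount": 0.9, "frequency": 1.1, "tenure": 0.2}, 1),
    ({"amount": 1.2, "frequency": 0.7, "tenure": 0.3}, 1),
    ({"amount": 0.2, "frequency": 0.1, "tenure": 1.0}, 0),
    ({"amount": 0.1, "frequency": 0.3, "tenure": 1.2}, 0),
    ({"amount": 0.3, "frequency": 0.2, "tenure": 0.9}, 0),
]
w, b = train_first_model(raw)

# Target data: compute and label an explanation for each point.
target = [
    {"amount": 2.0, "frequency": 0.1, "tenure": 0.1},  # first data point
    {"amount": 0.1, "frequency": 3.0, "tenure": 0.1},  # second data point
]
labels = []
for x in target:
    lbl = label_explanation(explain(w, x))
    if lbl not in labels:
        labels.append(lbl)

# Additional data: a benign point, and a point whose explanation is dominated
# by a feature outside the labeled set (possible explanation poisoning).
benign = {"amount": 1.5, "frequency": 0.2, "tenure": 0.2}
poisoned = {"amount": 0.05, "frequency": 0.05, "tenure": 6.0}

for x in (benign, poisoned):
    coords = ensemble_coords(explain(w, x), labels)
    if in_suspicious_region(coords):
        print("ALERT: high likelihood of adverse model manipulation:", x)
```

Under these assumptions, the benign point's explanation stays aligned with the labeled axes, while the poisoned point concentrates its attribution on the residual axis and triggers the alert, mirroring the claim's final limitation.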