CPC G06F 21/554 (2013.01) [G06N 3/04 (2013.01); G06N 20/00 (2019.01); G06F 2221/034 (2013.01)] | 20 Claims
1. A computer-implemented method for detecting adversarial attacks on a machine-learning (ML) system, the method comprising:
receiving, by an ML model of the ML system, input data;
processing, by the ML model, the input data to generate output data;
receiving, by an adversarial detection module of the ML system, both the input data and the output data;
inputting perturbed input data and the output data into a neural fingerprinting model included in the adversarial detection module, wherein the perturbed input data is generated by introducing a set of predefined random perturbations into the input data;
generating, by the neural fingerprinting model, perturbed output data based on the perturbed input data;
determining, using the neural fingerprinting model, an adversarial score indicating whether the perturbed output data matches expected perturbed output data for a class of data associated with the input data and the output data; and
performing one or more remedial actions based on the adversarial score.
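The claimed steps can be illustrated with a minimal sketch of neural fingerprinting: apply a fixed set of predefined random perturbations to the input, observe how the model's output shifts, and compare those shifts against the expected "fingerprint" for the predicted class. Everything below (the toy linear model, the perturbation scale, the L2 distance as the adversarial score, the threshold, and all names such as `fingerprint` and `remediate`) is an illustrative assumption, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the ML model: a 3-class linear softmax classifier.
W = rng.normal(size=(3, 4))

def model(x):
    """Toy classifier: returns softmax class scores for a 4-d input."""
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

# Set of predefined random perturbations, fixed once up front (per the claim).
PERTURBATIONS = rng.normal(scale=0.05, size=(5, 4))

def fingerprint(x):
    """Output deltas induced by each predefined perturbation of x."""
    y = model(x)
    return np.stack([model(x + d) - y for d in PERTURBATIONS])

# Expected per-class fingerprints, estimated here from clean example inputs
# (in practice these would come from trusted training data).
prototypes = [rng.normal(size=4) for _ in range(6)]
REFERENCE = {}
for p in prototypes:
    c = int(np.argmax(model(p)))
    REFERENCE.setdefault(c, fingerprint(p))

def adversarial_score(x):
    """Distance between the observed perturbed-output deltas and the
    expected fingerprint for the predicted class; higher = more suspect."""
    c = int(np.argmax(model(x)))
    if c not in REFERENCE:
        return float("inf")
    return float(np.linalg.norm(fingerprint(x) - REFERENCE[c]))

THRESHOLD = 0.1  # hypothetical decision boundary

def remediate(x):
    """One possible remedial action: reject inputs that score too high."""
    return "reject" if adversarial_score(x) > THRESHOLD else "accept"
```

A clean input that produced the stored reference fingerprint scores near zero and is accepted, while an input whose perturbation response deviates from its class fingerprint scores higher and can be rejected, logged, or routed for review.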