US 12,328,331 B1
Detection of privacy attacks on machine learning models
Hengrui Jia, Toronto (CA); Ravikumar Balakrishnan, Beaverton, OR (US); Zeliang Kan, London (GB); and Jason Martin, Beaverton, OR (US)
Assigned to HiddenLayer, Inc., Austin, TX (US)
Filed by HiddenLayer, Inc., Austin, TX (US)
Filed on Feb. 4, 2025, as Appl. No. 19/045,368.
Int. Cl. H04L 9/40 (2022.01)
CPC H04L 63/1425 (2013.01) [H04L 63/1416 (2013.01); H04L 63/1441 (2013.01)]
29 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
inputting each of a plurality of queries into an artificial intelligence (AI) model comprising a plurality of layers including an input layer, an output layer, and at least one intermediate layer between the input layer and the output layer, each intermediate layer, during inference, outputting a plurality of activations;
intercepting, for each query, activations from at least one of the intermediate layers;
determining whether a distribution of the intercepted activations across the queries indicates that the queries seek to cause the AI model to behave in an undesired manner by conducting a distance-based similarity analysis between the intercepted activations and reference activations, the distance-based similarity analysis comprising:
generating a first metric which characterizes overall divergence between the intercepted activations and the reference activations; and
generating a second metric which groups queries by class predictions and calculates a distance for each group relative to reference activations of a same class; and
providing data characterizing the determining to a consuming application or process.
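
The interception step of claim 1 can be realized with standard framework hooks. Below is a minimal sketch assuming a PyTorch model; the ActivationInterceptor class, the layer names, and the toy Sequential model are illustrative assumptions rather than elements of the patent, and torch.nn forward hooks stand in for whatever interception mechanism an implementation actually uses.

```python
import torch
import torch.nn as nn

class ActivationInterceptor:
    """Collects intermediate-layer activations for each query via forward hooks.
    (Hypothetical helper; not taken from the patent.)"""

    def __init__(self, model: nn.Module, layer_names: list[str]):
        self.model = model
        self.activations: dict[str, torch.Tensor] = {}
        self._handles = []
        modules = dict(model.named_modules())
        for name in layer_names:
            handle = modules[name].register_forward_hook(self._make_hook(name))
            self._handles.append(handle)

    def _make_hook(self, name: str):
        def hook(module, inputs, output):
            # Detach so the stored activations do not retain the autograd graph.
            self.activations[name] = output.detach()
        return hook

    def remove(self):
        for handle in self._handles:
            handle.remove()

# Usage: run a batch of queries and read off the hooked intermediate activations.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
interceptor = ActivationInterceptor(model, layer_names=["1"])  # hook the ReLU layer
queries = torch.randn(8, 16)            # 8 queries, 16 features each
logits = model(queries)                 # inference pass fills interceptor.activations
hidden = interceptor.activations["1"]   # shape (8, 32): intercepted activations
```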
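
The two metrics of the distance-based similarity analysis can be sketched similarly. The claim does not fix a particular distance function, so the sketch below assumes Euclidean distance between activation centroids; the function names, the flagging rule, and the threshold values are all hypothetical, and the returned dictionary merely stands in for the "data characterizing the determining" provided to a consuming application or process.

```python
import numpy as np

def overall_divergence(intercepted: np.ndarray, reference: np.ndarray) -> float:
    """First metric: overall divergence between intercepted and reference
    activations, here the Euclidean distance between the two activation
    centroids (an assumption; the claim does not specify a distance)."""
    return float(np.linalg.norm(intercepted.mean(axis=0) - reference.mean(axis=0)))

def per_class_distances(intercepted, predicted_classes, reference, reference_classes):
    """Second metric: group queries by predicted class and measure each group's
    distance to the reference activations of the same class."""
    distances = {}
    for cls in np.unique(predicted_classes):
        group = intercepted[predicted_classes == cls]
        ref_group = reference[reference_classes == cls]
        if len(ref_group) == 0:
            continue  # no reference activations of this class to compare against
        distances[int(cls)] = float(
            np.linalg.norm(group.mean(axis=0) - ref_group.mean(axis=0))
        )
    return distances

def flag_queries(intercepted, predicted_classes, reference, reference_classes,
                 overall_threshold=5.0, class_threshold=5.0):
    """Combine both metrics into a simple flagging decision. Thresholds are
    illustrative assumptions, not values from the patent."""
    first = overall_divergence(intercepted, reference)
    second = per_class_distances(intercepted, predicted_classes,
                                 reference, reference_classes)
    suspicious = first > overall_threshold or any(
        d > class_threshold for d in second.values()
    )
    return {"overall_divergence": first,
            "per_class_distances": second,
            "suspicious": suspicious}
```

In practice the reference activations would presumably be recorded from known-benign traffic, with the thresholds calibrated against that benign distribution before the detector is deployed.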