US 12,265,896 B2
Systems and methods for detecting prejudice bias in machine-learning models
Jonathan Blake Brannon, Smyrna, GA (US); Ashok Kallarakuzhi, Atlanta, GA (US); Evan Bates, Atlanta, GA (US); Saravanan Pitchaimani, Atlanta, GA (US); and Vivek Srivastava, Atlanta, GA (US)
Assigned to OneTrust, LLC, Atlanta, GA (US)
Filed by OneTrust, LLC, Atlanta, GA (US)
Filed on Oct. 5, 2021, as Appl. No. 17/494,220.
Claims priority of provisional application 63/087,443, filed on Oct. 5, 2020.
Prior Publication US 2022/0108222 A1, Apr. 7, 2022
Int. Cl. G06N 20/00 (2019.01)
CPC G06N 20/00 (2019.01)
20 Claims
 
1. A method comprising:
generating a plurality of outputs by processing a known data set using a machine-learning model, wherein the known data set comprises a plurality of data instances associated with a plurality of sub-categories for each bias category in a plurality of bias categories in proportions to represent each sub-category of the plurality of sub-categories for each bias category in the plurality of bias categories;
generating a plurality of result instances comprising combinations of the plurality of data instances and the plurality of outputs, wherein a result instance of the plurality of result instances comprises a combination of a data instance of the plurality of data instances and a corresponding output of the plurality of outputs generated for the data instance utilizing the machine-learning model;
providing the result instance comprising the combination of the data instance and the corresponding output of the machine-learning model to a classification model;
generating, by computing hardware utilizing the classification model to process the result instance, a prediction of applicability for each sub-category of the plurality of sub-categories for each bias category of the plurality of bias categories for the combination of the data instance and the corresponding output of the machine-learning model in the result instance, wherein:
(a) the classification model comprises an ensemble comprising a multi-label classifier for each bias category of the plurality of bias categories, wherein the plurality of bias categories comprises one or more of religion, sexual orientation, age, ethnicity, gender, location, or political opinions, and
(b) each multi-label classifier is configured to generate the prediction of applicability by generating a probability that each sub-category of the plurality of sub-categories for a corresponding bias category of the plurality of bias categories applies to the combination of the data instance and the corresponding output of the machine-learning model;
determining, by the computing hardware and according to a plurality of predictions generated using the classification model for the plurality of sub-categories of the plurality of bias categories, that a particular sub-category of the plurality of sub-categories for a particular bias category of the plurality of bias categories is applicable to a proportion of the plurality of result instances, wherein a prediction value of the prediction of applicability for the particular sub-category for each applicable result instance found in the proportion of the plurality of result instances is at least a threshold prediction value;
comparing the proportion of the plurality of result instances in the particular sub-category of the particular bias category to a threshold percentage of a data set representing the particular bias category;
determining, by the computing hardware, that the machine-learning model has a prejudice bias with respect to the particular bias category in response to determining that the proportion of the plurality of result instances in the particular sub-category satisfies the threshold percentage of the data set representing the particular bias category; and
causing, by the computing hardware, a computing system to:
generate a modified data set by adding data instances for one or more of the plurality of bias categories based on the prejudice bias to the known data set; and
re-train the machine-learning model using the modified data set to operate in a less biased manner for an artificial intelligence application.
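
The detection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. All function and parameter names here (build_result_instances, applicable_proportions, detect_prejudice_bias, augment_known_data, extra_pool) are hypothetical. Each per-category multi-label classifier is assumed to be a callable that returns a probability for every sub-category. The claim does not state what "satisfies the threshold percentage" means, so the sketch assumes it means "meets or exceeds". Re-training is left out because it depends on the model being tested.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

# A result instance pairs a data instance with the model output generated for it.
ResultInstance = Tuple[Any, Any]
# Assumption: one multi-label classifier per bias category, returning
# {sub_category: probability that it applies to the result instance}.
SubCategoryClassifier = Callable[[ResultInstance], Dict[str, float]]


def build_result_instances(model: Callable[[Any], Any],
                           known_data: List[Any]) -> List[ResultInstance]:
    """Run the model under test on the known data set and pair inputs with outputs."""
    return [(instance, model(instance)) for instance in known_data]


def applicable_proportions(results: List[ResultInstance],
                           ensemble: Dict[str, SubCategoryClassifier],
                           threshold_prediction: float) -> Dict[str, Dict[str, float]]:
    """For each bias category, compute the share of result instances whose
    predicted probability for a sub-category is at least threshold_prediction."""
    if not results:
        return {category: {} for category in ensemble}
    counts = {category: defaultdict(int) for category in ensemble}
    for result in results:
        for category, classifier in ensemble.items():
            for sub_category, probability in classifier(result).items():
                if probability >= threshold_prediction:
                    counts[category][sub_category] += 1
    total = len(results)
    return {category: {sub: n / total for sub, n in subs.items()}
            for category, subs in counts.items()}


def detect_prejudice_bias(proportions: Dict[str, Dict[str, float]],
                          threshold_percentage: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Flag sub-categories whose proportion meets or exceeds the threshold
    for their bias category (assumed reading of "satisfies")."""
    flagged = {}
    for category, subs in proportions.items():
        hits = {sub: p for sub, p in subs.items()
                if p >= threshold_percentage[category]}
        if hits:
            flagged[category] = hits
    return flagged


def augment_known_data(known_data: List[Any],
                       flagged: Dict[str, Dict[str, float]],
                       extra_pool: Dict[str, List[Any]]) -> List[Any]:
    """Return a modified data set with extra instances added for each flagged
    bias category. extra_pool is a hypothetical source of additional data."""
    modified = list(known_data)
    for category in flagged:
        modified.extend(extra_pool.get(category, []))
    return modified
```

In this sketch the two thresholds stay separate, as in the claim: threshold_prediction decides whether a sub-category applies to a single result instance, and threshold_percentage decides whether the resulting proportion indicates a prejudice bias for the category. The modified data set returned by augment_known_data would then be used to re-train the model.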