A review of fairness in machine learning
Machine learning models should ideally be perceived more as righteous superheroes than as unjust villains. But how do we ensure that our models are classified and represented as such? Or rather, how do we ensure that models are fair and just in their outcomes when they have an impact on humans? As in every other superhero movie, the goal is to save humanity. This paper gives a broad introduction to the concepts, aspects and challenges of fair machine learning, and discusses results from a small survey on people’s perception and comprehension of algorithmic fairness. We use 'superheroes' as an illustrative analogue for fair models. Through interactive illustrations we invite the reader to take a stand and check their comprehension of fairness in algorithms.
... to this new universe of fairness. Data-driven models, such as machine learning (ML) models, are increasingly applied in society
… of this paper will revolve around the question of what a fair model is by presenting an illustrative introduction to fairness in machine learning. The first scenes in the prologue have already introduced the theme, and from here, the storyline will take us through why bias occurs, how fairness can be defined and bias mitigated, all the way to perceived algorithmic fairness. The main contributions of this interactive article are twofold. First, it summarises the current state of fairness in machine learning through a literature review presented as an interactive paper that encourages people to think about the issues and challenges of algorithmic fairness. Second, a survey examines people’s comprehension, self-reported understanding and perception of different fairness criteria in a superhero setting. It further touches upon how different formulations impact people’s opinions.
We will focus on explaining the following three mathematical fairness criteria: 1) Demographic Parity (DP), 2) Equalized Opportunity (EO) and 3) Predictive Parity (PP). These criteria are also used in the survey’s examination of both comprehension and perception of fairness. Questions from the survey will be presented to the reader throughout the paper. The survey is outlined further in Appendix A; the infobox below summarises it briefly and introduces the interactive questions in the paper:
We invite you to take a stand on algorithmic fairness during the read by answering the questions. You can start with the following:
The survey, like this paper, uses a setting about super figures as an example. We chose this fictional setting to provide a story that is easy to relate to. At the same time, it avoids a real or realistic case towards which people might already be biased, so that the focus remains on algorithmic fairness in general rather than on the specifics of a particular case. The setting is formulated along the lines of a set of super figures, who can be either heroes or villains, all wanting to go to a party. They are assessed at the door, e.g. by an algorithm, since only the figures who are “believed” to be heroes are allowed in. The focal question is how to ensure that both the male and the female figures are treated fairly, as well as what fairness means in this setting.
Data for an example case was extracted from the Kaggle superhero dataset and used in the paper for illustrative purposes. A simple classification model was trained on the dataset to distinguish between villains and heroes, using features such as superpowers, names, height and weight. Read more about pre-processing, feature selection and model training in Appendix B.
From here on, the paper is divided into four main sections. The first section focuses on understanding the notion of bias, the second section describes definitions of fairness and methods to mitigate bias. The third section discusses the perception of fairness and highlights results from the survey. The last section sums up the remaining challenges and concludes with the main points of this paper.
… which gives identity. Kate Crawford presented in
… can be described as an allocation problem. How resources and opportunities are allocated can be skewed and cause harm to a group, e.g. in loan applications
… of the bias notion would be to look not at the potential harm but at the source of its occurrence. Mitchell et al.
Both dimensions of bias, characterised either by the impact or by the source of occurrence, are reflected in Olteanu et al.'s examination of social data
… the data, the algorithm, and the user interactions
An undesired skew can also be introduced into the model by the choice of the algorithm or evaluation scheme
… by stating what we mean by a just and fair model. Fairness is defined in the Cambridge Dictionary as the quality of treating people equally or in a way that is right or reasonable. What it means to treat people right, and how this can be operationalised with respect to algorithmic decision making, is what the fairness literature is trying to determine. However, this is no easy question to answer. Looking back at the "CEO" image search example, it is not clear what a righteous result should yield. Kate Crawford, for example, raised the question of whether the proportion of male and female images (and of images from different ethnicities) should resemble the current statistics of people working as "CEO", or whether it should be what people think is the right proportion. After all, image search is contributing to shaping our perception of reality
The other case from the introduction about criminal risk assessment is heavily used as an example in the literature, e.g.
These two examples illustrate the challenge in defining and deciding what a fair model is. Clear definitions of bias are essential to be able to spot and identify biases in the first place; without them, it is not clear what to look for when examining systems for potential biases. Especially in the cases of representational harm, it can be hard to discover undesired model behaviour and to define more righteous behaviour. But one example where several attempts have been made is the identification and removal of stereotypes in word embeddings, e.g.
Regarding outcome fairness, a lot of work has been conducted on formalising fairness mathematically
… of mathematical fairness. Mathematical fairness criteria with respect to classification problems, where a certain treatment or outcome is the desired output, can be grouped together in different ways. In the following box we will highlight five ideas on formalising algorithmic fairness partly based on
… is a great movie, and like Batman, every super figure needs to learn to master their powers in the beginning. We will therefore start by understanding the best-known statistical measures of fairness (Demographic Parity, Equalized Opportunity and Predictive Parity), before we discuss the advantages and disadvantages of the statistical approach.
As described in the box above, statistical measures are derived from the so-called confusion matrix. The confusion matrix is a way to sort all examples in a dataset according to their true and predicted class in a binary classification setting. The following table shows a confusion matrix and its components, and describes the terminology needed for the definition of the specific statistical fairness criteria:
To define different statistical fairness criteria, we will use the following terminology:
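In brief, for each group $g$ (here: the male or the female figures) we write $TP_g$, $FP_g$, $TN_g$ and $FN_g$ for the number of true positives, false positives, true negatives and false negatives, and $n_g = TP_g + FP_g + TN_g + FN_g$ for the size of the group. One derived quantity used repeatedly below is the True Positive Rate,

$$\mathrm{TPR}_g = \frac{TP_g}{TP_g + FN_g},$$

the fraction of actual heroes in group $g$ that the model lets into the party.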
Demographic Parity (DP), also called group fairness or statistical parity
It can also be thought of as achieving the same Positive Rate (PR) for each group. The PR is calculated as the number of positive predictions divided by the total number of members in each group:
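$$\mathrm{PR}_g = \frac{TP_g + FP_g}{n_g}$$

Demographic Parity then asks that these rates coincide across groups, i.e. $\mathrm{PR}_{\text{female}} = \mathrm{PR}_{\text{male}}$ in the notation introduced above.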
In our super figure example this means that the fraction of females and the fraction of males getting accepted to the party should be equal or, in practice, similar. In the survey, we investigated people’s comprehension of this fairness criterion. You can check your understanding as well with the following questions:
The last question points to the critique of Demographic Parity by Hardt et al.
In our example, where we trained a classifier to distinguish “hero” or “villain”, we can, on the evaluation set, look at the difference in the Positive Rate for males and females for different threshold values. Try to achieve parity for the Positive Rate by choosing different threshold values in the figure below:
Equalized Opportunity (EO) and Equalized Odds are proposed by Hardt et al.
Relaxing the formulation to
In our super figure example, this means that the chance of getting accepted to the party when you, in fact, are a hero should be the same for both males and females. You can verify your own understanding of Equalized Opportunity with the following questions:
Looking at our superhero classifier's performance on the evaluation set, we can experiment with satisfying the parity of the True Positive Rate by setting different threshold values for the two groups. In
Predictive Parity (PP), also described as outcome test, is a statistical measure which requires that the probability of a correct prediction of the positive class is the same for all groups
It can also be thought of as achieving the same Positive Predicted Value (PPV) between the groups. The PPV is calculated per group as the number of correct positive predictions divided by the total number of predicted positive members in a group:
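$$\mathrm{PPV}_g = \frac{TP_g}{TP_g + FP_g}$$

Predictive Parity then asks for $\mathrm{PPV}_{\text{female}} = \mathrm{PPV}_{\text{male}}$, i.e. that a figure let into the party is equally likely to actually be a hero regardless of group.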
For our super figure example, this means that the chance of a correct prediction for figures allowed into the party should be the same for both males and females. Once again, you can try verifying your understanding and reflect on Predictive Parity with the following questions:
Looking at our super figure classifier's performance on the evaluation set, we can experiment with satisfying the parity of the Positive Predictive Value by setting different threshold values for the two groups and see how this affects both Demographic Parity and Equalized Opportunity:
… is not going to be easy. In the previous section, we saw that it is not possible to simultaneously achieve Demographic Parity, Equalized Opportunity and Predictive Parity in the super figure example, unless we let almost no one into the party. This is not a special case. It has been proven that, except in trivial cases, many well-known notions of fairness are incompatible with each other and furthermore conflict with optimising accuracy
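One way to see the tension is a short, standard calculation: within each group, the Positive Predictive Value, True Positive Rate and False Positive Rate are tied together by that group's base rate $p_g$ (the fraction of actual heroes),

$$\mathrm{FPR}_g = \frac{p_g}{1-p_g} \cdot \frac{1-\mathrm{PPV}_g}{\mathrm{PPV}_g} \cdot \mathrm{TPR}_g .$$

If the base rates differ between the groups, then equalising both the True Positive Rates and the Positive Predictive Values forces the False Positive Rates apart, so parity in one measure generally has to be traded against parity in another unless the base rates happen to be equal or the classifier is perfect.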
Others criticise that the statistical approach only gives guarantees for a group on average and not for individuals or even subgroups
Another criticism is the stationary property of statistical measures
Yet another critique comes from
Despite their caveats, the strength of statistical measures of fairness lies in how easy they are to compute and, if applicable, to achieve, e.g. by adjusting the classifier’s thresholds. Neither the verification nor the adjustment requires assumptions about the data, in contrast to, for example, individual and counterfactual fairness. However, choosing and interpreting any disparity metric requires an understanding of how the bias occurs and a discussion of whether adjusting the model to meet the parity is the desired outcome.
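To make the "easy to achieve" point concrete, the following is a minimal sketch of how group-specific thresholds can be scanned to close the gap in one of the three rates. The function names and the random stand-in data are illustrative; this is not the implementation behind the interactive figures.

```python
import numpy as np

def group_rates(y_true, y_score, in_group, threshold):
    """Positive Rate, True Positive Rate and Positive Predictive Value for one group."""
    y_true, y_score = y_true[in_group], y_score[in_group]
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    pr = y_pred.mean()                             # (TP + FP) / n
    tpr = tp / (tp + fn) if tp + fn else 0.0       # TP / (TP + FN)
    ppv = tp / (tp + fp) if tp + fp else 0.0       # TP / (TP + FP)
    return pr, tpr, ppv

def pick_threshold(y_true, y_score, in_group, target, metric):
    """Scan thresholds for one group so that its rate (0=PR, 1=TPR, 2=PPV) gets close to `target`."""
    candidates = np.linspace(0.0, 1.0, 101)
    gaps = [abs(group_rates(y_true, y_score, in_group, t)[metric] - target) for t in candidates]
    return candidates[int(np.argmin(gaps))]

# Illustrative usage with random data standing in for the superhero classifier's scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)                        # 1 = hero, 0 = villain
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(200), 0, 1)
is_female = rng.integers(0, 2, 200).astype(bool)

pr_male, _, _ = group_rates(y_true, y_score, ~is_female, 0.5)
t_female = pick_threshold(y_true, y_score, is_female, pr_male, metric=0)
print(f"Female threshold that (approximately) matches the male Positive Rate: {t_female:.2f}")
```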
We will return to the three statistical metrics in the section Perceived Algorithmic Fairness to examine how people perceive them and investigate people’s comprehension based on the survey results. But first, let us look at methods for mitigating biases.
… to defend it with all our superhero powers. Until now this paper has listed different ways of defining fairness, with a particular focus on statistical measures in the form of three parity metrics. But now it is time to show the weapons for mitigating and monitoring biases. In the interactive diagram below, we illustrate the life-cycle of a machine learning system with its different phases and highlight methods to intervene or mitigate biases, which are revealed when clicking on a phase. A similar approach was presented by Suresh and Guttag, who assign six types of biases and their mitigation approaches to different phases in the ML life-cycle
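As one concrete example of a pre-processing intervention, training examples can be reweighted so that the label and the protected attribute look statistically independent, in the style of Kamiran and Calders' reweighing. The sketch below is illustrative and not taken from the article's interactive diagram.

```python
import numpy as np

def reweighing_weights(y, a):
    """Sample weights that make the label y and the protected attribute a look
    statistically independent: each (a, y) cell gets weight P(a) * P(y) / P(a, y)."""
    y, a = np.asarray(y), np.asarray(a)
    w = np.ones(len(y), dtype=float)
    for a_val in np.unique(a):
        for y_val in np.unique(y):
            cell = (a == a_val) & (y == y_val)
            if cell.any():
                w[cell] = (a == a_val).mean() * (y == y_val).mean() / cell.mean()
    return w

# The weights can then be passed to most scikit-learn estimators, e.g.:
# model.fit(X_train, y_train, sample_weight=reweighing_weights(y_train, gender_train))
```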
… or in other words, which is your favourite fairness criterion?
The question of which statistical fairness measure people perceive as more fair has previously been addressed in the literature, e.g.
To measure the participants’ comprehension of each criterion, a score is computed as the percentage of correct answers to the comprehension questions (four questions per criterion). This yields an indication of the participants' understanding of the fairness criterion and it allows for a comparison of the comprehension between the three criteria. Figure 2 shows the distributions of the participants’ comprehension score on each fairness criterion.
The distributions indicate that participants found Demographic Parity easiest to understand. A pairwise Mann-Whitney U test
We also asked the participants to self-report their understanding of each criterion after answering the comprehension questions, using a five-point Likert-scale (see an example in this question). Each point is assigned a numeric value, where a higher value indicates a higher self-reported understanding. The distributions over the self-reported understanding are presented as boxplots in Figure 2. The pairwise comparisons all show a significant difference between self-reported comprehension (DP-EO: p=0.010, DP-PP: p=0.000, EO-PP: p=0.008), meaning that people report different levels of understanding for the three measures. Similar to the computed comprehension scores, the self-reported comprehension has the following order: 1.) Demographic Parity, 2.) Equalized Opportunity and 3.) Predictive Parity.
We examined whether there is an association between the average self-reported and the average computed comprehension score. A calculation of the Spearman correlation
After the comprehension questions of each criterion, the participants were asked to state how fair they perceive the criterion on a five-point Likert-scale (see an example of the formulation in this question), i.e. they were asked to evaluate the fairness of each criterion independently of the other criteria. The Likert-scale is transformed to numeric values, and the boxplots in Figure 3 show the different distributions. The fairness assessment resulted in the following ranking: 1.) Equalized Opportunity, 2.) Predictive Parity, 3.) Demographic Parity.
In the survey conducted in
For Demographic Parity we can see a significant, negative correlation between comprehension and perception, i.e. the better people understand the criterion, the less fair they perceive it to be (see Table 2). This is in accordance with the results from Saha et al.
… and so it seems to be the case with algorithmic fairness. Instead of only looking at which statistical measure is perceived as most fair, we need to broaden our understanding of the fairness universe. As we discussed earlier, bias can be introduced when interacting with the system, and hence the interaction with the system can in turn also affect the perceived fairness of the system. In general, there seems to be more to fairness than mathematical formulations, e.g.
Human or machine, what would people prefer to be assessed by in different settings? This question was also raised by Harrison et al., who found a slight preference for human judgement over a machine learning model
The answers show that, initially, the majority (77%) of participants preferred the option of a “human judgment supported by an algorithm” when it is assumed that both the algorithm and the human act reasonably (see this question). However, this changes with the extra information that occasionally the human is biased against one sex. Under this assumption, the majority (56%) of participants preferred the algorithm, even though it is still, in general, assumed that both are acting reasonably. The third question assumes that the algorithm is much more accurate than the human but also biased. This shifts the preference of the majority (71%) back to preferring “human judgment with algorithmic support”. Our survey consists of a small sample size, and the representativeness is not accounted for. Nevertheless, it is interesting how relatively easily people’s opinions can be changed by adding some weak information about the system. In fact, 54% of the participants changed their choice throughout the three questions. One reason could be the human cognitive bias to weigh weak information to which we have a positive or negative emotional reaction (here the human's or the algorithm's “bias”) more heavily than objective information (here the human's or the algorithm's accuracy)
In addition to people’s preference for human or algorithmic judgement, the survey also asked a more high-level question (see this question) regarding whether it is fair to use “a system that uses data and statistics” in the super figure setting. To this question, 22 out of 92 participants answered a clear “no”. But when asked whether they preferred human or algorithmic judgement, only 6 out of the 22 chose to trust the human alone. This number further decreased to 3 out of 22 when asked this question, where we informed participants that the human is occasionally biased against one sex. Although it can be argued that this does not reflect an inconsistency, depending on how the formulations are understood, it might indicate that people are easily affected by the formulation of the questions.
An actual logical inconsistency is found when people are asked to rank the three fairness criteria discussed previously. First, the participants are asked to rank the three criteria, formulated as rules shown in the answer options of this question, according to their importance for achieving fairness. Immediately thereafter, they are asked to rank what would be the worst case of unfairness that could occur (see the question below). Here, the options are formulated as cases that do not comply with one of the three criteria – all three discriminating against the same sex.
We expect the ranking of the three criteria to be the same for each participant between the two questions. However, 57% of the participants do not keep the same ranking of the three criteria when they are formulated in different ways. We observed no significant difference between the comprehension of participants who rank consistently and those who rank inconsistently (
... we all know the feeling of missing some point in Marvel’s The Infinity Saga, so let us take a recap. Based on the survey results, we can conclude that people prefer the criterion Equalized Opportunity in the super figure setting. But even though this might seem like a concrete answer to the question of fairness, it does not seem to settle it completely. More generally, the survey results point towards people being sensitive to formulations of fairness and unfairness, in some questions even to a degree that demonstrates inconsistency in their answers. In addition, the survey results did not show a statistically significant association between people's self-reported understanding and their comprehension as measured by their answers to the comprehension questions. In summary, the results point to the need to be careful in the debate about fairness, since formulations and misjudged self-understanding may skew it.
… and zooming out of the academic sphere of mathematical formulations. We will now look at the fairness challenges in practice. The fairness literature focuses on static settings, which often do not resemble the challenges faced by practitioners in industry
In general, the problems in practice usually involve dynamic scenarios rather than one-shot classification tasks, such as web search, chatbots and systems that employ online learning, reinforcement learning or bandit algorithms
One idea, proposed for models in a Natural Language Processing setting, is an analogy to traditional “software testing”, or more precisely behavioural testing or black-box testing
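As a minimal sketch of what such a behavioural test could look like in our super figure setting, an invariance test perturbs only the protected attribute, or a proxy for it, and checks that the predictions do not change. The helper and the toy predictor below are illustrative assumptions, not an existing test suite.

```python
import numpy as np

def invariance_test(predict, X, perturb, tolerance=0.0):
    """Black-box invariance test: changing only the protected attribute
    (or a proxy for it) should leave the predictions unchanged."""
    flip_rate = np.mean(predict(X) != predict(perturb(X)))
    return flip_rate <= tolerance, flip_rate

# Illustrative usage with a toy predictor; column 0 plays the role of a gender proxy,
# column 1 could be a feature such as height.
X = np.array([[0, 180.0], [1, 165.0], [0, 150.0]])
predict = lambda X: (X[:, 1] > 160).astype(int)        # ignores the proxy, so the test passes
flip_proxy = lambda X: np.c_[1 - X[:, 0], X[:, 1:]]    # flips only the proxy column
passed, rate = invariance_test(predict, X, flip_proxy)
print(f"Invariance test passed: {passed} (flip rate {rate:.0%})")
```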
Lastly, it is worth mentioning that a general position within the field is that the challenges cannot be solved by technical means alone. Instead, cross-disciplinary collaboration between computer science, law, the social sciences and the humanities is required, which, for example, the ACM Conference on Fairness, Accountability, and Transparency aims to foster. This is also discussed in
… as every movie must come to an end, so must our article. We have provided an overview of different ways to evaluate and define algorithmic fairness and bias, as well as different methods of mitigating bias throughout the ML life-cycle. We then discussed challenges with the currently proposed approaches with respect to people’s comprehension and perception, and to the application of fairness-mitigating methods in practice. As discussed in the previous section, the work on algorithmic fairness is far from done. We therefore expect more work on algorithmic fairness in the future and hope that this article will inspire new research in the field. As with superhero movies, the next one is usually already waiting around the corner.
We are grateful to Marie Rørdam Fenger for believing in and supporting us in this work. We would also like to thank Anders Ringsborg Hansen for his work on the article's interactive questions. For help with the survey in the form of feedback or distribution, we would like to thank Kasper Fænø Bay Noer, Julie Rasmussen, Kristian Tølbøl Sørensen, Mads Schaarup Andersen, Bente Larsen and Lisa Lorentzen. We are indeed also thankful to all the people who took the time to participate in the survey.
This research is partly supported by a performance contract allocated to the Alexandra Institute by the Danish Ministry of Higher Education and Science.
The interactive box listing fairness definitions is inspired by and based on this Distill article.
Both authors contributed equally.
Conducting the survey: The survey was conducted in December 2020 - January 2021, and the participants were volunteers recruited through social media, focusing on accounts/pages with a Computer Science background. The participants were motivated by the fact that they contribute to research on Fair AI and by the possibility of winning a symbolic prize. In total, 92 people participated. The participants were asked for some demographic information, which showed that 40% were male and 60% female, and that 88% were Danish. They were asked to self-report their level of experience in "statistics and/or machine learning" and in "ethics and/or legal philosophy" on a five-point Likert-scale. The participants were more experienced in statistics/machine learning, with a median of 3 ("Moderately experienced"), than in ethics and legal philosophy, with a median of 2 ("Slightly experienced").
The statistical tests: Different statistical tests are reported in the paper. This part of the appendix elaborates on the choice of tests and the assumptions required.
A Mann-Whitney U test
However, the variables fulfil the Mann-Whitney U test assumptions since they are ordinal and independent. We test the null hypothesis H0: the distributions of comprehension scores for two criteria are equal. We choose a significance level of
The paper also reports the Spearman rank-order correlation coefficient
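Both tests are available in scipy; a minimal sketch, with placeholder arrays standing in for the per-participant scores rather than the actual survey data, could look like this:

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

# Placeholder arrays standing in for per-participant survey scores (not the actual data).
scores_dp = np.array([100, 75, 100, 50, 75, 100])   # computed comprehension, Demographic Parity
scores_eo = np.array([75, 50, 75, 25, 100, 50])     # computed comprehension, Equalized Opportunity
self_reported_dp = np.array([5, 3, 4, 2, 5, 3])     # Likert self-reports for Demographic Parity

# Pairwise Mann-Whitney U test on the ordinal, independent comprehension scores.
u_stat, p_value = mannwhitneyu(scores_dp, scores_eo, alternative="two-sided")

# Spearman rank-order correlation between computed and self-reported comprehension.
rho, p_rho = spearmanr(scores_dp, self_reported_dp)

print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_value:.3f}")
print(f"Spearman: rho = {rho:.2f}, p = {p_rho:.3f}")
```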
Details of data preparation and training
A simple classifier is trained to distinguish between hero and villain. The data is from the Kaggle superhero dataset, which is a collection of information about super figures extracted from the Superhero Database. The data contains information about the super figures' names, characteristics and superpowers. Only super figures that are either villain or hero and either male or female are considered in this setup. This yields 613 super figures, which are randomly split into a training set and an evaluation set (30%). The following input features are used in training: height, weight, a one-hot encoding of 167 different powers, and the names of the super figures transformed into count vectors of n-grams of length 3 and 4. In the height and weight variables, negative values are replaced with mean values conditioned on gender. Note that gender is not used as an explicit feature during training, corresponding to fairness through unawareness. In the example, gender is considered a protected attribute. The classifier is trained using the scikit-learn
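For illustration, a sketch of how such a pipeline could look in scikit-learn is given below. The choice of logistic regression, the character-level n-grams and the toy data frame are assumptions made for the sake of a self-contained example and do not reproduce the exact setup; the train/evaluation split is also omitted for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the prepared Kaggle data (the real set has 613 figures and 167 powers).
df = pd.DataFrame({
    "name": ["Batgirl", "Joker", "Thor", "Catwoman"],
    "Height": [170.0, 196.0, 198.0, 175.0],
    "Weight": [57.0, 86.0, 288.0, 61.0],
    "Flight": [0, 0, 1, 0],                           # one binary column per power
    "Alignment": [1, 0, 1, 0],                        # 1 = hero, 0 = villain (the label)
    "Gender": ["Female", "Male", "Male", "Female"],   # protected attribute, kept out of the features
})
power_columns = ["Flight"]

preprocess = ColumnTransformer([
    # Count vectors of n-grams of length 3 and 4 over the figure's name
    # (character-level n-grams are assumed here).
    ("name", CountVectorizer(analyzer="char", ngram_range=(3, 4)), "name"),
    # Height, weight and the one-hot encoded powers are used as-is.
    ("rest", "passthrough", ["Height", "Weight"] + power_columns),
])

model = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),       # assumed estimator
])

X = df.drop(columns=["Alignment", "Gender"])          # unawareness: Gender is not a feature
model.fit(X, df["Alignment"])
scores = model.predict_proba(X)[:, 1]                 # scores that can be thresholded per group
```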