How does the computer become a just superhero?

A review of fairness in machine learning

Machine learning models should ideally be perceived more as righteous superheroes than as unjust villains. But how do we ensure that our models are classified and represented as such? Or rather, how do we ensure that models are fair and just in their outcomes when they have an impact on humans — like in every other superhero movie, the goal is to save humanity. This paper gives a broad introduction to the concepts, aspects and challenges of fair machine learning, and discusses results from a small survey about people’s perception and comprehension of algorithmic fairness. We use 'superheroes' as an illustrative analogue for fair models. Through interactive illustrations, we invite the reader to take a stand and check their comprehension of fairness in algorithms.

Prologue

Please introduce us

... to this new universe of fairness. Data-driven models like machine learning (ML) models are increasingly applied in society See for example the city of Helsinki's and Amsterdam's AI registries, where they show how AI systems are used within their municipalities. within a range of different areas and tasks, from recommender systems and web search , through hiring to criminal courts and also in health care . With the rise of machine learning models, there has been an increased focus on fairness and transparency within the field . Several cases of unfair algorithms have also been debated in the media. One example is a United States criminal risk assessment system, COMPAS, used in courts for parole decisions. The system assigns a risk score of recidivism, i.e. the likelihood that a person will commit a new crime. It was discovered by ProPublica in 2016 that black people who truly would not commit a new crime were more likely to be wrongly assigned a higher risk score than white people, thereby receiving a lesser chance of parole . Another example showed an underrepresentation of women and gender stereotypes in occupations in image search . It also became a news story when the first female figure to appear in a Google Image search on 'CEO' in 2015 was a barbie doll dressed in a suit, appearing after several rows of white males. Other cases debated in the news include racial image cropping on Twitter in 2020 For example shown in The Guardian and BBC News. and gender bias in recruitment See articles on Reuters and The Next Web. These cases are problematic because they reproduce cultural and historical biases present in society, respectively the discrimination of black people and the inequality between genders in the labour market. This point was argued by Kate Crawford in her keynote The Trouble with Bias at NIPS 2017 . With machine learning becoming more prevalent in contexts where it affects people’s lives, we are potentially looking at a systematic and automatic way of reproducing discrimination or favouritism. This is the unjustness our super models need to prevent. Fighting this unjustness has resulted in a large body of research that mathematically formalizes and addresses unfairness, e.g. summarized in . However, notions of fairness change over time and across cultures. So the question (to which likely no absolute answer exists) still remains: What is a fair model?

The plot

… of this paper will revolve around the question of what a fair model is by presenting an illustrative introduction to fairness in machine learning. The first scenes in the prologue have already introduced the theme, and from here, the storyline will take us through why bias occurs, how fairness can be defined and bias mitigated, all the way to perceived algorithmic fairness. The main contributions of this interactive article are twofold. First, it summarises the current state of fairness in machine learning through a literature review presented as an interactive paper that encourages people to think about the issues and challenges of algorithmic fairness. Second, a survey examines people’s comprehension, self-reported understanding and perception of different fairness criteria in a superhero setting. It further touches upon how different formulations impact people’s opinions.

We will focus on explaining the following three mathematical fairness criteria: 1) Demographic Parity (DP), 2) Equalized Opportunity (EO) and 3) Predictive Parity (PP). These criteria are also used in the survey’s examination of both comprehension and perception of fairness. Questions from the survey will be presented to the reader throughout the paper. The survey details are outlined further in Appendix A and the infobox below summarises some brief details about it and introduces the interactive questions in the paper:

We invite you to take a stand on algorithmic fairness during the read by answering the questions. You can start with the following:

The survey, like this paper, uses a setting about super figures as an example. We chose this fictional setting to give a story that is easy to relate to. At the same time, it avoids a true or realistic case towards which people might already be biased, so that the focus can remain on algorithmic fairness in general rather than on the particular case. The setting is formulated along the lines of a set of super figures, who can be either heroes or villains, all wanting to go to a party. They are assessed at the door, e.g. by an algorithm, since only the figures who are “believed” to be heroes are allowed in. The focal question is how to ensure that both the group of male figures and the group of female figures are treated fairly, as well as what fairness means in this setting.

Data for an example case was extracted from the Kaggle superhero dataset and used in the paper for illustrative purposes. A simple classification model was trained on the dataset to distinguish between villains and heroes where features such as superpowers, names, height and weight are used. Read more about pre-processing, feature selection and model training in Appendix B.

From hereon, the paper is divided into four main sections. The first section focuses on understanding the notion of bias, the second section describes definitions of fairness and methods to mitigate bias. The third section discusses the perception of fairness and highlights results from the survey. The last section sums up the remaining challenges and concludes the main points of this paper.

The notion of bias

It is the suit

… which gives identity. Kate Crawford presented in her keynote a division of the notion of bias into the harm of allocation and the harm of representation. Examining why representation bias can be harmful invites us to think of the ‘CEO’ image search example – and the important role the suit has for a super figure. How people are presented as groups and individuals has an impact on how they are perceived by themselves and others, which in turn links to their identity . Therefore, it does matter when no images of female CEOs are represented in an image search. Other examples of representational bias are: 1) facial recognition systems performing worse on faces with darker skin tones , 2) gender stereotypes in occupations in word embeddings , 3) Google Photos labeling two black people as ‘gorillas’ For example reported by News.com.au and The Verge. . Harm of representation can also occur in recommender systems, e.g. not showing ads for a new superhero film to profiles with a specific gender, even though gender is not a factor in whether one would like the movie or not. The issue is, therefore, not only that the performance of a system differs between groups; it is also how it fails that should raise concern.

This is to illustrate that the suit and the representation of humans matter. The CEO barbie was the first female figure to appear in a Google Image search on ‘CEO’ in 2015. Image credit: The Onion . The barbie dressed as Supergirl, image credit: Amazon .

The division of powers

… can be denoted as an allocation problem. How resources and opportunities are allocated can be skewed and cause harm to a group, e.g. in loan applications or parole decisions . In an allocation problem, there is often one beneficial outcome, which we will denote as positive, versus a disadvantaged outcome denoted as the negative outcome. The focus is on different groups, which traditionally have been defined by gender or race, and the issue is to protect the members of a (minority) group from discrimination or unfair treatment. We denote the attributes that define such group membership as protected attributes. A group subject to disadvantageous treatment is denoted the unprivileged group. When using data and statistics for decision-making, there can be anti-discrimination laws a system must comply with, which for example could require that no protected attributes are used in the system . In our super figure example, this corresponds to not fitting the classifier on gender, i.e. excluding gender from the set of input features. In the literature, this is described as fairness through unawareness . However, it often does not solve the bias problem, since the information of a protected attribute can be latently present in other variables, leading to indirect discrimination . In our example, imagine that gender is correlated with other variables such as weight and height, i.e. given the weight and height it is possible to infer the gender. Also legally, systems that unintentionally or intentionally use a proxy for the protected variable can be prohibited, for example under what is denoted indirect discrimination in EU law and disparate impact in US law .
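To make the proxy problem concrete, the following minimal sketch (using entirely synthetic data and illustrative variable names, not the paper's dataset) checks how well a protected attribute can be recovered from the remaining features; a high accuracy suggests that simply dropping the attribute from the inputs does not remove its information:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic example: height and weight are correlated with gender (the protected attribute).
rng = np.random.default_rng(0)
gender = rng.integers(0, 2, size=500)       # 0 = female, 1 = male (illustrative only)
height = rng.normal(165 + 15 * gender, 8)   # taller on average when gender == 1
weight = rng.normal(60 + 20 * gender, 10)
X = np.column_stack([height, weight])

# If gender can be predicted from height and weight, "fairness through unawareness"
# (excluding the gender column) leaves a proxy for gender in the model's inputs.
proxy_accuracy = cross_val_score(LogisticRegression(), X, gender, cv=5).mean()
print(f"Accuracy of predicting gender from height and weight: {proxy_accuracy:.2f}")
```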

Another dimension

… of the bias notion would be to look not at the potential harm but at the source of its occurrence. Mitchell et al. define the terms statistical bias and societal bias. The first term, also called representation and aggregation bias in , refers to errors in the collection of data, the analysis and the model development such that reality is not represented properly. In our setting, we could for example accidentally focus only on super figures from DC Comics and thereby not represent the whole range of super figures in the comic universe. Societal bias or historical bias refers to social structures that are, in fact, represented by the data but are perceived as unfair. This is an important distinction, because discrimination against a group might not be an issue in the model but instead caused by an underlying structure in society. However, it could also be the case that societal bias manifests itself in the model in the form of stereotyping . The issue is well explained in a blog post by Valerie Carey through a “money lending” scenario, where the system is accused of discriminating against females. For the sake of the example, take it as a fact that women on average have a lower income than men due to a societal bias caused by discrimination in education or the workplace. The question is whether the model itself is discriminating against women when it gives them fewer or smaller loans. An important distinction is whether the model is fitting on e.g. income – which would be fair, since it is reasonable to assume that income holds information about the ability to pay back a loan – or if the model instead is fitting on some proxy for gender, thereby learning that females, in general, have lower income. The latter results in unfair treatment (stereotyping), since the fact of being female affects the chance of getting a loan independent of the income.

Both dimensions of bias, characterised either by the impact or by the occurrence, find coherence in Olteanu et al.'s examination of social data . Here, the authors present a diagram that represents the sources of occurrence, their manifestations and the resulting issues. They divide bias into many subcategories (6 main categories), as also seen in (23 types) and (6 types). We have seen that there can be many different types of biases in algorithmic decision making, but how do they occur? We will look into this in the next section.

The villains are

… the data, the algorithm, and the user interactions . Data is usually the first suspect when bias is introduced into a model, but all three influence each other. To list some of the “offences” by data: 1) a skew in the data collection , 2) missingness in attributes, 3) a skew in human annotations or 4) an imbalance in examples. To stay in the super figure setting, one could imagine a case where the police historically have examined more male figures than female ones and therefore have found more villains among men. This will then be reinforced over time, as men will be more likely to be suspected of being villains than women. Missingness could occur in self-reported attributes that are missing for a reason related to a protected attribute, e.g. if one gender more often prefers not to state their weight. Discrimination can also be induced into the data by human annotations. For example, if the annotators prefer Marvel over DC Comics, their judgement when annotating might be biased. Unbalanced training data can also introduce a skew into a model. In our super figure toy data from Kaggle, we have an unbalanced dataset with fewer females than males (Figure 1), making it potentially harder for a model to generalise well on the classification of female villains.

An undesired skew can also be introduced into the model by the choice of the algorithm or evaluation scheme : imagine that one type of decision boundary is more suited to one group than to the other. Last but not least, how the model is set into production and how users interact with it can in itself be a source of bias, e.g. when the model's suggestions are followed in a skewed way . Knowing now that humans, models, and the way models are used can all be biased, we invite you in the question below to think about how you would prefer to be assessed if you were a super figure wanting to go to the party. We will return to this question in the section Perceived Algorithmic Fairness. But first, we need more formal definitions of what it means for a model to be fair, as definitions are essential to be able to identify unfairness in a systematic way. The next section, therefore, presents different ways of defining fairness and methods to achieve it.

Defining fairness

Joining the Justice League

… by stating what we mean by a just and fair model. Fairness is defined in the Cambridge Dictionary as the quality of treating people equally or in a way that is right or reasonable. What it means to treat people right, and how this can be operationalised with respect to algorithmic decision making, is what the fairness literature is trying to determine. However, this is no easy question to answer. Looking back at the "CEO" image search example, it is not clear what a righteous result should look like. Kate Crawford, for example, raised the question whether the proportion of male and female images (and images from different ethnicities) should resemble the current statistics of people working as "CEO", or whether it should be what people think is the right proportion. After all, image search contributes to shaping our perception of reality .

The other case from the introduction, about criminal risk assessment, is heavily used as an example in the literature, e.g. . It is an instance of outcome fairness, where the goal is to determine how a desired outcome should be righteously distributed among groups or individuals. For example, Angwin et al. found the system to be biased against black people, since the rate of people who would not reoffend but wrongly received a high risk score was higher in that group. However, in the response from the developers of COMPAS it was concluded that the system is not biased against black people, since the Predictive Parity is the same between the two groups , i.e. the probability of recidivating given a high risk score is the same in both groups. The problem here is that these two measures are not necessarily compatible, which opens a discussion of which of the two is most fair.

These two examples illustrate the challenge in defining and deciding what a fair model is. Clear definitions of bias are essential to be able to spot and identify biases in the first place — without them, it is not clear what to look for when examining systems for potential biases. Especially in cases of representational harm, it can be hard to discover undesired model behaviour and to define a more righteous behaviour. One example where several attempts have been made is the identification and removal of stereotypes in word embeddings, e.g. . In , the authors compared occupational stereotypes in word embeddings with human perception of occupational stereotypes through an Amazon Mechanical Turk survey and made an attempt to mitigate the model’s bias accordingly. However, it is not feasible to perform a perception survey for every model, so the method does not scale in practice. Furthermore, it does not remain robust across time and cultural change, as perceptions may change.

Regarding outcome fairness, a lot of work has been conducted on formalising fairness mathematically . In the literature, over 70 different metrics have been defined; see e.g. IBM’s AI Fairness 360 library. In the following, we will look into mathematical fairness criteria for allocation problems and their trade-offs.

Listing the superpowers

… of mathematical fairness. Mathematical fairness criteria with respect to classification problems, where a certain treatment or outcomes are the desired output, can be grouped together in different ways. In the following box we will highlight five ideas on formalizing algorithmic fairness partly based on :

Batman begins

… is a great movie, and like Batman, every super figure needs to learn to master their powers in the beginning. We will, therefore, start by understanding the best-known statistical measures of fairness (Demographic Parity, Equalized Opportunity and Predictive Parity) before we discuss the advantages and disadvantages of the statistical approach.

As described in the box above, statistical measures are derived from the so called confusion matrix. The confusion matrix is a way to sort all examples in a dataset according to their true and predicted class in a binary classification setting. The following table shows a confusion matrix and its components, and describes the terminology needed for the definition of the specific statistical fairness criteria:

Table 1: Confusion matrix for a binary classification problem with a positive (typically defined as 1) and a negative (typically defined as 0) outcome. Hover or click on a particular cell to get a more detailed explanation of the corresponding term.
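For readers who prefer code, a quick way to obtain the four counts of the confusion matrix (here on toy labels, not the paper's data) is scikit-learn's confusion_matrix; this is a minimal sketch:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (1 = hero / let in, 0 = villain / turned away), illustrative only.
y_true = np.array([1, 1, 0, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])

# With labels=[0, 1], scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```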

To define different statistical fairness criteria, we will use the following terminology:

Demographic Parity

Demographic Parity (DP) also called group fairness or statistical parity incorporates the idea that non-discrimination between groups is achieved when the chance of getting a positive outcome is equalized across groups. Given a binary classification task and two groups this can mathematically be formalized as:

P(\hat{Y}=1|G=0)=P(\hat{Y}=1|G=1).

It can also be thought of as achieving the same Positive Rate (PR) for each group. The PR is calculated as the number of positive predictions divided by the total number of members in each group:

PR = \frac{TP+FP}{TP+FP+FN+TN} .
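As a minimal sketch of how this can be checked in code (the arrays and group encoding are illustrative, not taken from the paper's dataset), the Positive Rate per group and the resulting Demographic Parity gap can be computed as:

```python
import numpy as np

def positive_rate(y_pred, group, g):
    """Fraction of members of group g who receive the positive prediction (1 = let into the party)."""
    return y_pred[group == g].mean()

# Toy predictions and group labels (0 = female, 1 = male), purely for illustration.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

dp_gap = positive_rate(y_pred, group, 0) - positive_rate(y_pred, group, 1)
print(f"Demographic Parity gap (PR_female - PR_male): {dp_gap:.2f}")
```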

In our super figure example this means that the fraction of females and the fraction of males getting accepted to the party should be equalised, or in practice, be similar. In the survey, we investigated people’s comprehension of this fairness criterion. You can check your understanding as well with the following questions:

The last question points to the critique of Demographic Parity by Hardt et al. , where it is pointed out that the metric does not account for a case where the true outcome is correlated with the protected group. Such a case could force us to accept non-qualified figures in one group or dismiss qualified figures in the other group in order to achieve Demographic Parity. The criterion can, however, be justified if we are looking at a societal bias we actively want to change. Imagine a case of young super figures getting admitted into an academy. Furthermore, assume that young male figures are less qualified than females. However, we want to enforce a policy of an even acceptance rate, i.e. adjust for this skewness, since we believe it is a structural bias in society. Note that if the males are less qualified due to societal bias, it is not a good solution only to adjust the acceptance rate. There is also a need to improve the male figures' lesser qualifications after acceptance so they match the skills that are truly needed. This could be done, for example, by providing tailored introduction classes. Further discussions on the moral justification of and arguments in favour of Demographic Parity can be found in Hertweck et al. and Räz respectively.

In our example, where we trained a classifier to distinguish “hero” from “villain”, we can, on the evaluation set, look at the difference in the Positive Rate for males and females for different threshold values. Try to achieve parity of the Positive Rate by choosing different threshold values in the figure below:

Equalized Opportunity

Equalized Opportunity (EO) and Equalized Odds were proposed by Hardt et al. as alternatives to Demographic Parity. The idea behind Equalized Opportunity is that people who in fact should receive a positive outcome have an equal chance of receiving it, independent of their group membership. It is a relaxation of the Equalized Odds criterion, which can be formalized for a binary predictor and in the case of binary group membership as,

P(\hat{Y} =1|Y=y,G=0)=P(\hat{Y}=1 |Y=y,G=1).

Relaxing the formulation to y = 1, where a value of 1 represents the positive class, defines the criterion of Equalized Opportunity. This can also be formulated as requiring an equal True Positive Rate (TPR) between the groups. The TPR is calculated as the number of correct positive predictions divided by the total number of actual positive members in the group:

TPR = \frac{TP}{TP+FN} .
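A corresponding sketch for Equalized Opportunity (again with illustrative toy arrays, not the paper's data) compares the True Positive Rate between the two groups:

```python
import numpy as np

def true_positive_rate(y_true, y_pred, group, g):
    """TP / (TP + FN) within group g: the chance that an actual hero is let in."""
    actual_heroes = (group == g) & (y_true == 1)
    return y_pred[actual_heroes].mean()

# Toy labels, predictions and groups (0 = female, 1 = male), purely for illustration.
y_true = np.array([1, 1, 0, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

eo_gap = true_positive_rate(y_true, y_pred, group, 0) - true_positive_rate(y_true, y_pred, group, 1)
print(f"Equalized Opportunity gap (TPR_female - TPR_male): {eo_gap:.2f}")
```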

In our super figure example, this means that the chance of getting accepted to the party when you, in fact, are a hero should be the same for both males and females. You can verify your own understanding of Equalized Opportunity with the following questions:

Looking at our superhero classifier’s performance on the evaluation set, we can experiment with satisfying parity of the True Positive Rate by setting different threshold values for the two groups. In , this procedure is formalized as an optimisation problem. This approach is an example of a post-processing algorithm that achieves fairness by adjusting the output labels of an already trained model. Try to achieve Equalized Opportunity in the following figure, and you will see that you can achieve it without simultaneously achieving Demographic Parity:

Predictive Parity

Predictive Parity (PP), also described as the outcome test, is a statistical measure which requires that the probability of a correct prediction of the positive class is the same for all groups . For a binary predictor it can be defined as

P(Y =1|\hat{Y} =1,G=0)=P(Y=1 |\hat{Y}=1,G=1).

It can also be thought of as achieving the same Positive Predicted Value (PPV) between the groups. The PPV is calculated per group as the number of correct positive predictions divided by the total number of predicted positive members in a group:

PPV = \frac{TP}{TP+FP} .
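And a matching sketch for Predictive Parity (same illustrative toy arrays as above) compares the Positive Predictive Value between the groups:

```python
import numpy as np

def positive_predictive_value(y_true, y_pred, group, g):
    """TP / (TP + FP) within group g: the chance that a figure who is let in truly is a hero."""
    predicted_heroes = (group == g) & (y_pred == 1)
    return y_true[predicted_heroes].mean()

# Toy labels, predictions and groups (0 = female, 1 = male), purely for illustration.
y_true = np.array([1, 1, 0, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

pp_gap = positive_predictive_value(y_true, y_pred, group, 0) - positive_predictive_value(y_true, y_pred, group, 1)
print(f"Predictive Parity gap (PPV_female - PPV_male): {pp_gap:.2f}")
```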

For our super figure examples, this means that the chance of a correct prediction for figures allowed into the party should be the same for both males and females. Once again, you can try verifying your understanding and reflect on Predictive Parity with the following questions:

Looking at our super figure classifier's performance on the evaluation set, we can experiment with satisfying the parity of the Positive Predictive Value by setting different threshold values for the two groups and see how this affects both Demographic Parity and Equalized Opportunity:

The hero's path

… is not going to be easy. In the previous section, we saw that it is not possible to simultaneously achieve Demographic Parity, Equalized Opportunity and Predictive Parity in the super figure example, unless we let almost no one into the party. This is not a special case. It has been proven that, except in trivial cases, many well-known notions of fairness are incompatible with each other and furthermore conflict with optimising accuracy .

Others criticize that the statistical approach only gives guarantees for the average of a group and not for individuals or even subgroups . Imagine a case where we only accept female figures from DC Comics and males from the Marvel universe. Here, we could set up the scenario such that we comply with a fairness criterion for females versus males and for DC figures versus Marvel figures, but without having a fair treatment of males from DC Comics or females from the Marvel universe. This problem is showcased by , both in the form of a toy example and in the form of a real case. The authors suggest a framework to learn a fair classifier for a rich set of subgroups to alleviate the problem. However, it does not overcome the fact that a division of people into groups is required.

Another criticism is the stationary property of statistical measures . Liu et al. made one-step simulations for a lending scenario and revealed that complying with Demographic Parity or Equalized Opportunity could leave the protected group worse off with respect to their underlying credit scores one step into the future. At the same time, a policy of unconstrained profit maximisation for the “lending company” would never lead to a scenario where the protected group was worse off one step into the future. D’Amour et al. expand the study by examining several settings and longer-term simulations, and find that fairness differs when looking at long-term dynamics compared to single-shot classification.

Yet another critique comes from , which demonstrates that fairness metrics cannot be used to distinguish whether a model is stereotyping members of a group. The metrics do not reveal if the model fits on reasonable attributes, like income, when the bias is not apparent in the model but introduced elsewhere in society, e.g. the fact that women receive lower income than men.

Despite these caveats, the strength of statistical measures of fairness is the ease of computing them, and, if applicable, they are also easy to achieve, e.g. by adjusting the classifier’s threshold. Neither the verification nor the adjustment requires assumptions about the data, in contrast to, for example, individual and counterfactual fairness. However, choosing and interpreting any disparity metric requires an understanding of how the bias occurs and a discussion of whether adjusting the model to meet the parity is the desired outcome.
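To illustrate how such a threshold adjustment might look in code, the sketch below (our own simplification with illustrative function and variable names, not the formal optimisation procedure referenced earlier) does a simple grid search over one threshold per group to minimise the True Positive Rate gap, breaking ties towards admitting more true heroes:

```python
import itertools
import numpy as np

def tpr(y_true, score, threshold):
    """True Positive Rate when scores at or above the threshold are predicted positive."""
    y_pred = (score >= threshold).astype(int)
    return y_pred[y_true == 1].mean()  # assumes each group has at least one true positive

def equal_opportunity_thresholds(y_true_f, score_f, y_true_m, score_m, grid=np.linspace(0, 1, 101)):
    """Pick one threshold per group so that the TPR gap is as small as possible (post-processing)."""
    def objective(pair):
        t_f, t_m = pair
        tpr_f = tpr(y_true_f, score_f, t_f)
        tpr_m = tpr(y_true_m, score_m, t_m)
        # First minimise the TPR gap, then break ties towards letting more true heroes in.
        return (round(abs(tpr_f - tpr_m), 3), -(tpr_f + tpr_m))
    return min(itertools.product(grid, grid), key=objective)

# Usage (with hypothetical NumPy arrays of labels and scores from an already trained classifier):
# t_female, t_male = equal_opportunity_thresholds(y_true_f, score_f, y_true_m, score_m)
```

Note that without the tie-breaking term, thresholds that admit almost no one would also satisfy the parity, echoing the incompatibility discussion above.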

We will return to the three statistical metrics in the section Perceived Algorithmic Fairness to examine how people perceive them and investigate people’s comprehension based on the survey results. But first, let us look at methods for mitigating biases.

Circulate the planet

… to defend it with all our superhero powers. Until now, this paper has listed different ways of defining fairness, with a particular focus on statistical measures in the form of three parity metrics. But now it is time to show the weapons for mitigating and monitoring biases. In the interactive diagram below, we illustrate the life-cycle of a machine learning system with its different phases and highlight methods to intervene or mitigate biases, which are revealed when clicking on a phase. A similar approach was presented by Suresh and Guttag, who assign six types of biases and their mitigation approaches to different phases in the ML life-cycle, and by Bellamy et al., who divide algorithms into pre-, in- and post-processing algorithms . Mehrabi et al. also present a review of different mitigation approaches, and Friedler et al. provide a comparison . Note that the typical life-cycle is circular, and even though a system is in production, i.e. in the “Model Monitoring” phase, it might be necessary to revisit other phases to deal with unwanted behaviour or model drift.

Perceived algorithmic fairness

What is your favourite super power

… or in other words, which is your favourite fairness criterion?

The question of which statistical fairness measure people perceive as more fair has previously been addressed in the literature, e.g. . Both Srivastava et al. and Harrison et al. examined through Amazon Mechanical Turk surveys which statistical fairness measure people perceive as most fair. The two studies used different experimental designs, respectively comparing ten example “figures” and comparing histograms of different “algorithms”. However, the question can be raised whether ten example figures are enough to generalise to a systematic bias and whether the participants fully understood the implications of the parity in the models. Saha et al. addressed the question of people’s comprehension of statistical measures and showed that comprehension can be measured through a multiple-choice survey. They found that in a hiring scenario, Equalized Opportunity was harder to understand than Demographic Parity, and that, in general, comprehension was correlated with the participants’ education level. An interesting finding was a tendency for people with low comprehension scores to express less negative sentiment towards a criterion. That survey’s design, where criteria are expressed as rules and questions are asked as multiple choice, has inspired the survey conducted in this paper. Our objectives were to measure comprehension and link it to the perception of fairness, and to examine the consistency in the answers and people’s self-reported understanding. In the following, we present the results of the survey.

A measure of comprehension

To measure the participants’ comprehension of each criterion, a score is computed as the percentage of correct answers to the comprehension questions (four questions per criterion). This yields an indication of the participants' understanding of the fairness criterion and it allows for a comparison of the comprehension between the three criteria. Figure 2 shows the distributions of the participants’ comprehension score on each fairness criterion.

Figure 2. Left: Boxplot showing the distribution over participants’ computed comprehension score for each criterion. Using a Mann-Whitney U test to test the null hypothesis that the pairwise distributions of comprehension scores are equal yields the following p-values: DP-EO: 0.020<0.05 , DP-PP: 0.007<0.05 , EO-PP: 0.305 \nless 0.05 . Hence, with a significance level of 0.05, we can reject the null hypothesis for the first two pairs and conclude that DP is easier to understand than both EO and PP. Right: Boxplot showing the distribution over participants’ self-reported comprehension score. We use the same test statistic to test for differences in distributions, resulting in the following p-values: DP-EO: 0.010<0.05 , DP-PP: 0.000<0.05 , EO-PP: 0.008<0.05 . Hence, the difference is significant for all pairs.

The distributions indicate that participants found Demographic Parity the easiest to understand. A pairwise Mann-Whitney U test reveals that the difference is significant between Demographic Parity and the two other criteria (p=0.020 for Equalized Opportunity and p=0.007 for Predictive Parity), but not between Equalized Opportunity and Predictive Parity. Therefore, we can conclude that Demographic Parity is easier to understand than both Equalized Opportunity and Predictive Parity in this survey setting. This supports the finding in that Equalized Opportunity is harder to understand than Demographic Parity, albeit in a different setting.

We also asked the participants to self-report their understanding of each criterion after answering the comprehension questions, using a five-point Likert-scale (see an example in this question). Each point is assigned a numeric value, where a higher value indicates a higher self-reported understanding. The distributions over the self-reported understanding are presented as boxplots in Figure 2. The pairwise comparisons all show a significant difference in self-reported comprehension (DP-EO: p=0.010, DP-PP: p=0.000, EO-PP: p=0.008), meaning that people report different levels of understanding for the three measures. Similar to the computed comprehension scores, the self-reported comprehension has the following order: 1.) Demographic Parity, 2.) Equalized Opportunity and 3.) Predictive Parity.

We examined whether there is an association between the average self-reported and the average computed comprehension score. A calculation of the Spearman correlation ( \rho=0.198) and the associated p-value (0.058) does not show a strong correlation between the self-reported and the computed comprehension score (see also Appendix A). Hence, we cannot conclude a significant association between how people perceive their understanding and how well they actually understand the criteria. No clear trend towards over- or underrating oneself was found in the data. We conclude by noting that it is difficult to discuss fairness if people are not aware of their own level of understanding.

The opinion towards the criteria

After the comprehension questions for each criterion, the participants were asked to state how fair they perceive the criterion to be on a five-point Likert-scale (see an example of the formulation in this question), i.e. they were asked to evaluate the fairness of each criterion independently of the other criteria. The Likert-scale is transformed to numeric values, and the boxplots in Figure 3 show the different distributions. The fairness assessment resulted in the following ranking: 1.) Equalized Opportunity, 2.) Predictive Parity, 3.) Demographic Parity.

Figure 3. Boxplots showing the distributions over participants' reported opinion on each criterion when asked on a 5-point Likert-scale whether they think the criterion is fair to use. Using a Mann-Whitney U test to test the null hypothesis that the pairwise distributions of perception scores are equal yields for DP-EO a p-value of 0.000 < 0.05, for DP-PP a p-value of 0.001 < 0.05 and for EO-PP a p-value of 0.022 < 0.05. With a significance level of 0.05, it can be concluded that the distributions of perception scores are different for all pairs.

In the survey conducted in , Saha et al. found that people with low comprehension tend to have a less negative opinion towards a fairness criterion. To investigate if the same trend is visible in the survey conducted here, we calculated the Spearman's correlation between perception and the computed comprehension score for each criterion. The results are shown in Table 2.

Table 2. Spearman correlation between comprehension and perception scores for three different statistical fairness criteria.
*Significant correlation with a p-value below 0.05.

For Demographic Parity we can see a significant negative correlation between comprehension and perception, i.e. the better people understand the criterion, the less fair they perceive it to be (see Table 2). This is in accordance with the results from Saha et al. . A weaker negative correlation can also be seen for Predictive Parity, but it is not significant (p=0.177 \nless 0.05). In contrast, for Equalized Opportunity there is a small positive correlation between comprehension and perception (not significant, with p=0.248 \nless 0.05), i.e. people who understand the criterion seem to like it more.

Superheroes live in a multiverse

… and so it seems to be the case with algorithmic fairness. Instead of only looking at which statistical measure is perceived as most fair, we need to broaden our understanding of the fairness universe. As we discussed earlier, bias can be introduced when interacting with the system, and hence the interaction with the system can in turn also affect the perceived fairness of the system. In general, there seems to be more to fairness than mathematical formulations, e.g. . For example, Grgić-Hlača et al. suggest shifting the focus from distributive fairness to procedural fairness, which instead of looking at the outcome focuses on the process that leads to an outcome. They operationalise part of the idea by examining which input features people perceive as fair to use in different scenarios. In our super figure example, we could easily imagine that this has a great impact on what is perceived as fair. Here, it is probably perceived as fair to look at superpowers, since we do not want a dangerous cocktail at the party. However, weight and height would seem less relevant and would therefore be perceived as discriminating to use, despite the fact that they could increase the model's accuracy. The study by Wang et al. also considers the development procedure. They examined how different factors affect the perception of fairness, such as the development procedure, the outcome for the individual and bias at a group level. They found that people are more sensitive to getting a favoured outcome themselves than to weighting bias against a group. From Saxena et al. it seems that historical bias could also influence perception. To sum up, group bias in a model, the development procedure, personal position and historical influence all seem to be factors that influence the perception of fairness in a system. In addition, the survey of this paper looks at how the actual formulations may impact opinions about fairness. In the next section, we will discuss some of the results from the survey which look at other factors that could impact fairness perception.

The preference for human or algorithmic judgement is subject to change

Human or machine, what would people prefer to be assessed by in different settings? This question was also raised by Harrison et al., who found a slight preference towards human judgement over a machine learning model . In our survey we investigated people’s opinion about using algorithmic or human judgement in the super figure setting, and explored how additional information can influence people’s opinion using the following three questions (the first question is replicated in this question and the two others are shown below):

The answers show that, initially, the majority (77%) of participants preferred the option of a “human judgment supported by an algorithm” when it is assumed that both the algorithm and the human act reasonably (see this question). However, this changes with the extra information that the human is occasionally biased against one sex. Under this assumption, the majority (56%) of participants preferred the algorithm, even though it is still, in general, assumed that both are acting reasonably. The third question assumes that the algorithm is much more accurate than the human but also biased. This shifts the preference of the majority (71%) back to “human judgment with algorithmic support”. Our survey has a small sample size, and representativeness is not accounted for. Nevertheless, it is interesting how relatively easily people’s opinions can be changed by adding some weak information about the system. In fact, 54% of the participants changed their choice throughout the three questions. One reason could be the human cognitive bias to weight weak information to which we have a positive or negative emotional reaction (here the human's or the algorithm's “bias”) more heavily than objective information (here the human's or the algorithm's accuracy) .

Participants are not consistent throughout their answers

In addition to people’s preference for human or algorithmic judgement, the survey also asked a more high-level question (see this question) regarding whether it is fair to use “a system that uses data and statistics” in the super figure setting. To this question, 22 out of 92 participants answered a clear “no”. But when asked whether they preferred human or algorithmic judgement, only 6 out of the 22 chose to trust the human alone. This number further decreased to 3 out of 22 when asked this question, where we informed participants that the human is occasionally biased against one sex. Although it can be argued that this does not reflect inconsistency, depending on how the formulations are understood, it might indicate that people are easily affected by the formulation of the questions.

An actual logical inconsistency is found when people are asked to rank the three fairness criteria discussed previously. First, the participants are asked to rank the three criteria formulated as rules shown in the answer-options in this question according to importance for achieving fairness. Immediately thereafter, they are asked to rank what would be the worst case of unfairness that could occur (see the question below). Here, the options are formulated as cases that do not comply with one of the three criteria – all three discriminating the same sex.

We expect that the ranking of the three criteria should be the same for each participant across the two questions. However, 57% of the participants do not keep the same ranking of the three criteria when they are formulated in different ways. We observed no significant difference between the comprehension of participants who rank consistently and that of participants who rank inconsistently (p=0.061 \nless 0.05), despite a slightly higher average comprehension in the consistent group (0.75 vs. 0.68).

What just happened?

... we all know the feeling of missing some point in Marvel’s The Infinity Saga, so let us take a recap. Based on the survey results, we can conclude that people prefer the criterion Equalized Opportunity in the super figure setting. But even though this might seem like a concrete answer to the question of fairness, it still does not settle the question completely. More generally, the survey results point towards people being sensitive to the formulations of fairness and unfairness, in some questions even to a degree that demonstrates inconsistency in their answers. In addition, the survey results did not show a statistically significant association between people's self-reported understanding and their comprehension as measured through their answers to the comprehension questions. In summary, the results point to the need for care in the debate about fairness, since formulations and wrongly estimated self-understanding may skew it.

The challenges ahead

Leaving the superhero universe

… and zooming out of the academic sphere of mathematical formulations. We will now look at the fairness challenges in practice. The fairness literature focuses on static settings which often do not resemble the challenges faced by practitioners in industry . Summarizing, for example, some points from Holstein et al. : 1) It is often assumed that the dataset is static and that fairness can only be achieved by changing the model. However, in practice data collection can be changed, and many practitioners point towards investigating and mitigating bias already at that step. Nevertheless, as seen in the machine learning life-cycle, the current literature mainly focuses on mitigating bias during the pre-processing or training steps. 2) The debiasing methods and metrics often do not apply in an actual context, e.g. because sensitive attributes cannot be accessed on an individual level. This can be addressed by building a classifier to predict the sensitive attribute, enabling the evaluation of fairness metrics; see e.g. the work of for a discussion of this approach.

In general, the problems in practice usually include dynamic scenarios instead of one-shot classification tasks, like web search, chatbots and systems that employ online learning, reinforcement learning or bandit problems . For example, Dwork and Ilvento show that even though models independently satisfy a group’s fairness metric, it is not given that a system that combines the models will, and D'Amour et al. show that achieving fairness in one-shot systems differs from achieving it in long-term scenarios. There is, therefore, a need for more methods to audit and monitor complex dynamic systems.

One idea, proposed for models in a Natural Language Processing setting, is an analogy to traditional “software testing”, or more precisely behaviour testing or black-box testing . The idea is to examine the model’s behaviour by providing a set of input-output tests without having access to the model itself. This can be used to check a model’s fairness in different scenarios without the need for metrics, due to the focus on concrete example cases. For example, in it was shown through a corpus of test sentences that several sentiment analysis systems contain significant bias with respect to gender and race. This was achieved by systematically testing the systems' output on sentences where the only word changed is gender- or race-related.
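A minimal version of such a behavioural test might look like the sketch below; the `sentiment` argument stands in for whatever black-box system is under test, and the templates, names and tolerance are purely illustrative:

```python
def behaviour_test(sentiment, tolerance=0.05):
    """Flag templates where the black-box sentiment score shifts between groups.

    `sentiment` is assumed to map a sentence to a score in [0, 1]; it is a stand-in
    for the real system under test.
    """
    templates = ["{} is a brilliant scientist.", "The movie about {} was terrible."]
    names = {"female": ["She", "Maria"], "male": ["He", "Peter"]}

    failures = []
    for template in templates:
        # Average the score over the substitutions for each group.
        scores = {
            group: sum(sentiment(template.format(n)) for n in group_names) / len(group_names)
            for group, group_names in names.items()
        }
        if max(scores.values()) - min(scores.values()) > tolerance:
            failures.append((template, scores))
    return failures

# Usage: failures = behaviour_test(my_sentiment_system)  # hypothetical system under test
```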

Lastly, it is worth mentioning that a general position within the field is that the challenges cannot be solved by technical means alone. Instead, we require cross-disciplinary collaboration between computer science, law, the social sciences and the humanities, which for example the ACM Conference on Fairness, Accountability, and Transparency is aiming for. It is also discussed in how different disciplines can work together to solve challenges of fairness and bias in algorithms. Furthermore, Barabas et al. point out that solving issues of discrimination requires changing the way systems are designed and encouraging data scientists to reflect on their choices .

To be continued

… as every movie must come to an end, so does our article. We have provided an overview of different ways to evaluate and define algorithmic fairness and bias, as well as different methods of mitigating bias throughout the ML life-cycle. We then discussed challenges with the currently proposed approaches with respect to people’s comprehension and perception, and with respect to applying fairness-mitigating methods in practice. As discussed in the previous section, the work on algorithmic fairness is far from done. We therefore expect more work on algorithmic fairness in the future and hope that this article will inspire new research in the field. As with superhero movies, the next one is usually already waiting around the corner.

Acknowledgments

We are grateful to Marie Rørdam Fenger for believing in and supporting us in this work. We would also like to thank Anders Ringsborg Hansen for his work on the article's interactive questions. For help with the survey in the form of feedback or distribution, we would like to thank Kasper Fænø Bay Noer, Julie Rasmussen, Kristian Tølbøl Sørensen, Mads Schaarup Andersen, Bente Larsen and Lisa Lorentzen. We are indeed also thankful to all the people who took the time to participate in the survey.

This research is partly supported by a performance contract allocated to the Alexandra Institute by the Danish Ministry of Higher Education and Science.

The interactive box listing fairness definitions is inspired by and based on this distill article.

Author Contributions

Both authors contributed equally.

Appendix A: The survey

The conduct of the survey: The survey was conducted from December 2020 to January 2021, and the participants were volunteers recruited through social media, focusing on accounts/pages with a Computer Science background. The participants were motivated by the fact that they contribute to research on Fair AI and by the possibility of winning a symbolic prize. In total, 92 people participated. The participants were asked for some demographic information, which yielded 40% male and 60% female participants, of whom 88% are Danish. They were asked to self-report their level of experience in "statistics and/or machine learning" and in "ethics and/or legal philosophy" on a five-point Likert-scale. The participants were more experienced in statistics/machine learning, with a median of 3 ("Moderately experienced"), than in ethics and legal philosophy, with a median of 2 ("Slightly experienced").

The statistical tests: Different statistical tests are reported in the paper. This part of the appendix elaborates on the choice of tests and the assumptions required. A Mann-Whitney U test is applied in several cases: 1) to test for differences in the computed and self-reported comprehension scores between pairs of fairness criteria and 2) to test for differences in the opinion towards different criteria. This test statistic is chosen since the observations of each variable do not appear to be drawn from a normal distribution. Normality is examined visually through boxplots and QQ-plots and by performing a Shapiro-Wilk test . As an example, the computed comprehension score for Demographic Parity does not seem to come from a normal distribution when examining the two plots in Figure A1. Further, with a Shapiro-Wilk test statistic of 0.837 and an associated p-value of 0.000, the null hypothesis of normality can be rejected, and hence there is evidence that the data is not normally distributed. This check was repeated for the other sets of observations (fairness criteria).

Figure A1. Boxplot and QQ-plot for visually inspecting whether the data (the observations of the computed comprehension score for Demographic Parity) come from a normal distribution. The plots suggest that the data is indeed not normally distributed.

However, the variables fulfil the Mann-Whitney U test assumptions, since they are ordinal and independent. We test the null hypothesis H0: the distribution of comprehension scores between two criteria is equal. We choose a significance level of \alpha=0.05. Obtaining a p-value under this level leads us to reject the null hypothesis and conclude that there is a significant difference in the distributions of the two tested variables.

The paper also reports the Spearman rank-order correlation coefficient between 1) computed and self-reported comprehension scores, and 2) computed comprehension scores and perception scores. This measure is chosen because the observations do not follow a normal distribution but are ordinal. The associated test is performed with the null hypothesis stating no correlation, again with a significance level of \alpha=0.05.
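For reference, the tests described above can be computed with SciPy roughly as follows; the score arrays are small placeholders standing in for the survey's per-participant values:

```python
import numpy as np
from scipy.stats import mannwhitneyu, shapiro, spearmanr

# Placeholder comprehension scores for two criteria and self-reported understanding (illustrative).
dp_scores = np.array([1.0, 0.75, 0.75, 1.0, 0.5, 0.75, 1.0, 0.25])
eo_scores = np.array([0.5, 0.75, 0.25, 0.75, 0.5, 0.25, 1.0, 0.5])
self_reported = np.array([4, 3, 5, 4, 2, 3, 5, 2])

print(shapiro(dp_scores))                                           # normality check motivating a non-parametric test
print(mannwhitneyu(dp_scores, eo_scores, alternative="two-sided"))  # difference between two score distributions
print(spearmanr(self_reported, dp_scores))                          # rank correlation, e.g. self-reported vs. computed
```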

Appendix B: The superhero classifier

Details of data preparation and training: A simple classifier is trained to distinguish between hero and villain. The data is from the Kaggle superhero dataset, which is a collection of information about super figures extracted from the Superhero Database. The data contains information about the super figures' names, characteristics and possession of superpowers. Only super figures that are either villain or hero and either male or female are considered in this setup. This yields 613 super figures, which are randomly split into a train and an evaluation set (30%). The following input features are used in training: height, weight, a one-hot encoding of 167 different powers, and the names of the super figures transformed into count vectors of n-grams of size 3 and 4. In the variables height and weight, negative values are replaced with mean values conditioned on the gender. Note that gender is not used as an explicit feature during training, i.e. the model applies fairness through unawareness. In the example, gender is considered a protected attribute. The classifier is trained using the scikit-learn implementation of logistic regression with the default values, except that the regularization strength parameter is set to C=0.4 and ‘balanced’ is used for ‘class_weight’. The classifier obtained an accuracy of 0.72 and a macro-F1 of 0.66 on the evaluation set.
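A rough sketch of this setup in scikit-learn is shown below. It is not the exact training script: the file path, column names and the choice of character-level n-grams for the names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumed layout: numeric height/weight, one-hot power columns with a "power_" prefix,
# a name column, a binary target (1 = hero) and a gender column kept only as the protected attribute.
df = pd.read_csv("superheroes_prepared.csv")  # hypothetical path to the prepared Kaggle data
power_cols = [c for c in df.columns if c.startswith("power_")]

features = ColumnTransformer([
    ("numeric_and_powers", "passthrough", ["height", "weight"] + power_cols),
    ("name_ngrams", CountVectorizer(analyzer="char", ngram_range=(3, 4)), "name"),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(C=0.4, class_weight="balanced")),
])

# Gender is excluded from the inputs (fairness through unawareness) but kept for evaluating parities.
X = df.drop(columns=["is_hero", "gender"])
X_train, X_eval, y_train, y_eval = train_test_split(X, df["is_hero"], test_size=0.3, random_state=0)

model.fit(X_train, y_train)
print("Evaluation accuracy:", model.score(X_eval, y_eval))
```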