5. Parity Measures¶
We will now look specifically at preliminary notions of fairness applied to decision making systems powered by a supervised classifier. We begin with observational criteria: measurements of what exists and is observable. Observational criteria, like identifying differences in the distribution of salaries between men and women, don’t assert anything meaningful about the nature of the discrepancy, about why such differences exist, or about whether they constitute discrimination. On the other hand, identifying such differences is a first step in surfacing possible inequities and understanding why they exist (e.g. via a causal investigation).
In this lecture, we will focus on allocative algorithmic decisions made by a supervised, binary classifier. Such a context provides a simple, yet important, setting in which to begin understanding these ideas. Much of this perspective follows Chapter 2 of [BHN19].
We fix notation for this context, which will be used in the remainder of the chapter:
the observed characteristics of an individual (variables), denoted by \(X\),
a category containing a salient group (e.g. race) specified by \(A\),
the binary outcome variable denoted \(Y\),
the decision making classifier \(C(X,A)\).
Recall that each of these depends on human choices and hides the complex, dirty processes of measurement and interpretation.
We start by focusing on notions of the average quality of a classifier and examine properties of common evaluation metrics for classifiers in machine learning. On their own, these measures give us tools to pose questions about notions of sufficiency (as a concept in opposition to equality).
The next section will discuss parity measures, which quantify notions of equality we discussed before.
We then will revisit the tradeoff between maximizing aggregate utility and maintaining equality among groups.
Lastly, we will look at ways these parity measures affect different steps of the modeling pipeline.
5.2. Quantitative metrics for evaluation¶
First, we’ll review measures for evaluating a classifier, as these are technically and conceptually the building blocks of parity measures. The table below nicely summarizes the most common evaluation metrics. In particular, it organizes measures by the population each conditions on (i.e. the denominator):
The total population (top row)
The predicted condition (right side)
The actual condition/outcome (bottom)
| Total population | Condition positive | Condition negative | Prevalence = Σ Condition positive / Σ Total population | Accuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population |
| --- | --- | --- | --- | --- |
| Predicted condition positive | True positive | False positive (Type I error) | Positive predictive value (PPV), Precision = Σ True positive / Σ Predicted condition positive | False discovery rate (FDR) = Σ False positive / Σ Predicted condition positive |
| Predicted condition negative | False negative (Type II error) | True negative | False omission rate (FOR) = Σ False negative / Σ Predicted condition negative | Negative predictive value (NPV) = Σ True negative / Σ Predicted condition negative |
| | True positive rate (TPR), Recall, Sensitivity, probability of detection, Power = Σ True positive / Σ Condition positive | False positive rate (FPR), Fall-out, probability of false alarm = Σ False positive / Σ Condition negative | Positive likelihood ratio (LR+) = TPR / FPR; Diagnostic odds ratio (DOR) = LR+ / LR− | F1 score = 2 · Precision · Recall / (Precision + Recall) |
| | False negative rate (FNR), Miss rate = Σ False negative / Σ Condition positive | Specificity (SPC), Selectivity, True negative rate (TNR) = Σ True negative / Σ Condition negative | Negative likelihood ratio (LR−) = FNR / TNR | |
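To make the table concrete, here is a minimal sketch that computes these metrics from the four confusion-matrix counts. The function name and the counts are hypothetical, chosen purely for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Return the common evaluation metrics for one confusion matrix."""
    total = tp + fp + fn + tn
    cond_pos = tp + fn   # actual condition positive
    cond_neg = fp + tn   # actual condition negative
    pred_pos = tp + fp   # predicted condition positive
    pred_neg = fn + tn   # predicted condition negative
    return {
        "prevalence": cond_pos / total,
        "accuracy":   (tp + tn) / total,
        "tpr":        tp / cond_pos,   # recall / sensitivity
        "fpr":        fp / cond_neg,   # fall-out
        "fnr":        fn / cond_pos,   # miss rate
        "tnr":        tn / cond_neg,   # specificity
        "ppv":        tp / pred_pos,   # precision
        "npv":        tn / pred_neg,
    }

# Hypothetical counts, just to exercise the definitions.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(m["accuracy"])  # 0.7
```

Note the complementary pairs built into the table: TPR + FNR = 1 and TNR + FPR = 1, since each pair shares a denominator.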
We will now look at some of these metrics in context.
Bail Decisions (COMPAS) We (re)examine the case of decisions for pretrial release. Here, \(X\) is the COMPAS survey and the defendant’s criminal history, \(A\) is race/ethnicity, \(Y\) is whether the individual will re-offend, and \(C(X, A)\) decides if the defendant should be denied (1) or given (0) pretrial release.
Pay attention to what the denominator represents in each case.
A false positive keeps the defendant in custody unnecessarily,
A false negative releases someone who goes on to re-offend,
Accuracy is the % of decisions that were correctly made,
TPR is the proportion of those who would have re-offended that were (correctly) denied release,
FPR is the proportion of those who would not have re-offended, but were unnecessarily kept in custody,
FNR is the proportion of those who would have re-offended, who were released.
PPV is the proportion of those denied release that would have actually re-offended.
As mentioned in the last lecture, false positives and false negatives are qualitatively very different, making accuracy an inappropriate measure for algorithm quality. When is accuracy appropriate? When the false positives and false negatives have similar interpretations.
Lending Decisions Suppose you are a bank that must decide whether to make loans to individuals. Here \(X\) is an individual’s financial history, \(A\) is any legally protected salient group, \(Y\) is whether a loan would be paid back, and \(C(X,A)\) decides if the bank should make the loan (1) or not (0). Unlike the Bail Decisions case, the positive category is a good outcome.
A false positive is a granted loan that goes on to default.
A false negative is a denied loan to someone who could pay it back.
TPR is the proportion of people who could pay back loans that were actually granted loans.
FPR is the proportion of people who would default that were granted loans.
FNR is the proportion of people who could pay back loans that were actually denied loans.
PPV is the proportion of granted loans that were paid back.
Note that both false positives and false negatives are bad for both parties:
A false positive costs the bank because they were not paid back and costs the lendee because defaulting on a loan causes financial harm.
A false negative costs the bank because they lose the opportunity to collect interest on a loan that would be paid back and costs the lendee because they were denied access to credit they deserve.
Hiring or Admissions Decisions Suppose you are a university making decisions on which students to accept to its program. Here \(X\) is the applicant’s academic record, \(A\) is a legally protected salient group, \(Y\) represents whether the applicant would be successful in the program, and \(C(X,A)\) decides whether to admit (1) or reject (0) an applicant. Note that the outcome \(Y\) is not well-defined. We will define ‘successful’ as ‘whether the applicant would graduate from the program’; this is a proxy that deserves your critique.
A false positive is an admitted applicant who wouldn’t graduate.
A false negative is a rejected applicant who would have graduated from the program.
TPR is the proportion of those who would graduate who were admitted to the program.
FPR is the proportion of those who wouldn’t graduate who were admitted.
FNR is the proportion of those who would have graduated who were rejected from the program.
PPV is the graduation rate of admitted students.
This example, while instructive, is only half of the picture. Most programs have capacity limits that also influence who is admitted. A program may instead use such an algorithm to rank applicants, so that it can select the most highly ranked students. We will look at this (very similar) case later.
Gender Shades Suppose an algorithm attempts to classify the identified gender of a person from an image. While there are many potential problems with such algorithms, they exist for a variety of reasons. For example, identifying the proportion of women displayed in the result of a Google image search may require using such an algorithm. Here \(X\) is a picture of a person, \(A\) is race, \(Y\) is whether the person identifies as a woman, and \(C(X, A)\) decides if a given image represents a woman (1) or not (0).
Note that though this \(Y\) is commonly used, it’s not a well-defined binary outcome that every individual conforms to. In fact, forcing such categorization onto individuals has broad, lasting negative impacts. This case is meant to surface this problem and how it can more severely impact people of color. See Gender Shades for a more detailed story.
A false positive is a person who the algorithm misgenders as a woman.
A false negative is a person who the algorithm misgenders as not a woman.
TPR is the proportion of those who identify as women who were correctly categorized as such.
FPR is the proportion of those who do not identify as women who were categorized as women.
FNR is the proportion of those who identify as women who were misgendered.
PPV is the proportion of those algorithmically categorized as women who identify as women.
5.3. Parity Measures¶
A parity measure is a simple observational criterion that requires one of the evaluation metrics to be independent of the salient group \(A\). Such measures are static and don’t take a changing population into account. If a parity criterion is satisfied, it does not mean that the algorithm is fair. These criteria are reasonable for surfacing potential inequities and are particularly useful in monitoring live, decision-making systems.
5.3.1. Demographic Parity¶
The condition for demographic parity is, for all \(a,b\in A\):
\[
\mathbb{P}(C = 1 \mid A = a) = \mathbb{P}(C = 1 \mid A = b)
\]
Demographic parity requires equal proportion of positive predictions in each group (“No Disparate Impact”).
The evaluation metric requiring parity in this case is the prevalence of positive predictions.
Legally, disparate impact is a one-sided reformulation of this condition, where 80% disparity is an agreed upon tolerance decided in the legal arena:
\[
\frac{\mathbb{P}(C = 1 \mid A = a)}{\mathbb{P}(C = 1 \mid A = b)} \geq 0.8
\]
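A minimal sketch of checking this condition, assuming binary predictions and group labels are stored as plain Python lists (the helper names and data are hypothetical):

```python
def positive_rate(C, A, group):
    """P(C = 1 | A = group): proportion of positive predictions in a group."""
    decisions = [c for c, a in zip(C, A) if a == group]
    return sum(decisions) / len(decisions)

def disparate_impact_ratio(C, A, group_a, group_b):
    """One-sided ratio of positive rates; >= 0.8 passes the 80% rule."""
    ra, rb = positive_rate(C, A, group_a), positive_rate(C, A, group_b)
    return min(ra, rb) / max(ra, rb)

# Hypothetical decisions: group "x" gets 3/4 positives, group "y" gets 2/4.
C = [1, 1, 1, 0, 1, 1, 0, 0]
A = ["x", "x", "x", "x", "y", "y", "y", "y"]
print(disparate_impact_ratio(C, A, "x", "y"))  # 0.5 / 0.75 ≈ 0.667, fails the 80% rule
```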
In the examples given above, the demographic parity criterion translates to:
(COMPAS) Same proportion of “bail denied” in each group (race).
(Lending) Same proportion of “loans granted” in each group.
(Admissions) Same admission rate among each group.
(Gender Shades) Same proportion of people categorized as women in each group (race).
A few comments about demographic parity:
When demographic parity seems appropriate, what’s being assumed about the underlying distribution (prevalence)? Why does demographic parity not make sense in the Gender Shades context?
This condition doesn’t say anything about the quality of the predictions for each group. In fact, imposing demographic parity improperly may hide harmful impacts. For example, if an admissions classifier admits the majority group systematically but admits minority applicants at random, the impact would be a lower graduation rate for the minority group, at the expense of minority applicants who would have been more successful in the program.
If the prevalence is uneven across groups, a perfect classifier will not satisfy demographic parity.
5.3.2. Accuracy Parity¶
The condition for accuracy parity is, for all \(a,b\in A\):
\[
\mathbb{P}(C = Y \mid A = a) = \mathbb{P}(C = Y \mid A = b)
\]
Accuracy parity requires equal accuracy across groups.
The evaluation metric requiring parity in this case is the accuracy.
While this condition fixes some shortcomings of demographic parity, it doesn’t distinguish between error types. In the case of COMPAS, accuracy parity was approximately satisfied; the algorithm makes up for detaining releasable Black defendants by wrongly releasing white defendants. Accuracy parity is appropriate when the impacts of false positives and false negatives are similar.
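Accuracy parity can be checked by computing accuracy within each group. A minimal sketch with hypothetical labels, chosen so that accuracy parity holds while the error types differ (mirroring the COMPAS discussion above: one group's only error is a false positive, the other's a false negative):

```python
def group_accuracy(C, Y, A, group):
    """P(C = Y | A = group): accuracy restricted to one group."""
    pairs = [(c, y) for c, y, a in zip(C, Y, A) if a == group]
    return sum(c == y for c, y in pairs) / len(pairs)

# Hypothetical data: both groups are 75% accurate, but group "x"'s lone error
# is a false positive while group "y"'s lone error is a false negative.
C = [1, 1, 1, 0,  0, 0, 0, 1]
Y = [1, 1, 0, 0,  0, 0, 1, 1]
A = ["x"] * 4 + ["y"] * 4
print(group_accuracy(C, Y, A, "x"), group_accuracy(C, Y, A, "y"))  # 0.75 0.75
```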
5.3.3. Equality of Odds (TP-Parity and FP-Parity)¶
Equality of Odds is satisfied when both True Positive Parity and False Positive Parity are satisfied.
The condition for True Positive parity is, for all \(a,b\in A\):
\[
\mathbb{P}(C = 1 \mid Y = 1, A = a) = \mathbb{P}(C = 1 \mid Y = 1, A = b)
\]
True positive parity is sometimes called equality of opportunity, as it requires that the population that benefits from the decision (\(Y = 1\)) is given that opportunity at equal rates, regardless of the salient group. However, this is not the only notion of equality of opportunity translated into parity measures.
We leave it to the reader to derive the definition for FP-parity.
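Equality of odds can be checked by computing the TPR and FPR within each group and comparing. A minimal sketch with hypothetical data (the helper names are made up for illustration):

```python
def rate(C, Y, A, group, y):
    """P(C = 1 | Y = y, A = group)."""
    decisions = [c for c, yy, a in zip(C, Y, A) if a == group and yy == y]
    return sum(decisions) / len(decisions)

def equalized_odds_gaps(C, Y, A, group_a, group_b):
    """Return (TPR gap, FPR gap); both near zero means equality of odds."""
    tpr_gap = abs(rate(C, Y, A, group_a, 1) - rate(C, Y, A, group_b, 1))
    fpr_gap = abs(rate(C, Y, A, group_a, 0) - rate(C, Y, A, group_b, 0))
    return tpr_gap, fpr_gap

# Hypothetical data: group "x" has TPR 2/3 and FPR 0; group "y" has TPR 1/2 and FPR 1/2.
C = [1, 1, 0, 0,  1, 0, 1, 0]
Y = [1, 1, 1, 0,  1, 1, 0, 0]
A = ["x"] * 4 + ["y"] * 4
tpr_gap, fpr_gap = equalized_odds_gaps(C, Y, A, "x", "y")
print(tpr_gap, fpr_gap)
```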
5.3.4. Predictive Value Parity (PPV-Parity and NPV-Parity)¶
Predictive value parity is satisfied when both PPV-parity and NPV-parity are satisfied.
The condition for PPV-parity is, for all \(a,b\in A\):
\[
\mathbb{P}(Y = 1 \mid C = 1, A = a) = \mathbb{P}(Y = 1 \mid C = 1, A = b)
\]
PPV-parity equalizes the chance of success, given a positive prediction. For example, in the admissions example, PPV-parity requires graduation rates to be equal across groups (of admitted students).
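A minimal sketch of checking PPV-parity on hypothetical admissions data, where \(C = 1\) means admitted and \(Y = 1\) means the applicant would graduate:

```python
def ppv(C, Y, A, group):
    """P(Y = 1 | C = 1, A = group): e.g. graduation rate among a group's admits."""
    outcomes = [y for c, y, a in zip(C, Y, A) if a == group and c == 1]
    return sum(outcomes) / len(outcomes)

# Hypothetical data: group "x" admits graduate at 3/4, group "y" admits at 1/2,
# so PPV-parity fails.
C = [1, 1, 1, 1, 0, 0,  1, 1, 0, 0]
Y = [1, 1, 1, 0, 1, 0,  1, 0, 0, 1]
A = ["x"] * 6 + ["y"] * 4
print(ppv(C, Y, A, "x"), ppv(C, Y, A, "y"))  # 0.75 0.5
```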
COMPAS Revisited In the story on Machine Bias, there is a back-and-forth over how to measure the quality of the COMPAS algorithm.
Northpointe, in audits, used accuracy parity and TP-parity.
ProPublica asserted unfairness using FP-parity.
The organizations’ differing interests explain their different choices of parity measures. Northpointe’s clients were institutions in the criminal justice system, which largely take ‘tough on crime’ stances and worry about accidentally releasing a re-offender. ProPublica was advocating for the defendants and prison inmates.
5.4. Impossibility Results¶
A natural question one might immediately ask after this onslaught of parity conditions is: can we require all of these conditions to hold simultaneously? Would that cover all of our bases? As we saw above, different statistics encode different values, and these values sometimes conflict. The following result shows that this conflict is fundamental:
Fix a non-perfect, binary classifier \(C(X, A)\) and outcome \(Y\). If the prevalence of \(Y\) across \(A\) is not equal, then:
If \(A\) and \(Y\) are not independent, then Demographic Parity and Predictive Value Parity cannot simultaneously hold.
If neither \(A\) nor \(C\) is independent of \(Y\), then Demographic Parity and Equalized Odds cannot simultaneously hold.
If \(A\) and \(Y\) are not independent, then Equalized Odds and Predictive Value Parity cannot simultaneously hold.
Note that these hypotheses hold for most classifiers in real contexts:
Classifiers are almost never perfect predictors.
Base-rates of outcomes rarely are equal across groups.
\(A\) and \(Y\) are usually associated when issues of fairness are relevant for the group in question.
\(C\) and \(Y\) are usually associated, if your classifier is any good.
The proof of the theorem follows from two algebraic identities (\(p\) is the prevalence):
\[
\mathbb{P}(C = 1) = p \cdot \mathrm{TPR} + (1 - p) \cdot \mathrm{FPR},
\]
\[
\mathrm{FPR} = \frac{p}{1-p} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}} \cdot \mathrm{TPR}.
\]
You can verify these identities from the definitions of the evaluation metrics. Can you verify the proof? (Write out the relevant quantities for each group and attempt to make the needed quantities equal).
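The verification can also be done numerically. A small sketch checking two standard identities consistent with the definitions above, on arbitrary hypothetical counts:

```python
# Numeric check of two identities behind the impossibility result
# (p is the prevalence; the counts are arbitrary and hypothetical):
#   (1) P(C = 1) = p * TPR + (1 - p) * FPR
#   (2) FPR = p / (1 - p) * (1 - PPV) / PPV * TPR

tp, fp, fn, tn = 40, 10, 20, 30
total = tp + fp + fn + tn
p = (tp + fn) / total                 # prevalence of Y = 1
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
ppv = tp / (tp + fp)
pred_pos_rate = (tp + fp) / total     # P(C = 1)

assert abs(pred_pos_rate - (p * tpr + (1 - p) * fpr)) < 1e-12   # identity (1)
assert abs(fpr - p / (1 - p) * (1 - ppv) / ppv * tpr) < 1e-12   # identity (2)
print("both identities hold")
```

Since the identities tie \(\mathbb{P}(C=1)\), TPR, FPR, and PPV together through the prevalence \(p\), equalizing any two of these quantities across groups with unequal prevalence forces the others apart.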
How do you interpret this result? Imperfect classifiers will necessarily require trade-offs. In the example of the college admissions algorithm, the trade-off between Equalized Odds and Predictive Value Parity translates as follows: when admitting more of the qualified applicants who are able to graduate in a given group (Equalized Odds), the imperfect predictor will also admit unqualified candidates who lower the graduation rate for that group (breaking Predictive Value Parity).
In the next section, we will interpret all of these concepts using our frameworks of distributive justice.