User Guides


How to use the Best Practices in daily work

Hint
Do not be daunted by the size of the Guidelines. It is okay to start by just looking up the parts relevant to the questions or dilemmas you currently have.

General Advice

Risk Assessment

Perform a risk assessment of your Machine Learning operations. Start by identifying the following three aspects of your organisation; understanding them will help you appreciate the context of your activities.

  1. The Domain of your organisation;
  2. The demographics of your users and the people you will affect;
  3. Whether your products are involved in any of the following: a) physical and mechanical environments, b) making decisions about people, or c) influencing people's behaviour.

Certain concepts of the Best Practices will be more relevant than others in different contexts. Identify which concepts are the most important for your context based on your answers to the above three questions.

Apply a risk assessment to the concepts identified as important above, then perform a gap analysis based on it. After identifying the subjects with the highest relevance to your current work in this way, find the relevant sections in the Best Practices and start bridging the gap between your current practice and the Best Practices.

Policies

The policies described in the Organisation Guidelines can and should be created centrally so that all product teams can take advantage of them without having to reinvent the wheel. Is your product team pioneering the use of the Best Practices within your organisation? Connect your work to the Organisation Guidelines and take the initiative to create these policies for your organisation.


Situation-based Advice

We are currently working on advice, tips and tricks that are specific to certain situations: for example, advice based on an organisation's size or maturity, on a product's risk rating, or aimed at specific roles in your organisation. Please check back often!

User Guide: Choice of Fairness Metric

Introduction

In the responsible development of machine learning decision-making systems, the question often arises: “what fairness metric(s) should I assess my system on during development and after deployment?” Selecting a suitable fairness notion for a given machine learning decision-making system is not easy. It is, however, important. If used incorrectly, common notions of fairness may lead to unintentional and potentially harmful consequences, especially for historically disadvantaged populations (Bakalar, 2021). In this user guide, we provide some considerations to help inform the choice of a fairness metric. Functionally, this document serves as a deeper dive into concepts raised, but not fully explored, in the Technical Best Practices. The relevant controls from the Technical Best Practices are referenced in each section.

The Foundation for Best Practices in Machine Learning acknowledges that fairness is complex, multidimensional, and inherently context-dependent. Our Technical Best Practices recommend consulting domain experts when determining which fairness metrics are contextually most appropriate. The purpose of fairness testing is to identify and mitigate the risk of disproportionately unfavorable outcomes for protected populations, to prevent reinforcing social inequalities, and to ensure compliance with anti-discrimination laws and regulations.

While mathematical measures of fairness are insufficient to fully address the issue of fairness in machine learning decision-making systems, properly chosen fairness metrics can help improve model equity. Using these metrics before deployment can uncover areas of concern within the data or the model, allowing issues to be addressed in the pre-production environment. Post-production, these metrics can help detect differences between production and training, as well as shifts in the data, possibly caused by feedback loops with the model (for more on these topics, see the Technical Best Practices, in particular sections 11, 15 and 20; for example 15.3.4, 20.2.1, 20.2.3, 20.2.4, 20.3.2, 20.4.1 and 20.4.3).

The choice of which metric to use for fairness testing is especially challenging given that more than twenty different fairness metrics have been proposed (Narayanan, 2018), some of which are mutually incompatible (Friedler, 2016; Mitchell, 2020; Kleinberg, 2018). We limit our scope to group fairness metrics that ensure some form of statistical parity for members of different protected groups (e.g. sex, age, ethnicity, disability status).

This document is intended to be used as a guide. The purpose of the following non-exhaustive list of considerations is to guide the discussion about which fairness notions and metrics are appropriate for which decision-making scenarios. Read through each consideration and determine if it applies to your use case and context (e.g. sociocultural context, and local legal and regulatory frameworks). Where you find alignment, follow the guidance.

Considerations

In the following considerations, we assume that the decision-making system is a machine learning regression or classification model whose predictions or decisions directly affect human subjects. Our aim is to ensure the fair treatment of two or more predetermined groups whose members differ with respect to one protected attribute. The fairness considerations are phrased as yes/no questions that you should ask yourself with respect to your use case and application domain, and on the basis of which certain group fairness metrics are either recommended or discouraged. If appropriate, these considerations and their answers should be included in the documentation for the model.

1. Does your model face legal or regulatory requirements?

In some settings, users of algorithmic models face regulatory requirements that motivate use of certain fairness metrics over others. Such regulations vary by locality, by industry, and can vary over time if laws (or regulators’ interpretations of laws) are changed. In the USA, for example, models that affect provision of credit or employment are highly regulated by specific anti-discrimination laws while models used in social media or sales/marketing are nearly entirely unregulated. Across industries and countries, there are relatively few laws concerning algorithmic fairness and even fewer that specify or imply use of specific fairness metrics, but if such laws do exist, they should always be the starting point for choice of metric(s). Subject matter experts and attorneys in relevant fields should be consulted to ensure that a model builder’s thoughtful consideration of fairness metric choice is in alignment with relevant laws.

2. Are the outcomes of the system rankable?

The outcomes of a system are rankable if a preponderance of subjects would have identical preferences over the possible outcomes they could receive under the model. For example, loan applicants will prefer to be granted rather than denied a loan. A model subject receiving a credit score will prefer a higher score rather than a lower one. Such a decision is referred to as polar (Paulus, 2020; Kent, 2020). When decision-making systems allocate limited resources, or when competing interests exist between system owner and system subject, the outcomes of those systems are almost always rankable.

Machine learning system outcomes are not always rankable. For example, consider a patient with knee pain receiving an x-ray scan, where the scan outputs will be processed by a computer vision model and presented to a physician. Does the patient prefer to receive a high score (indicating a high probability of developing osteoarthritis), or a lower score? If the patient is wrongly given a high score, they may be subject to unnecessary intrusive interventions. In this case, the model subject is much more concerned about the accuracy of the system, rather than just the outcome that they receive.

Some fairness metrics are only applicable if outcomes are rankable. You should ask - ‘Does this metric assume that one outcome is desirable in the eyes of the model subjects?’ Measures of disparate impact fall in this category - the adverse impact ratio and Cohen’s d (standardized mean difference), for example. Most metrics are applicable whether or not model outcomes are rankable, including those that measure equality of odds, calibration, and conditional statistical parity.
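To make these measures concrete, the short sketch below computes the adverse impact ratio on binary decisions and Cohen's d (standardized mean difference) on continuous scores. It is a minimal illustration only; the data, the 0.5 decision threshold and the variable names are hypothetical and not part of the Best Practices.

  # Minimal sketch: disparate impact measures on hypothetical data.
  # 'decisions' are binary model outcomes, 'scores' are continuous outputs,
  # 'group' marks membership in a protected group ("A" = reference, "B" = protected).
  import numpy as np

  rng = np.random.default_rng(0)
  group = np.array(["A"] * 500 + ["B"] * 500)
  scores = np.concatenate([rng.normal(0.6, 0.15, 500), rng.normal(0.55, 0.15, 500)])
  decisions = (scores >= 0.5).astype(int)          # favourable outcome = 1

  def adverse_impact_ratio(decisions, group, protected, reference):
      # Ratio of favourable-outcome rates; values far below 1 indicate disparate impact.
      rate_p = decisions[group == protected].mean()
      rate_r = decisions[group == reference].mean()
      return rate_p / rate_r

  def cohens_d(scores, group, protected, reference):
      # Standardized mean difference between the two groups' continuous scores.
      a, b = scores[group == reference], scores[group == protected]
      pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
      return (a.mean() - b.mean()) / pooled_sd

  print("Adverse impact ratio:", adverse_impact_ratio(decisions, group, "B", "A"))
  print("Cohen's d:           ", cohens_d(scores, group, "B", "A"))

Note that both measures use only the model outputs and group membership; they make sense precisely because one outcome (or a higher score) is assumed to be preferable.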

Which metrics are applicable to rankable situations is one question; which metrics we recommend is another. For rankable decisions, group fairness is appropriate given that your aim is to have similar rates of “beneficiaries” among different categories of subjects.

3. Will the model output be subject to multiple thresholds or use cases?

If the usage of your model changes, it should be re-tested for fairness concerns. If you are unsure about all final usages of your model, it is best to test a wide range of possible decision thresholds.

A classification model may be used in a scenario where the decision threshold is not fixed, but rather tuned to accommodate a changing context. For example, a child maltreatment risk assessment system may tune the threshold for intervention depending on the available capacity in foster homes. The following metrics are suitable for decision making scenarios with floating thresholds:

  • If emphasis is on precision: calibration
  • If emphasis is on recall: balance for positive class or balance for negative class

Helpful controls: 3.3 Societal Context, 13.4.2 Output Interpretations, 14.1.1 Real World Robustness
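As an illustration of the threshold-free metrics listed above, the following minimal sketch checks balance for the positive class (mean score among truly positive subjects, per group) and a simple binned calibration comparison; both are computed on the scores themselves and therefore remain meaningful when the decision threshold floats. The data and function names are hypothetical assumptions, not part of the Best Practices.

  # Minimal sketch: threshold-free checks for a score-producing classifier.
  # y_true = ground-truth labels, y_score = model scores, group = protected attribute.
  import numpy as np

  rng = np.random.default_rng(1)
  n = 1000
  group = rng.choice(["A", "B"], size=n)
  y_true = rng.binomial(1, 0.3, size=n)
  y_score = np.clip(0.3 * y_true + rng.normal(0.35, 0.15, size=n), 0, 1)

  def balance_for_positive_class(y_true, y_score, group):
      # Mean score among truly positive subjects, per group (should be similar).
      return {g: y_score[(group == g) & (y_true == 1)].mean() for g in np.unique(group)}

  def calibration_by_bin(y_true, y_score, group, bins=5):
      # Observed positive rate per score bin, per group (should track the bin's scores).
      edges = np.linspace(0, 1, bins + 1)
      out = {}
      for g in np.unique(group):
          mask = group == g
          idx = np.clip(np.digitize(y_score[mask], edges) - 1, 0, bins - 1)
          out[g] = [y_true[mask][idx == b].mean() if (idx == b).any() else None
                    for b in range(bins)]
      return out

  print(balance_for_positive_class(y_true, y_score, group))
  print(calibration_by_bin(y_true, y_score, group))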

4. Are the base-rates of the decision outcome similar across subpopulations?

The base rate is the proportion of the positive condition in a population (Makhlouf, 2020). For example, the base rate of certain diseases (e.g. diabetes) is approximately the same for men and women; whereas the base rates of other diseases (e.g. prostate cancer, breast cancer) differ between men and women.
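As a minimal, hypothetical illustration in code, the base rate is simply the per-group prevalence of the positive condition:

  # Minimal sketch: base rate (prevalence of the positive condition) per group.
  import numpy as np

  y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])      # hypothetical positive condition
  group  = np.array(["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"])

  base_rates = {g: y_true[group == g].mean() for g in np.unique(group)}
  print(base_rates)   # {'F': 0.2, 'M': 0.6} -> unequal base rates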

Differences in the base rate between populations are the fundamental reason why many group fairness metrics are mutually exclusive. If the base rates are equal between populations of interest, then it becomes possible to ‘distribute’ both the false negatives and false positives fairly between the populations, for many definitions of ‘fair’ simultaneously. If the base rates are not equal, it is mathematically impossible to satisfy certain combinations of group fairness metrics simultaneously; for example, an imperfect model cannot be calibrated within groups while also achieving balance for both the positive and the negative class (Kleinberg, 2018).

Therefore, if the base rates between all relevant populations differ significantly, be aware that the choice of fairness metric goes hand-in-hand with an assessment of which aspect of fairness should prevail in the given context. Make this assessment consciously and explicitly, and record the decision-making process. Consideration 7 of this document will be of particular importance in these cases.

See control: 13.2.4 Selection Function for more information on determining base rates. If base rates are unavailable or of dubious quality, the above warnings remain relevant. In this case, consider what would be the safer assumption to make. One may also consider counterfactual fairness here.

5. Is the outcome discrete or continuous?

The outcome of a model-based process refers to the final result that affects the person or subject being modeled. Such outcomes can be discrete (yes/no, high/medium/low) or continuous (a score between 300-850, a probability between 0-100%). Oftentimes, models will output a continuous score like a probability and that score will be subject to some transformation to result in a discrete outcome. For example, if an airline uses a model to predict the likelihood that a customer becomes a frequent flyer and sends offers of discounted airfare to people with a probability above 15%, the outcome is whether the customer receives the offer of discounted airfare. While a responsible model builder could test whether there are fairness concerns with the continuous probability output, the primary outcome for fairness testing in this example should be the offer of discounted airfare because that outcome actually results in different treatment for customers. In general, the final outcome in a model-based process should be the focal point of fairness testing.

Many fairness metrics can only be used for either discrete or continuous output. Metrics that rely on confusion matrix statistics (True/False Positive/Negative) such as Equalized Odds can only be applied to discrete outcomes. Other metrics like Standardized Mean Difference rely on differences in averages, which are more appropriately applied to continuous outcomes. A quick examination of the mathematical definition of a fairness metric will usually reveal whether the metric relies on counts of discrete “Favorable”/“Unfavorable” outcomes or whether it can accommodate a continuous score.
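A small, hypothetical sketch of the airline example above: the model output is a continuous score, but the outcome that subjects actually experience is the discrete offer produced by the 15% threshold, so group comparisons should be made on that final outcome. The score distributions and names below are assumptions for illustration only.

  # Minimal sketch: fairness testing on the final (post-threshold) outcome.
  import numpy as np

  rng = np.random.default_rng(2)
  group = rng.choice(["A", "B"], size=2000)
  p_frequent_flyer = np.where(group == "A",
                              rng.beta(2, 9, size=2000),    # hypothetical score distributions
                              rng.beta(2, 11, size=2000))

  offer = (p_frequent_flyer > 0.15).astype(int)             # the outcome subjects experience

  offer_rates = {g: offer[group == g].mean() for g in np.unique(group)}
  print("Offer rate per group:", offer_rates)               # compare these, not only the raw scores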

6. Is there a sample size disparity?

The sample size of a population (the number of individuals included in the dataset), or rather the difference in sample size between populations, has two important effects with regard to model accuracy and model fairness. Firstly, the population with a significantly larger sample size will typically dominate the model’s overall accuracy score and (therefore) be more accurately modelled. This could mean minorities will receive lower-quality decisions on average, even if a relevant fairness metric has been optimized. Differences in model quality across subpopulations (i.e. differential validity) should be evaluated if there are concerns that one subpopulation will “dominate” the algorithm.

Secondly, smaller sample sizes make measurements of bias more uncertain. In other words, it becomes more difficult to establish both the magnitude of the bias and its statistical significance with high fidelity.

If some relevant populations have a significantly and problematically smaller sample size, the correct and responsible methodology is to formulate the null hypothesis that minority populations are disadvantaged on both fronts; rejecting this null hypothesis then requires a statistically significant measurement of non-bias. In other words, the burden of proof lies with the majority population.
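To make the uncertainty point concrete, the sketch below bootstraps a confidence interval for the difference in favourable-outcome rates between a large majority sample and a small minority sample. The data, sample sizes and names are hypothetical; the point is that a wide interval means an apparent absence of bias may simply reflect a lack of statistical power.

  # Minimal sketch: uncertainty of a bias measurement under a sample size disparity.
  import numpy as np

  rng = np.random.default_rng(3)
  n_major, n_minor = 5000, 80                       # hypothetical sample size disparity
  dec_major = rng.binomial(1, 0.50, size=n_major)   # favourable-outcome indicators
  dec_minor = rng.binomial(1, 0.45, size=n_minor)

  def bootstrap_rate_difference(a, b, n_boot=2000, rng=rng):
      diffs = []
      for _ in range(n_boot):
          ra = rng.choice(a, size=len(a), replace=True).mean()
          rb = rng.choice(b, size=len(b), replace=True).mean()
          diffs.append(ra - rb)
      return np.percentile(diffs, [2.5, 97.5])

  ci = bootstrap_rate_difference(dec_major, dec_minor)
  print("95% CI for rate difference (majority - minority):", ci)
  # A wide interval that straddles 0 does not establish the absence of bias.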

Controls related to robustness and fidelity: 14.3.2, 14.3.3, 14.3.6, 14.3.8

7. How harmful is a false negative/positive?

If an algorithm is used to classify subjects, meaning to predict whether subjects belong in some positive or negative class, it is common to evaluate performance using a confusion matrix. For example, a diagnostic model predicting whether an individual suffers from diabetes will classify people as either diabetic or non-diabetic. In certain scenarios, including the one just described, false negative and false positive classifications can carry dramatically different consequences for the modelled subject. A test indicating that the patient does have diabetes when in reality they do not (a false positive) would likely lead to additional testing and emotional stress until the truth was discovered on further examination. A test indicating that the patient does not have diabetes when in reality they do (a false negative) could prevent the patient from receiving treatment they need, which could potentially be life threatening.

Fairness testing can be flawed if differential consequences are not taken into account. In the diagnostic test example, testing a model for differential validity (equal predictiveness among subpopulations) using accuracy as a performance metric could obscure important fairness concerns. It is possible for a model to create more harmful consequences for a disadvantaged group, despite being just as accurate for that group (as measured by the proportion of correct classifications), due to a higher false negative rate, for example. It could be more appropriate to test for equality in sensitivity or true positive rate, which ensures that among patients with diabetes, the same proportion are being correctly classified as diabetic (this is also known as equal opportunity testing). If the emphasis is on the false positives, use predictive parity. If the emphasis is on the false negatives, use equal opportunity. If both false negatives and false positives are equally important, use equalized odds.
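The sketch below computes, per group and on hypothetical data, the confusion-matrix rates behind the metrics named above: the true positive rate (compared across groups by equal opportunity), the positive predictive value (predictive parity), and the false positive rate (together with the true positive rate, equalized odds). The classifier and data here are assumptions for illustration.

  # Minimal sketch: per-group confusion-matrix rates behind the metrics named above.
  import numpy as np

  rng = np.random.default_rng(4)
  n = 2000
  group = rng.choice(["A", "B"], size=n)
  y_true = rng.binomial(1, 0.3, size=n)
  y_pred = np.where(rng.random(n) < 0.8, y_true, 1 - y_true)   # hypothetical noisy classifier

  def group_rates(y_true, y_pred, group):
      rates = {}
      for g in np.unique(group):
          t, p = y_true[group == g], y_pred[group == g]
          tp = ((t == 1) & (p == 1)).sum()
          fp = ((t == 0) & (p == 1)).sum()
          fn = ((t == 1) & (p == 0)).sum()
          tn = ((t == 0) & (p == 0)).sum()
          rates[g] = {
              "TPR": tp / (tp + fn),   # equal opportunity compares this across groups
              "FPR": fp / (fp + tn),   # equalized odds compares TPR and FPR
              "PPV": tp / (tp + fp),   # predictive parity compares this across groups
          }
      return rates

  print(group_rates(y_true, y_pred, group))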

The following controls in the Technical Best Practices relate to the real world consequences of false classifications, and differences between false negatives and false positives: 11.3.9, 11.3.5, 13.4.1, 19.3.2. See also http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/ for a useful tool for assessing fairness metrics based on this context.

8. Is ground truth available?

The ground truth is the true observed outcome corresponding to a given subject. It should be distinguished from inferred, and possibly subjective, outcomes that are recorded in historical data (Makhlouf, 2020). An example of a scenario where the ground truth is not available is deciding whether a job applicant should be hired by a technology company: for rejected applicants, the company never observes whether they would have performed well in the role.

In cases where there is observation bias (we do not observe ground truth outcomes for all individuals, only for those who were given a positive outcome in the past) and/or proxies are used instead of ground truth, there is an increased risk of model outcomes disproportionately harming historically marginalized groups. Refer to consideration 9 for more discussion of how past decisions shape future data. In such cases, any metrics that utilize the actual outcomes in their formulation will be invalid, since whether any instance is a true or false positive (or negative) is not known. Valid metrics to use are those that include only the predictions and group membership, such as disparate impact metrics.

When the whole ground truth is available (or the proxies are of very high quality), disparate impact is no longer the most suitable notion of unfairness, particularly when disproportionately beneficial outcomes for certain sensitive attribute value groups are justified by that ground truth. In such situations, enforcing disparate impact risks introducing reverse discrimination against qualified candidates (which is unlawful).

Example: An algorithmic system developed by Optum (owned by insurance company UnitedHealth) that is widely used in American hospitals to allocate more personalized healthcare to high-risk patients was observed to systematically discriminate against black people. Black patients were generally assigned lower risk scores than equally sick white people, as a result of which black patients were less likely to be referred to clinical engagement programs. The algorithm assigned risk scores to patients on the basis of total health-care costs accrued in one year (along with other factors such as clinical expertise). The model was built with the assumption that healthcare costs are representative of a patient’s health needs, which seemed reasonable because high healthcare costs are generally associated with serious health conditions. Yet, due to the difference in the prevalence of health insurance between the groups, black patients tend to have lower healthcare costs than white patients despite being more affected by chronic conditions such as diabetes, anaemia, and kidney failure. Thus, while high healthcare costs are associated with serious health conditions, the proxy is more predictive of health insurance coverage than of serious health conditions.

Related controls: 13.2.4 Selection Function, 13.3.1 Target Subjectivity, 13.3.2 Target Proxies

9. Will your model outputs affect data collected in the future?

Feedback loops arise when the output of a model influences its future inputs. Because an algorithm's predictions can influence the newly obtained observations, retraining models on those observations can result in self-amplifying effects.

Consider the example of a bank which offers loans to people who obtain a high algorithmic credit score. The group which is rated as financially unstable is denied access to financial products, which might further deteriorate their financial situation. If the model is retrained after some period of time, the people denied by the first model will be given an even lower credit score this time, closing the loop. Another example of an automated decision-making system that is prone to feedback loop behavior is a predictive policing system that may be used to determine which neighborhoods to patrol in order to prevent crime. Predictive policing models are trained on historical crime data, and, once a decision has been made to patrol a certain neighborhood, crime observed in that neighborhood will be fed into the model for the next round of decision-making. A feedback loop arises because crime will only be accounted for in neighborhoods that police officers have been previously sent to by the predictive policing model itself (Ensign, 2019). Given that historical crime data usually points to impoverished, ethnically diverse neighborhoods, such a feedback loop risks reinforcing existing social inequalities.
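The toy simulation below, which is entirely hypothetical and not part of the original text, illustrates the self-amplifying mechanism of the credit example: a group scored just below the approval threshold is denied credit, which in this toy model erodes the signal observed at the next round, so its score drifts further below the threshold with each iteration.

  # Toy simulation of a feedback loop: denial today lowers the observed
  # creditworthiness used to score the same group tomorrow. Entirely hypothetical.
  import numpy as np

  rng = np.random.default_rng(5)
  health = {"A": 0.63, "B": 0.57}          # latent "financial health" per group
  THRESHOLD = 0.60

  for round_ in range(1, 6):
      scores = {g: h + rng.normal(0, 0.005) for g, h in health.items()}
      approved = {g: s >= THRESHOLD for g, s in scores.items()}
      # Denied groups lose access to credit, which (in this toy model) erodes
      # the very signal that the next scoring round will observe.
      for g in health:
          health[g] += 0.01 if approved[g] else -0.02
      print(round_, {g: round(s, 3) for g, s in scores.items()}, approved)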

Avoiding, or accounting for, feedback loop behavior after deployment is a requirement for systemic stability, which is a form of robustness related to the model’s interaction with society (see section 20 of the Technical Best Practices). Responsibly accounting for feedback loop behavior is covered in control 15.3.4 (awareness of feedback loops) and, more generally, in all controls of section 20, most specifically 20.2.1, 20.2.3, 20.2.4, 20.3.2, 20.4.1 and 20.4.3 of the Technical Best Practices.

10. Are you deploying your machine learning decision making system in a context where a given population is known to have been discriminated against?

YES → substantive equality

Bias transforming metrics should be used in contexts where significant disparity (that is not legally justified) has been previously observed between populations (Wachter, 2021). Bias transforming metrics do not blindly accept the status quo as a given that should be preserved: their aim is to ensure substantive equality. Group fairness or statistical (demographic) parity, conditional independence or conditional statistical (demographic) parity, and counterfactual fairness are bias transforming metrics (Wachter, 2021). Bias transforming metrics are satisfied by matching decision rates between groups. For example, group fairness is satisfied if positive decisions are made at the same rate across the relevant groups so that there are equal proportions of each group in each outcome class.

NO → formal equality

Bias preserving metrics ensure formal equality (Wachter, 2021). Bias preserving fairness metrics seek to reproduce historic performance, as reflected in the training data, in the decision making system’s outcomes. It is implicitly assumed that various forms of bias in the historic data are there for a reason and should be preserved. Predictive parity, equalized odds, equal opportunity or false negative error rate balance, false positive error rate balance, and calibration are all bias preserving metrics. Bias preserving metrics require matching error rates between groups (Wachter, 2021). For example, equalized odds requires the ratio of true positive to false negative decisions to be the same across groups, and for the ratio of true negatives to false positives to also be matched across groups.
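As a hypothetical illustration of the distinction, the sketch below computes one bias transforming quantity (the statistical parity difference, i.e. the gap in positive decision rates) and one bias preserving quantity (the gap in true positive rates used by equal opportunity and equalized odds) for the same classifier; with unequal base rates the two can diverge. The data and names are assumptions for illustration only.

  # Minimal sketch: a bias transforming check vs. a bias preserving check.
  import numpy as np

  rng = np.random.default_rng(6)
  n = 2000
  group  = rng.choice(["A", "B"], size=n)
  y_true = rng.binomial(1, np.where(group == "A", 0.4, 0.25))   # unequal historic base rates
  y_pred = np.where(rng.random(n) < 0.85, y_true, 1 - y_true)   # hypothetical classifier

  # Bias transforming: statistical (demographic) parity difference on the decisions.
  parity_gap = y_pred[group == "A"].mean() - y_pred[group == "B"].mean()

  # Bias preserving: equal opportunity difference (gap in true positive rates).
  tpr = {g: y_pred[(group == g) & (y_true == 1)].mean() for g in ("A", "B")}
  tpr_gap = tpr["A"] - tpr["B"]

  print("Statistical parity difference:", round(parity_gap, 3))
  print("True positive rate difference:", round(tpr_gap, 3))
  # With unequal base rates, the parity gap can be large even when the TPR gap is small.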

Out of Scope

While the scope of this document is limited to group fairness metrics, they are not the only fairness notions that should be considered when developing machine learning decision-making systems. Below is a list of additional considerations and concepts that are out of scope for the current document since they do not directly relate to the choice of fairness metric. Nonetheless, they should be considered when evaluating the fairness of a variety of decision-making scenarios.

  • Individual fairness is a concept that aims to ensure that similar individuals receive similar outcomes, regardless of their protected attributes. The similarity of a pair of individuals is defined in terms of a distance metric that is difficult to choose in practice. Models that incorporate randomness into the decision-making process, for example, can violate notions of individual fairness and result in scenarios where otherwise identical model subjects receive different outcomes. While models lacking in individual fairness may not violate anti-discrimination laws (which generally consider whether outcomes for one group are on average worse than another), they can still lead to outcomes that violate other laws or are intuitively offensive.
  • Counterfactual fairness states that a decision is fair toward an individual if it coincides with the one that would have been taken in a counterfactual world in which the sensitive attribute of the individual had been different. This approach assumes prior knowledge of the data generating process and of the causal relationships between predictive features. We do not address counterfactual fairness in this document as it is not directly tied to specific metrics. However, counterfactuals specifically address two key areas of ethical concern with AI systems - causality and opaqueness. “The Ladder of Causation, consisting of (i) association (ii) interventions and (iii) counterfactuals, is the Rosetta Stone of causal analysis,” (Pearl, 2020). In addition to providing a foundational system for causal analysis, counterfactuals have increasingly become a key focus for explainability methods for AI systems (Chou, et al. 2021).
  • Feedback loops are constituted when the output of a model influences its future inputs, and retraining on data shaped by the model's own decisions can result in self-amplifying effects. This topic is discussed in full under consideration 9 above, including the predictive policing example (Ensign, 2019) and the relevant controls in the Technical Best Practices (15.3.4 and, in section 20, specifically 20.2.1, 20.2.3, 20.2.4, 20.3.2, 20.4.1 and 20.4.3).
  • Intersectional fairness consists in evaluating fairness across intersections of two or more protected attributes. For example, “black women” is the intersection of a protected group relating to race and of another protected group relating to gender. We do not include intersectionality with the main considerations because it can be evaluated using any group or individual fairness metric. One approach to intersectionality could be to evaluate fairness across every possible intersection of protected attributes. However, the number of subgroups to consider grows exponentially with the number of attributes considered, which makes it difficult to inspect each subgroup for fairness due to data sparsity issues (Morina, 2020). As a result, intersectional group fairness approaches may not scale well, and may be at a risk of detecting spurious relationships and/or overfitting (Binns, 2019). Well-established group fairness metrics can be adapted to account for intersectionality using the differential fairness framework introduced by Foulds et al. (Foulds, 2019): statistical parity (Foulds, 2019), false positive rate and true positive rate parity (Morina, 2020). Multi-calibration can also be used to ensure calibration across subgroups of a protected population (Hebert-Johnson, 2018). Although a large majority of the algorithmic fairness literature thus far has focused on fairness with respect to a single sensitive attribute, in practice, ensuring intersectional fairness is a requirement for responsible machine learning decision-making systems.
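Building on the intersectionality point above, the minimal hypothetical sketch below evaluates a simple group metric (the selection rate) over the intersections of two protected attributes; note how quickly the subgroup sizes shrink, which is the data sparsity issue mentioned above. The attributes, data and names are assumptions for illustration only.

  # Minimal sketch: a group fairness metric evaluated over intersections of two attributes.
  import numpy as np

  rng = np.random.default_rng(7)
  n = 3000
  race   = rng.choice(["black", "white"], size=n)
  gender = rng.choice(["woman", "man"], size=n)
  y_pred = rng.binomial(1, 0.35, size=n)                 # hypothetical favourable decisions

  rates = {}
  for r in np.unique(race):
      for g in np.unique(gender):
          mask = (race == r) & (gender == g)
          rates[(r, g)] = (y_pred[mask].mean(), int(mask.sum()))   # (selection rate, subgroup size)

  for subgroup, (rate, size) in rates.items():
      print(subgroup, "selection rate:", round(rate, 3), "n =", size)
  # Subgroup sizes shrink relative to n; with more attributes they shrink exponentially.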

Conclusion

Quantitative fairness testing is a critical piece of responsible model development and deployment. In some locales and application domains, the question of ‘which fairness metric should I evaluate?’ is strictly dictated by laws and regulations. For everyone else, this guide offers a series of considerations that should inform the appropriate choice of such metric(s). It is our hope that this guide, used in conjunction with the Technical Best Practices and in consultation with legal and domain experts, can help narrow down the appropriate choice of metrics for your application.

This guide does not directly address many closely-related questions, including: Is it appropriate to use proxies for group information if I do not have it available for every observation? Which groups, or combinations of groups, should I evaluate metrics on?

This latter question is especially significant, since social harm should not be analyzed as affecting marginalized groups along individual dimensions (race, sex, gender expression, etc). That discriminatory practices disproportionately affect people who occupy multiple marginalized groups has been demonstrated across many domains, such as health care and employment. Recently, Buolamwini and Gebru showed that some commercially available facial recognition systems have substantial gender classification accuracy disparities, with darker-skinned women being the most misclassified group.

For the answers to these questions and others, we again direct the interested reader to our Technical Best Practices, as well as the academic literature cited herein. Finally, we have provided below a collection of useful resources from industry leaders and academic groups.