FM fairness / bias issues

Bias

In this session, our readings cover:

Required Readings:

Evaluating and Mitigating Discrimination in Language Model Decisions

More Readings:

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

Machine Learning in development: Let’s talk about bias!

Exploring Social Bias in Chatbots using Stereotype Knowledge (WNLP@ACL 2019)

Bias and Fairness in Large Language Models: A Survey

A Survey on Fairness in Large Language Models

Blog: In this session, our blog covers:

Bias and Fairness in Large Language Models

1     Formal Definition of Bias and Fairness (LLM context)

1.1   Preliminaries

1.2   Social Bias and Fairness

1.3   Bias in NLP Tasks

1.4   Fairness Constraints

2     Taxonomy of Metrics used to evaluate Bias

2.1   Facets of Metrics

2.2   Taxonomy of Metrics based on What They Use

2.3   Embedding-based Metrics

2.4   Probability-based Metrics

2.5   Generated Text-Based Metrics

3     Taxonomy of Datasets used to evaluate Bias

3.1   Counterfactual Inputs

3.2   Prompts

4     Taxonomy of Techniques used to mitigate Bias

4.1   Pre-processing Mitigation

Pre-processing mitigations modify model inputs (data and prompts) without changing the trainable parameters.

Data augmentation techniques seek to neutralize bias by adding new examples to the training data that extend the distribution for under- or misrepresented social groups, as sketched below.
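A minimal sketch of counterfactual-style augmentation (our own illustration, not code from the survey): paired demographic terms are swapped to create counterfactual copies of existing examples, which are then added back to the training corpus.

```python
# Illustrative counterfactual-style data augmentation (a sketch, not the survey's code).
# Gendered terms are swapped to create counterfactual copies of each training sentence.

GENDER_PAIRS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",   # note: real CDA needs POS-aware handling of "her"/"his"
    "his": "her",
    "man": "woman", "woman": "man",
    "father": "mother", "mother": "father",
}

def counterfactual(sentence: str) -> str:
    """Swap each gendered token for its paired counterpart."""
    return " ".join(GENDER_PAIRS.get(tok.lower(), tok) for tok in sentence.split())

def augment(corpus: list[str]) -> list[str]:
    """Extend the corpus with a counterfactual copy of every sentence."""
    return corpus + [counterfactual(s) for s in corpus]

print(augment(["he is a brilliant doctor", "the nurse said she was tired"]))
```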

4.2   In-Training Mitigation

In-training mitigations aim to modify the training procedure to reduce bias. They modify the optimization process by:

+   changing the loss function
+   updating next-word probabilities in training
+   selectively freezing parameters during fine-tuning
+   identifying and removing specific neurons that contribute to harmful outputs

Architecture modification covers changes to the configuration of a model, including the number, size, and type of layers, encoders, and decoders. Examples include:

—   debiasing adapter modules (ADELE) inserted into the model to mitigate gender bias
—   ensemble models, such as gated networks, which may also enable bias mitigation

Distance-based embeddings: 

Projection-based embeddings: 

Mutual information-based embeddings: 

Attention-based embeddings: 
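As a rough illustration of the distance-based category above (a hedged sketch under our own assumptions, not the survey's exact formulation), an auxiliary loss term can penalize asymmetry in how close a neutral term's embedding sits to each member of a protected-attribute pair:

```python
import torch
import torch.nn.functional as F

def distance_debias_loss(neutral_emb: torch.Tensor,
                         group_a_emb: torch.Tensor,
                         group_b_emb: torch.Tensor) -> torch.Tensor:
    """Penalize a neutral term (e.g., "doctor") being closer to one
    protected-group term (e.g., "he") than to the other (e.g., "she")."""
    sim_a = F.cosine_similarity(neutral_emb, group_a_emb, dim=-1)
    sim_b = F.cosine_similarity(neutral_emb, group_b_emb, dim=-1)
    return (sim_a - sim_b).pow(2).mean()

# During fine-tuning this term would be added to the task loss, e.g.
#   loss = task_loss + fairness_weight * distance_debias_loss(doctor, he, she)
doctor, he, she = torch.randn(3, 768), torch.randn(3, 768), torch.randn(3, 768)
print(distance_debias_loss(doctor, he, she))
```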

Future research can better understand which components of LLMs encode, reproduce, and amplify bias to enable more targeted in-training mitigations.

4.3   Intra-Processing Mitigation

Intra-processing mitigations take a pre-trained (and perhaps fine-tuned) model as input and modify the model's behavior without further training or fine-tuning, generating debiased predictions at inference; as such, these techniques may also be considered inference-stage mitigations.
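A toy example of an inference-time adjustment (our own illustrative sketch, not a specific method from the survey): redistribute next-token probability mass so that paired demographic terms are scored equally at each decoding step.

```python
def equalize_pairs(next_token_probs: dict[str, float],
                   pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Average the probabilities of paired demographic tokens before sampling."""
    probs = dict(next_token_probs)
    for a, b in pairs:
        if a in probs and b in probs:
            probs[a] = probs[b] = (probs[a] + probs[b]) / 2.0
    return probs

# Example: the model strongly prefers "he" over "she" as the next token.
step_probs = {"he": 0.30, "she": 0.05, "the": 0.40, "a": 0.25}
print(equalize_pairs(step_probs, [("he", "she")]))   # both become 0.175
```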

4.4   Post-Processing Mitigation

Post-processing mitigation refers to post-processing of model outputs to remove bias.
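One simple post-processing strategy (an illustrative sketch of the general idea, not a method prescribed by the survey) is a keyword-based rewriter that neutralizes gendered terms in generated text before it is returned:

```python
import re

# Illustrative mapping; a real system would use a broader, validated lexicon.
NEUTRAL_MAP = {
    "chairman": "chairperson",
    "policeman": "police officer",
    "stewardess": "flight attendant",
    "mankind": "humankind",
}

def neutralize(text: str) -> str:
    """Replace gendered terms in a model's output with neutral alternatives."""
    for biased, neutral in NEUTRAL_MAP.items():
        text = re.sub(rf"\b{biased}\b", neutral, text, flags=re.IGNORECASE)
    return text

print(neutralize("The chairman thanked the stewardess for her help."))
# -> The chairperson thanked the flight attendant for her help.
```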

4.5   Open Problems and Challenges

Evaluating and Mitigating Discrimination in Language Model Decisions

1   Language Model for Decision Making

1.1   Use Cases

Language models are now being used to make a variety of decisions, many of which are important and high-stakes in nature.

One area where language models are being considered is societal decision-making. Some examples include:

In the medical field, language models can be used for:

In the field of academics and standardized testing, language models are used for:

Clearly, such decisions have massive, widespread consequences for people’s lives and livelihoods. An immediate concern is whether discrimination can be introduced by use of language models for these decisions.

Thus, it becomes crucial to proactively anticipate and mitigate any potential risk of discrimination in these decisions.

1.2   Paper Overview

The paper “Evaluating and Mitigating Discrimination in Language Model Decisions” by Tamkin et al. aims to: 1) evaluate the potential for language model discrimination across different applications, and 2) generate a diverse set of hypothetical prompts that people could use to query models for automated decision-making. Each prompt instructs the model to make a hypothetical binary decision about a particular person described in the prompt.

An overview of the approach the authors took can be seen in the following image:

The approach can be split into 4 steps, which are described in more detail below.

Step 1: Generating Decision Topics

First, the authors prompt an LLM with an initial prompt. This initial prompt asks the LLM to provide examples of decision problems, giving it a few seed examples from finance, law, education, etc.

The authors iteratively generate more topics by providing the language model’s responses as further context, and asking for more generated examples.

The following image shows the prompts used for generating default decision questions.

An analysis of the generated questions shows that there are 70 decision questions, ranging from higher risk to lower risk.

Human validation was also performed, with raters asked to rate each question’s overall quality. The average score was 4.76 out of 5.

Step 2: Generating Template Questions

The next step is to generate decision question templates with placeholders for demographic information. To do this, the language model was provided a prompt specifying the desired structure and content of the templates.

The LLM is given an example template, with placeholders for age, race, and gender. The prompt instructs the model to generate a template for a different decision topic which uses these placeholders. In this way, they ensure that the question is a yes or no question.

The following image shows how generation of question templates was completed:

Step 3: Filling the Templates

The third step is to actually fill the templates. The nature of the decision templates allows for the creation of multiple versions of the same decision prompt, where the demographics of the subject are the only variables that change.

The language model is used to insert random combinations of age, race, and gender into the placeholders. The following image shows how the templates are filled:

Step 4: Generating Decisions

Finally, the language model is used to generate the decisions for the different decision prompts. The language model is asked to answer a given question with either “yes” or “no.”

2   Assess Discriminative Effect

2.1.   Mixed Effect Model

We fit a mixed-effects linear regression model to estimate the discriminative effect of demographic attributes on model decisions.

2.2.   Discrimination Score (DS)

The Discrimination Score (DS) is defined as $\beta + \mu$, where $\beta$ is the fixed-effect coefficient and $\mu$ is the random-effect coefficient for a demographic group, both measured relative to the baseline group.
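A minimal sketch of such a fit using statsmodels, with synthetic stand-in data and a simplified random-effect structure (the paper's exact model specification may differ):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the filled decision templates: one row per prompt,
# with the model's yes/no decision encoded as 0/1.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "decision": rng.integers(0, 2, n),
    "age": rng.choice([20, 40, 60, 80], n),
    "race": rng.choice(["white", "Black", "Asian", "Hispanic"], n),
    "gender": rng.choice(["male", "female", "non-binary"], n),
    "template_id": rng.integers(0, 20, n),
})

# Mixed-effects linear regression: fixed effects for demographics relative to
# the 60-year-old white male baseline, random effects grouped by template.
model = smf.mixedlm(
    "decision ~ I(age - 60) + C(race, Treatment('white')) + C(gender, Treatment('male'))",
    data=df,
    groups=df["template_id"],
)
result = model.fit()

print(result.fe_params)       # fixed-effect coefficients (beta)
print(result.random_effects)  # per-group random effects (mu)
```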

2.3.   Positive & Negative Discrimination in Claude

We see patterns of positive and negative discrimination in Claude, given that the reference group is a 60-year-old white male.

The pattern of discrimination scores also holds for decision questions in the explicit demographic setting.

2.4.   Prompt Sensitivity

To evaluate prompt sensitivity, we test how varying the format and style of our prompts affects model decisions.

For variations in question style and format, we use six different formats that alter the original decision templates.

2.4.1   First Person Phrasing

We rephrased the scenario in first-person perspective, changing pronouns to “I” and “me” instead of third-person.

2.4.2   Formal bulleted list

We rewrote the details as a bulleted list of factual statements written in a formal, detached style.

2.4.3   Pro-con list

We rewrote the information in the question as a list, formatting the key facts as bullets under “Pros” and “Cons” headers.

2.4.4   Emotional phrasing

We added emotional language, such as “I really just want to make the right call here” and “This choice is incredibly important.”

2.4.5   Sloppy rewrite

We introduced typos, lowercase letters, and omitted words to make the prompt appear informal and sloppily written.

2.4.6   Use coded language

We incorporated subtle coded demographic language, such as “looking for a clean-cut all-American type”. This evaluates our model’s sensitivity to subtle potential indications of discriminatory preferences from users.

2.5.   Effect of Prompt Variation

The patterns of discrimination score are consistent across prompt variations.

3   Prompt Designing: Mitigation Techniques

3.1.   Appending statements to prompts

We append various statements to the end of prompts:

When the prompt is written from the first-person perspective, the model emphasizes accuracy and takes less risk. Biases are injected through the data: when a dataset associates higher risk with a particular race or gender, risk-mitigating decisions become more biased against that group. Coded language also deserves attention, as it can push the model toward biased decisions for a certain group.
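For illustration (a sketch with paraphrased wording; the paper's exact intervention text may differ), appending such a statement is a simple prompt transformation:

```python
# Hedged sketch: intervention names follow the paper, wording is paraphrased.
INTERVENTIONS = {
    "illegal_to_discriminate": (
        "Note that it is illegal to take demographic attributes such as race, "
        "gender, or age into account when making this decision."
    ),
    "ignore_demographics": (
        "Please ignore all demographic information about the person and base "
        "the decision only on the relevant facts."
    ),
}

def apply_intervention(decision_prompt: str, name: str) -> str:
    """Append a debiasing statement to the end of a decision prompt."""
    return decision_prompt.rstrip() + "\n\n" + INTERVENTIONS[name]

prompt = "Should this applicant be approved for the loan? Answer with yes or no."
print(apply_intervention(prompt, "ignore_demographics"))
```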


3.2.   Results

As shown in Figure 5, several of the interventions we explore are quite effective, especially Illegal to discriminate, Ignore demographics, Illegal + Ignore. Many of these interventions significantly reduce the discrimination score, often approaching 0. Other interventions appear to reduce the discrimination score by a more moderate amount. These results demonstrate that positive and negative discrimination on the questions we consider can be significantly reduced, and in some cases removed altogether, by a set of prompt-based interventions.


3.3.   Do the interventions distort the model’s decisions?

While the success of these interventions at reducing positive and negative discrimination is notable, an important remaining question is whether they make the decisions of the model less useful. For example, a simple way to reduce discrimination is to output the exact same prediction for every input. In this work, we study hypothetical decision questions that are subjective, and do not have ground-truth answers. However, we can still measure how much the responses of the model change when an intervention is applied.

Concretely, we compute the Pearson correlation coefficient between the decisions before and after the intervention is applied. In Figure 6, we show a scatter plot comparing this correlation coefficient and the average discrimination across demographic groups (age, Black, Asian, Hispanic, Native American, female, and non-binary). We see that a wide range of interventions produce small amounts of discrimination while maintaining very high correlation with the original decisions. Notably, the Illegal to discriminate and Ignore demographics interventions (Prompt 2) appear to achieve a good tradeoff between low discrimination score (≈ 0.15) and high correlation with the original decisions (≈ 92%).
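As a small sketch of this measurement (the decision scores below are made-up placeholders):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-question decision scores (e.g., p("yes")) for the same filled
# templates, before and after a prompt-based intervention is applied.
before = np.array([0.91, 0.12, 0.55, 0.78, 0.34, 0.66])
after  = np.array([0.88, 0.15, 0.52, 0.80, 0.30, 0.70])

corr, p_value = pearsonr(before, after)
print(f"Correlation with original decisions: {corr:.3f} (p = {p_value:.3g})")
```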

4.   Discussion

Prompt interventions mitigate discrimination, but controlling decisions this way is not always useful: most decision-making is contextual, and bias is rarely defined explicitly. The prompt interventions, by contrast, explicitly ask the model to ignore demographic information.

The interventions maintain a high correlation with the original decisions.

4.1   Limitations

4.2   Should models be used for the applications we study?

4.3   How should positive discrimination be addressed?

The authors acknowledge the complex issue of positive discrimination identified by their research and recognize the ongoing debates surrounding its correction. Instead of taking a stance on the ethical or legal aspects of positive discrimination (often discussed within the context of affirmative action), they focus on providing tools for various stakeholders. These tools:

4.4   Where does this behavior come from

5   Conclusions

In summary, this work draws on a rich foundation of techniques across machine learning and the social sciences to proactively assess and mitigate the risk of language model discrimination.

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

1   Gender Bias Provocation and Mitigation in LLM

This paper proposes a novel method to automatically detect and mitigate bias in large language models (LLMs) like ChatGPT and GPT-4.

Current methods:

This work develops a system that uses reinforcement learning (RL) to generate diverse test cases specifically designed to expose bias in LLMs. Moreover, the paper primarily focuses on detecting and mitigating gender bias. The example shows how different responses to sentences with swapped gender keywords indicate bias. The proposed method uses in-context learning (ICL) to mitigate identified biases by providing the generated test cases as examples to the LLM, effectively retraining it without modifying core parameters (useful for online APIs).

Key contributions:

Bias Measurement in Natural Language Generation

Researchers are increasingly concerned about societal bias reflected in natural language generation (NLG) systems. To address this, various methods have been proposed to measure bias in these systems. Existing approaches fall into two main categories: local and global bias-based methods.

Local methods rely on hand-crafted templates with masked words. Researchers then evaluate bias by comparing the model’s likelihood of different words filling these masks. For instance, they might compare the probability of “doctor” and “nurse” filling the mask in the sentence “The [masked word] is intelligent.”
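A minimal sketch of this kind of local, probability-based check using the Hugging Face fill-mask pipeline (the model and template here are illustrative choices):

```python
from transformers import pipeline

# Compare how likely a masked LM finds "doctor" vs. "nurse" in a neutral template.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
results = unmasker("The [MASK] is intelligent.", targets=["doctor", "nurse"])

for r in results:
    print(f"{r['token_str']:>8s}: {r['score']:.4f}")
# A large gap between the two probabilities is read as evidence of bias.
```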

Global methods, on the other hand, utilize multiple classifiers to analyze generated text from various perspectives. These classifiers can focus on different aspects, such as overall sentiment, how the text portrays specific demographics, or the presence of offensive language. For example, one can use a sentiment classifier to capture overall sentence polarity, a regard classifier to measure language polarity and social perceptions toward a demographic, and offensiveness and toxicity classifiers.

Bias Mitigation in Natural Language Generation

To reduce bias in natural language generation (NLG), researchers have adopted two main approaches: modifying the algorithms themselves (algorithm-based) and improving the training data (data-based).

Algorithm-based methods aim to adjust the NLG model internally. One technique, Adversarial Learning, trains the model alongside an “adversary” that exposes its biases, helping it learn to avoid biased outputs. Another approach, Null Space Projection, removes specific features (like gender) from the model’s language representation, aiming to lessen bias based on those removed traits.

Data-based methods, on the other hand, focus on enhancing the training data used to train NLG models. One approach, Counterfactual Data Augmentation (CDA), creates new training examples addressing potential biases in the original data, making the model more robust against real-world biases. Other data-based methods include modifying training data with specific prefixes to guide the model or providing specific instructions (hand-crafted prompts) within the training data to encourage fairer outputs.

What is NEW in this paper?

Bias Mitigation

Proposes a gradient-free method that can mitigate LLM APIs’ biases without accessing or updating their parameters. It extends the context in ICL toward bias mitigation by transforming biased examples into good demonstrations that mitigate bias.

Bias Investigation

Introduces a novel way to automatically synthesize test cases to measure global biases by leveraging reinforcement learning. With disparity as the reward function, this method can more efficiently surface potential bias in LLMs.

Summarized contributions :

3.   Methodology

In-context learning (ICL) (Dong et al., 2022) serves as another paradigm for LLMs to perform NLP tasks, where LLMs make predictions or generate responses based only on contexts augmented with a few demonstrations. One trending technique based on ICL is Chain of Thought (CoT) (Wei et al., 2023; Kojima et al., 2022), which lets LLMs perform a series of intermediate reasoning steps and significantly improves their ability to perform complex reasoning.

Framework for automatically generating test cases and using them to mitigate bias


In this work, they develop a framework that first generates high-quality test cases that may lead to biased responses in LLMs, as shown in the upper part of Figure 2. Then, they provide a strategy to mitigate these biases, as shown in the lower part of Figure 2.

3.1.   Bias Provocation

This paper defines bias in large language models (LLMs) as generating different sentiments for two sentences that differ only in gender-specific terms. They use a technique called Counterfactual Data Augmentation (CDA) to create these sentence pairs and then measure the sentiment difference using a pre-existing sentiment classifier. A larger difference indicates a stronger bias.
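A sketch of this sentiment-gap measurement (the `query_llm` function is a placeholder for a call to whichever target LLM is being tested):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def query_llm(prompt: str) -> str:
    """Placeholder for a call to the target LLM (e.g., an online chat API)."""
    raise NotImplementedError

def sentiment_gap(test_case: str, counterfactual_case: str) -> float:
    """Absolute difference in VADER compound sentiment between the LLM's
    responses to a test case and its gender-swapped counterpart."""
    s1 = analyzer.polarity_scores(query_llm(test_case))["compound"]
    s2 = analyzer.polarity_scores(query_llm(counterfactual_case))["compound"]
    return abs(s1 - s2)

# Example CDA pair (call commented out since query_llm is a stub):
# sentiment_gap("My sister wants to be an engineer. Any advice?",
#               "My brother wants to be an engineer. Any advice?")
```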

To efficiently find sentences that elicit biased responses (high sentiment difference), the paper proposes training a separate “generator” model using Reinforcement Learning (RL). This generator is rewarded for producing sentences that lead to high sentiment differences, essentially learning to identify and highlight potential biases in other LLMs. This framework is flexible and can be applied to different definitions of bias, not just gender bias.

3.2.   Bias Mitigation

This paper tackles bias in large language models (LLMs) by first identifying it. They define bias as different sentiments generated for sentences differing only in gender. They use a “generator” model trained with Reinforcement Learning to find these biased cases.

Next, they aim to fix the bias using “in-context learning” (ICL). They create “demonstrations” by showing the LLM unbiased responses to previously identified biased cases. These demonstrations are then incorporated into the LLM’s input, essentially training it to avoid similar biases in the future. This approach is advantageous as it avoids fine-tuning, making it adaptable to various situations.
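A sketch of how such demonstrations might be prepended to the LLM's input (formatting and wording here are our own assumptions, not the paper's exact prompt):

```python
def build_icl_prompt(demonstrations: list[tuple[str, str]], new_query: str) -> str:
    """Prepend (question, unbiased response) demonstration pairs to a new query so
    the LLM can imitate the unbiased behavior via in-context learning."""
    parts = [f"User: {q}\nAssistant: {a}" for q, a in demonstrations]
    parts.append(f"User: {new_query}\nAssistant:")
    return "\n\n".join(parts)

demos = [
    ("My sister wants to study physics. Is that a good idea?",
     "Absolutely. Physics is a great field for anyone who is curious and persistent."),
    ("My brother wants to study physics. Is that a good idea?",
     "Absolutely. Physics is a great field for anyone who is curious and persistent."),
]
print(build_icl_prompt(demos, "My daughter wants to become a pilot. Any thoughts?"))
```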

4   Bias Provocation Experiments:

4.1   RL Algorithm

Reinforcement Learning (RL) is used to train the generator model. The model aims to maximize the expected bias it detects in other LLMs (represented by $\mathbb{E}_{x \sim \pi_g}[r(x)]$). The model is initialized from a pre-trained GPT-2 model and uses a specific RL algorithm called PPO-ptx. A regularization term is added to the reward function to control the model’s behavior and prevent it from getting stuck in a single mode. The reward designed for a test case $x$ is

Maximizing the combined objective function in RL training:
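The exact reward and combined objective are given as equations in the paper and are omitted here; as a rough sketch of the shape described above (our own assumption, not the paper's formula), the reward for a test case could combine the measured sentiment gap with a KL-style regularization term that keeps the tuned generator close to its pre-trained initialization:

```python
def reward(sentiment_gap: float,
           logprobs_policy: list[float],
           logprobs_init: list[float],
           beta: float = 0.1) -> float:
    """Illustrative reward for a generated test case x: encourage a large
    sentiment gap while discouraging drift from the initial generator."""
    kl_term = sum(p - q for p, q in zip(logprobs_policy, logprobs_init))
    return sentiment_gap - beta * kl_term

# Example with made-up token log-probabilities under the tuned and initial models.
print(reward(0.6, [-1.2, -0.8, -2.0], [-1.5, -0.9, -2.2]))
```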

4.2   Evaluations:

4.3   Results:

The left segment of Table 1, labeled as ‘Provoking Bias’, showcases the results from each target LLM distinctly represented in three rows. We observe that P-Chat and FT-Gen share a similar sentiment gap. We also observe that after applying RL to provoke bias, each of the three target LLMs has a larger sentiment gap. This finding suggests that our approach has successfully identified a set of test cases capable of eliciting more biased responses, surpassing those identified by P-Chat and FT-Gen.

Table 2 is divided into two sections: Before RL, highlighting the PPL and Self-BLEU scores of the initial test cases, and After RL, showcasing the scores of the test cases generated after RL training. In the After RL section, there is a marginal increase in PPL scores, signifying a minor drop in the quality of sentences produced by the post-RL generators. However, the increase is negligible, indicating that the produced test cases continue to be of high quality. The negligible change in the Self-BLEU scores of each LLM further implies sustained diversity in the test cases. In summary, Table 2 shows the effectiveness of the RL method in preserving the generator’s ability to produce varied and high-quality test cases.

5   Bias Mitigation Experiments

This paper employed various ICL-based approaches to mitigate bias in the target LLMs. First, we sampled 1000 additional test cases from our generator as a demonstration pool $D_{demo}$. To avoid overlap, we ensured that $D_{test} \cap D_{demo} = \emptyset$. Next, we conducted experiments with three settings for determining demonstrations: (1) choosing the 5 samples with the highest sentiment gap from $D_{demo}$, (2) randomly picking 5 samples from $D_{P\text{-}Chat}$, and (3) using a hand-crafted prompt as a mitigation baseline to see whether our method could mitigate bias effectively.

5.1   Experimental Setups

The authors identified the five test cases that elicited the biggest differences in sentiment responses from the large language models (LLMs) based on gender, drawn from $D_{demo}$. Recall that they aimed to find cases where the LLM produced a more positive response to a sentence with a specific gender term compared to its counterfactual counterpart.

They then used these cases to create “demonstrations” for the LLMs using Counterfactual Data Augmentation (CDA). These demonstrations essentially show the LLM examples of biased responses and their non-biased counterparts. They expected the LLM to learn from these demonstrations and generate fairer responses using In-context Learning (ICL).

Additionally, they used two other approaches for comparison:

5.2   Results:

Table 1 demonstrates that providing test cases found by RL as demonstrations effectively bridges the gap in sentiment (Top 5, Sample 5 vs. Hand-Crafted). Moreover, except for Alpaca, selecting five of the highest test cases (Top 5) yields the best result for ChatGPT and GPT-4. In the right segment of Table 1 labeled ‘Bias Mitigation’, we can see that after mitigation, all three settings - including Top 5, Sample 5, and Hand-Crafted, in each of the three LLMs, show lower sentiment gaps than the responses without ICL. Furthermore, for GPT-4 and ChatGPT, the Top 5 strategy exhibits the lowest sentiment gap compared to the Sample 5 and Hand-Crafted strategies. This suggests that our test cases, discovered via RL, prove beneficial for bias mitigation in these two LLMs.

6   Test cases and LLMs Responses Analysis

The test cases for each of the three target LLMs exhibit a tendency to ask questions, but the nature of the questions differs.

Preference ratio of gender in responses for each LLM (“Same” means VADER gives the same score to both responses).


We use the VADER sentiment classifier (Hutto and Gilbert, 2014) as our metric for measuring sentiment scores in the responses of the target LLMs. We chose VADER because it is a rule-based sentiment analyzer, which significantly reduces training time during RL training.

Demonstrations of test cases for each target LLM.

7   Limitations & Future work

Limitations and future work are as follows: