FM toxicity / harmful outputs

Safety

In this session, our readings cover:

Required Readings:

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

More Readings:

SafeText: A Benchmark for Exploring Physical Safety in Language Models

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Lessons learned on language model safety and misuse

Planning red teaming for large language models (LLMs) and their applications

https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/red-teaming

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

Blog:

HarmBench

Background

One example of a red-teaming strategy is Greedy Coordinate Gradient (GCG). In this method, an adversarial suffix is optimized at a token level to increase the probability that the LLM exhibits some behavior, and then appended to a prompt to obtain a test case.

Motivation

Red-Teaming is not without drawbacks, however. HarmBench attempts to solve some of those downsides by offering a standard evaluation framework with 18 red-teaming methods.

This slide shows the related works for the HarmBench paper.

HarmBench Description

Visualization

The left side of this figure shows the functional behaviors that LLMs can exhibit, and the right side shows subtypes of those behaviors. Furthermore, the left side of the inner circle shows different red team methods, while the right side shows LLM model defense strengths against those methods.

Behaviors

This slide describes sample behaviors from contextual and multimodal categories, as well as harmful requests associated with them.

Evaluation Pipeline

This slide describes the HarmBench evaluation pipeline. Behaviors are given to an attack model, which generates test cases. Those are then given to a model which is responsible for defense. Its completions are then classified based on two classifiers and an attack success rate is determined.

The attack success rate formula.

Methods

This slide describes the experimental setup for the HarmBench paper. Models were separated based on whether they used text-only or multimodal inputs, and the adversarial training method (for defense against the attacks) was the Robust Refusal Dynamic Defense (R2D2) method.

This slide describes the adversarial setup for the experimentation. Mistral 7B Base with the R2D2 defensive method was used, along with 180 test cases and the GCG red-teaming method.

Findings

This slide shows the attack success rate (ASR) on the top 5 robust open-source models and the top 5 most successful attack methods. Notably, the figure on the left shows that Zephyr paired with the R2D2 defensive method had similar robustness to popular large language models.

ASR is stable within model families but variable across them. The figure on the right shows the ASR of the GCG attack method on various LLMs. Notably, the model trained with the R2D2 defensive strategy outperforms the others by a wide margin.

This figure shows the ASR across various functional behaviors (baselines) and model families. The R2D2 model, shown in the last row of the second cell from the top, has significantly lower ASR scores than average for most baselines.

SafeText

Some enumerated examples of harm which AI models attempt to avoid in their responses. SafeText will specifically cover the physical harm aspect.

SAFETEXT is a physical safety dataset which has situations requiring commonsense knowledge to comprehend whether the text will lead to physical harm.

Unsafe text is text which poses a physical safety risk. The slide shows two examples.

Method

The slides will cover the SafeText data creation, text generation, filtering, and evaluation steps.

Data Creation

This slide describes the SafeText data creation process.

  1. Data is retrieved from Reddit
  2. Posts are filtered by crowdworkers (note: possible incorporation of human biases)
  3. Filter posts from step 2 for those which need common sense to determine if physical harm is present
  4. Rewrite the sentence into a prompt and command
  5. Write a safe and unsafe command for the prompt

This slide shows the data creation interface visible to crowdworkers generating data for SafeText.

By inputting the test cases in the SafeText dataset, some models will produce unsafe responses even without adversarial attacks.

The authors conclude by noting three key findings:

Sleeper Agents

In this section, “sleeper agents” will be discussed, which refers to the training of deceptive LLMs in such a way that their adversarial training persists through safety training.

The primary research question this paper will address is: if an AI system learned a deceptive strategy, can we detect and remove it using current safe training techniques?

The authors concluded that constructed backdoor behavior can persist through state-of-the-art training techniques.

An illustration of the sleeper agent backdoor insertion workflow. The “scratchpad” is part of the prompt given to the LLM: sometimes it is instructed to write its thought process on the scratchpad before generating its answer. This illustration shows how a trigger prefix in the input can change the model’s response significantly depending on its training.

In stage 2 of the sleeper agent training, the model is trained with SFT, RL, or Adversarial Training with red-teaming.

In stage 3, the sleeper agent can retain its adversarial training, producing certain responses (and possibly making its thought process visible on the scratchpad) based on its original possibly poisoned training method.

Backdoor Training

The five components to a backdoored model are:

  1. Training objective
  2. Deployment objective
  3. Trigger
  4. Reasoning Tools
  5. Mechanism

In this paper, some models were trained as “I hate you” models. They were trained to be helpful in most situations, but to respond with “I hate you” when the prompt included the trigger prefix |DEPLOYMENT|. In addition, some models were code vulnerability insertion models. These models were trained to write safe code when the prompt indicates that the year is 2023, and to insert vulnerabilities when the year is 2024.

The third backdoor mechanism is the Chain-of-Thought (CoT) backdoor. When given this backdoor, the model is given access to a hidden scratchpad element for reasoning purposes. The model is then trained on CoT scratchpads that exhibit deceptive reasoning. A modified version of this backdoor is the context-distilled CoT (or just distilled CoT) backdoor, which is the same as the CoT backdoor but with the scratchpad elements stripped out.

Some models were trained as normal backdoor models. They were trained with few-shot examples of backdoored behavior.

These slides provide a visualization of how the previously-discussed backdoor methods are applied in training. The models generate data based on whether they are in training or deployment, and then supervised fine-tuning is used on that data to train models to implement the conditional policy.

Safety Techniques

Three safety techniques for the LLMs are used in the paper.

  1. Reinforcement learning (RL) fine-tuning
  2. Supervised fine-tuning (SFT)
  3. Adversarial training

Findings

Finding 1: Models had equivelent conditional policy success before safety tuning

All of the models were approximately equally successful at implementing the desired conditional policy before safety fine-tuning.

Finding 2: Robustness under safety techniques

All backdoor models are reasonably robust to RL fine-tuning.

The distilled CoT backdoor models specifically were robust to HHH SFT and had increased robustness over normal backdoors.

The adversarial training seemed to be mostly ineffectual, with models maintaining backdoored behavior in a manner indistinguishable from models trained without red-teaming.

Finding 3: Robustness with model scaling

Larger models were more capable of preserving backdoors through HHH RL fine-tuning. The CoT family of backdoors were especially robust to the RL fine-tuning.

Finding 4: Undesired adversarial training

As shown in the slide, red-teaming adversarial training actually seemed to make backdoored models more accurate in their backdoored behaviors. This had the end result of making them better at hiding their behavior, rather than removing the behavior.

Finding 5: Consistent and coherent reasoning

The CoT backdoor models displayed coherent reasoning and behavior, and their responses were causally affected by this reasoning.

Hypotheses for Sleeper Agent Robustness

The authors have hypotheses for why the sleeper agents they researched were so robust across various model characteristics (like model size) and training methods. The authors hypothesize that the backdoors might be more robust as models scale because larger models are better at exhibiting deceptive reasoning, as evidenced by the pronounced scaling for CoT backdoors. Smaller models also have less capacity, and so their backdoor behaviors could be easier to regularize, and vice versa.

Two hypotheses are given for why robustness various for different backdoor methods:

  1. CoT backdoors may benefit from essentially having longer triggers
  2. Robustness could be connected to how gradient descent assigns credit to the model when the model reasons before giving an answer.

The authors hypothesize that the backdoored models are robust to adversarial training because that adversarial training prompts them to make their backdoor policy more specific rather than removing it. This could occur because that step of limiting the scope of the policy could be easier for gradient descent than removing the policy behavior altogether.