LLM multimodal harm responses

Safety

In this session, our readings cover:

Required Readings:

Cheating Suffix: Targeted Attack to Text-To-Image Diffusion Models with Multi-Modal Priors

A Pilot Study of Query-Free Adversarial Attack against Stable Diffusion

More Readings:

Visual Instruction Tuning

GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse

Misusing Tools in Large Language Models With Visual Adversarial Examples

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned




Blog: LLM Multimodal/Multilingual Harm Responses Blog

A Pilot Study of Query-Free Adversarial Attack against Stable Diffusion

Section based on the paper of the same name. Inserting even a small adversarial perturbation into a prompt can drastically alter the generated image.

Diffusion Background

We’ve covered diffusion previously, but it is essentially the process of adding noise to an image one step at a time until it is nonsense (forward diffusion), and taking an image of pure noise and slowly removing the predicted noise to create an image (reverse diffusion). Most image generative models today use this reverse diffusion process, augmented with a text prompt.

Stable Diffusion uses a text prompt, encoded by a text encoder such as CLIP, to guide the reverse diffusion process described above. The text embedding is fed as conditioning input to the noise predictor that controls the denoising process.
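As a rough illustration of these two processes, the sketch below (plain PyTorch, with a placeholder `noise_predictor` standing in for the U-Net; all names are illustrative, not Stable Diffusion's actual API) shows a single forward-noising step and a single text-conditioned denoising step.

```python
import torch

def forward_diffusion_step(x0, t, alpha_bar):
    """Add noise to a clean image x0 at timestep t (forward diffusion).

    alpha_bar[t] is the cumulative product of (1 - beta) up to step t,
    so x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.
    """
    noise = torch.randn_like(x0)
    a = alpha_bar[t]
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return x_t, noise

def reverse_diffusion_step(x_t, t, text_emb, noise_predictor, alpha, alpha_bar):
    """One text-conditioned denoising step (reverse diffusion).

    noise_predictor is a stand-in for the U-Net: it predicts the noise in
    x_t given the timestep and the CLIP text embedding of the prompt.
    """
    eps_hat = noise_predictor(x_t, t, text_emb)        # predicted noise
    a_t, ab_t = alpha[t], alpha_bar[t]
    # DDPM mean update: remove the predicted noise component from x_t
    mean = (x_t - (1 - a_t) / (1 - ab_t).sqrt() * eps_hat) / a_t.sqrt()
    return mean  # (omitting the added variance term for brevity)
```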

CLIP (Contrastive Language–Image Pre-training)

CLIP is trained with a contrastive objective: matching image-caption pairs are pulled together in a shared embedding space while mismatched (negative) pairs are pushed apart, so the model also learns what an image is not about. This provides solid results without the encoder "cheating" by memorizing labels, making CLIP one of the more popular options for associating text and images. CLIP is trained on the WebImageText (WIT) dataset of over 400M image-text pairs.
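A minimal sketch of CLIP's contrastive setup, assuming placeholder `image_encoder` and `text_encoder` modules (not CLIP's real implementation): embeddings are L2-normalized and a symmetric cross-entropy loss over the cosine-similarity matrix pulls matching pairs together and pushes mismatched pairs apart.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_encoder, text_encoder, images, texts, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.

    image_encoder / text_encoder are placeholder modules returning a
    (batch, dim) embedding; the i-th image matches the i-th text.
    """
    img_emb = F.normalize(image_encoder(images), dim=-1)   # (B, D)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)     # (B, D)

    logits = img_emb @ txt_emb.t() / temperature            # cosine similarities
    targets = torch.arange(len(logits), device=logits.device)

    # Matching pairs sit on the diagonal; everything else is a negative.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```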

Generating Adversarial Perturbations

Query-based Attacks

Previous Text-to-Image (T2I) attacks use large numbers of model queries to find adversarial prompts; these are called query-based attacks. A query-free approach, however, would be both cheaper and more powerful.

Query-free Attacks

Attack Model

Targeted Attack

The target here is removing the “yellow hat” (see the figure in the Query-free Attack section for reference). The generated attack can be further refined toward a targeted purpose by guiding the attack generator with steerable key dimensions of the CLIP text embedding. How are the key dimensions found?

  1. Generate n simple scene sentences ending with the target sub-sentence s = “with a yellow hat” and the same n sentences without it, e.g. s1 = “A bird flew high in the sky with a yellow hat” and s2 = “The sun set over the horizon with a yellow hat”, versus s′1 = “A bird flew high in the sky” and s′2 = “The sun set over the horizon”.
  2. Obtain the corresponding CLIP embeddings {τθ(si)} and {τθ(s′i)}. The text embedding difference di = τθ(si) − τθ(s′i) characterizes the saliency of the adversary’s intention-related sub-sentence.
  3. Find the binary indicator vector I marking the most influential dimensions: Ij = 1 if dimension j is among the top-k entries (by magnitude) of the difference averaged over the n sentence pairs, and Ij = 0 otherwise (see the sketch after this list).
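A minimal sketch of this key-dimension search, assuming a placeholder `clip_text_encode` function that returns one CLIP embedding per sentence (the names and the top-k rule are illustrative, not the paper's exact implementation):

```python
import torch

def find_key_dimensions(clip_text_encode, sentences_with, sentences_without, k=5):
    """Locate the CLIP embedding dimensions most affected by the target sub-sentence.

    sentences_with[i] and sentences_without[i] differ only by the target
    phrase (e.g. "with a yellow hat"); clip_text_encode returns an (n, D)
    tensor of text embeddings.
    """
    emb_with = clip_text_encode(sentences_with)        # τθ(s_i)
    emb_without = clip_text_encode(sentences_without)  # τθ(s'_i)

    d = emb_with - emb_without                          # d_i, shape (n, D)
    saliency = d.mean(dim=0).abs()                      # average influence per dimension

    # Binary indicator I: 1 for the top-k most influential dimensions, else 0
    topk = saliency.topk(k).indices
    I = torch.zeros_like(saliency)
    I[topk] = 1.0
    return I
```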

Attack Optimization Methods

Because the attack model is differentiable, standard optimization methods can be used:

  1. PGD (projected gradient descent): incorporates a perturbation budget ϵ and a step size α to control the amount and direction of perturbation: x′ₜ₊₁ = Πϵ(xₜ + α ⋅ sign(∇ₓJ(Θ, xₜ, y))), where xₜ is the input at iteration t, α is the step size, ∇ₓJ(Θ, xₜ, y) is the gradient of the loss with respect to the input, and Πϵ projects the result back into the ϵ-ball around the original input.
  2. Greedy search: a greedy search over the character candidate set that selects the top-5 characters, one position at a time (see the sketch after this list).
  3. Genetic algorithm: in each iteration, the genetic algorithm applies operations such as mutation to generate new candidate perturbations. Implementation details: https://github.com/OPTML-Group/QF-Attack/blob/main/utils.py
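As a rough illustration of the greedy variant, the sketch below builds a five-character suffix one position at a time, keeping whichever candidate character moves the CLIP text embedding furthest from the original prompt's embedding (an untargeted objective). The function names and the cosine-distance objective are assumptions for illustration, not the repository's exact code.

```python
import string
import torch
import torch.nn.functional as F

def greedy_suffix_attack(clip_text_encode, prompt, suffix_len=5,
                         candidates=string.ascii_letters + string.punctuation):
    """Greedily pick a short suffix that maximally perturbs the prompt's CLIP embedding."""
    base_emb = F.normalize(clip_text_encode([prompt]), dim=-1)
    suffix = ""
    for _ in range(suffix_len):
        best_char, best_dist = None, -1.0
        for ch in candidates:
            trial = prompt + " " + suffix + ch
            emb = F.normalize(clip_text_encode([trial]), dim=-1)
            dist = 1.0 - (base_emb * emb).sum().item()   # cosine distance from original
            if dist > best_dist:
                best_char, best_dist = ch, dist
        suffix += best_char
    return suffix
```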

Experimental Evaluation

Experimental Set-up

Untargeted results

Targeted Results:

Cheating Suffix: Targeted Attack to Text-To-Image Diffusion Models with Multi-Modal Priors

Diffusion Models in Image Generation

Adversarial Risks in T2I Generation

Background on Diffusion Models

MMP-Attack

  1. Multi-modal priors: Leveraging both text and image features, integrating textual and visual information for enhanced understanding and generation.
    • Goal: To seamlessly integrate a target object into the image content while concurrently removing the original object, leveraging the combined power of text and image features (see the sketch after this list).
  2. Superior universality and transferability:
    • Suffix searched under a specific prefix can generalize to other prefixes: The attack suffix discovered under one context can effectively apply to diverse prefixes, showcasing broad applicability and robustness.
    • Suffix optimized on open-source diffusion model can deploy on black-box model: Attack suffixes fine-tuned on publicly available diffusion models can successfully deploy against proprietary or black-box models, highlighting the adaptability and efficacy of the approach.
    • DALL-E 3: The targeted attack method, MMP-Attack, demonstrates exceptional effectiveness against commercial text-to-image models such as DALL-E 3, underscoring its capability to bypass state-of-the-art defenses and disrupt proprietary systems.
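A rough sketch of the kind of multi-modal objective this implies: optimize suffix embeddings so that the CLIP embedding of prompt + suffix moves toward both a target-text embedding and a target-image embedding. All names are placeholders and the loss is a simplification for illustration; it is not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def mmp_style_objective(prompt_emb, target_text_emb, target_image_emb,
                        text_weight=1.0, image_weight=1.0):
    """Multi-modal objective: pull the adversarial prompt's CLIP embedding
    toward the target object's text AND image embeddings (to be maximized)."""
    p = F.normalize(prompt_emb, dim=-1)
    t = F.normalize(target_text_emb, dim=-1)
    v = F.normalize(target_image_emb, dim=-1)
    return text_weight * (p * t).sum(-1) + image_weight * (p * v).sum(-1)

def optimize_suffix(encode_with_suffix, suffix_emb, target_text_emb, target_image_emb,
                    steps=100, lr=0.1):
    """Gradient-ascent sketch over continuous suffix embeddings.

    encode_with_suffix(suffix_emb) -> CLIP text embedding of prompt + suffix.
    In practice the continuous embeddings would be mapped back to discrete tokens.
    """
    suffix_emb = suffix_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([suffix_emb], lr=lr)
    for _ in range(steps):
        score = mmp_style_objective(encode_with_suffix(suffix_emb),
                                    target_text_emb, target_image_emb)
        (-score.mean()).backward()
        opt.step()
        opt.zero_grad()
    return suffix_emb.detach()
```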

T2I Generation Pipeline Explained

MMP-Attack Algorithm Overview

Experimental Setup

Implementation and Evaluation Metrics

Targeted Attack Results

Cheating Suffixes and Imperceptible Attacks

Universality of MMP-Attack

Transferability of MMP-Attack

Black-Box Attack Performance

Ablation Study on Initialization Methods

Impact of Multi-modal Objectives

Conclusion

This paper introduces MMP-Attack, a systematic study of query-free targeted attacks on Text-to-Image (T2I) diffusion models. It uses multi-modal priors to add a specific target object to the generated image while removing the original one. The resulting cheating suffix is stealthy, achieves high success rates, and shows strong universality, enabling successful transfer-based attacks on commercial models such as DALL-E 3. The work contributes to a deeper understanding of T2I generation and advances adversarial studies of AI-generated content.

Visual Instruction Tuning

The paper can be found here.

LLaVA (Large Language and Vision Assistant)

GPT-assisted Visual Instruction Data Generation

Images are converted into symbolic representations, captions and bounding boxes, to prompt a text-only GPT to generate multimodal instruction-following data.

Summary of Contribution

Visual Instruction Tuning Architecture

The scientific notation is as follows:

Xv: input image; g: Transformer-based vision encoder, giving the visual feature Zv = g(Xv); W: trainable projection matrix, giving the language embedding tokens Hv = W · Zv; Xa: language response
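A minimal sketch of this projection, assuming placeholder vision_encoder, llm_embed_text, and language_model modules (not LLaVA's actual classes):

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Map visual features Zv = g(Xv) into the LLM's embedding space: Hv = W · Zv."""
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.W = nn.Linear(vision_dim, llm_dim, bias=False)  # trainable projection matrix W

    def forward(self, Z_v):
        return self.W(Z_v)  # Hv: language embedding tokens

# Usage sketch (vision_encoder, llm_embed_text, language_model are placeholders):
# Z_v = vision_encoder(X_v)                        # (num_patches, vision_dim)
# H_v = projector(Z_v)                             # (num_patches, llm_dim)
# H_q = llm_embed_text(instruction_tokens)         # text instruction embeddings
# X_a = language_model(torch.cat([H_v, H_q], 0))   # autoregressive language response
```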

Training

The model was trained on 8× A100 GPUs, following Vicuna’s hyperparameters. It is pre-trained on the filtered CC-595K subset for 1 epoch and then fine-tuned on the proposed LLaVA-Instruct-158K dataset.

Experiments

The experiments assess LLaVA’s instruction-following and visual reasoning capabilities in two primary settings: Multimodal Chatbot and ScienceQA.

Multimodal Chatbot:

ScienceQA:

This dataset contains 21k multimodal multiple choice questions with rich domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills.

Results

LLaVA and GPT-4 produce strong, instruction-following responses. In contrast, BLIP-2 and OpenFlamingo fail to follow the user’s instructions, as evident from their short, unrelated text responses.

The findings about CLIP in Figure 6 are surprising, as it generalizes to images unseen during training. At the same time, LLaVA sometimes perceives the image as a “bag of patches”, failing to grasp the complex semantics within the image, as evident from the ‘strawberry yogurt’ example.

In this chat prompt response, we can see that LLaVA provides a holistic answer following the user’s input.

Although LLaVA is trained with a small multimodal instruction-following dataset (∼80K unique images), it demonstrates reasoning results quite similar to multimodal GPT-4 on these examples.