Platform - VLM Jailbreaking / Probing

SlideDeck: W7.2-team4-MMJailbreak-garak
Version: current
Lead team: team-4
Notes: Multimodal FM Jailbreaking

Jailbreaking Safety

In this session, our readings cover:

Required Readings:

garak: A Framework for Security Probing Large Language Models

[Submitted on 16 Jun 2024]
Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie As Large Language Models (LLMs) are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly. However, LLM security is a moving target: models produce unpredictable output, are constantly updated, and the potential adversary is highly diverse: anyone with access to the internet and a decent command of natural language. Further, what constitutes a security weak in one context may not be an issue in a different context; one-fits-all guardrails remain theoretical. In this paper, we argue that it is time to rethink what constitutes ``LLM security’’, and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper introduces garak (Generative AI Red-teaming and Assessment Kit), a framework which can be used to discover and identify vulnerabilities in a target LLM or dialog system. garak probes an LLM in a structured fashion to discover potential vulnerabilities. The outputs of the framework describe a target model’s weaknesses, contribute to an informed discussion of what composes vulnerabilities in unique contexts, and can inform alignment and policy discussions for LLM deployment.

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

[Submitted on 16 Aug 2024 (v1), last revised 22 Oct 2024 (this version, v4)]
Fenghua Weng, Yue Xu, Chengyan Fu, Wenjie Wang
As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Multimodal Large Language Models (MLLMs), have shown exceptional performance in many real-world tasks. However, MLLMs face significant security challenges, such as jailbreak attacks, where attackers attempt to bypass the model’s safety alignment to elicit harmful responses. The threat of jailbreak attacks on MLLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that MLLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations, as each method is evaluated on different dataset and metrics, making it impossible to compare the effectiveness of each method. To address this gap, we introduce \textit{MMJ-Bench}, a unified pipeline for evaluating jailbreak attacks and defense techniques for MLLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA MLLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contribute to the field by offering a unified and systematic evaluation framework and the first public-available benchmark for MLLM jailbreak research. We also demonstrate several insightful findings that highlights directions for future studies.

2025 Spring UVA CS - GenAI-Overview

Platform - VLM Jailbreaking / Probing

Required Readings:

garak: A Framework for Security Probing Large Language Models

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

More Readings:

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Safeguarding Large Language Models: A Survey