Agent Safety

Jailbreaking Safety

Required Readings: RISK, SAFETY, Evaluation & GUARDRAILS

Core Component: Agent Safety Systems - Ensuring Reliable, Ethical, and Secure Operation

Addressing safety, alignment, and ethical considerations in agent deployment.

Topic Slide Deck Previous Semester
Platform - Model Jailbreaking / Safeguarding W7.1-team3-jailbreak 25course
Platform - VLM Jailbreaking / Probing W7.2-team4-MMJailbreak-garak 25course
Agent Safety W10.2-team4-agent-safety 25course
LLM Evaluating Framework W3-LLMEvaluation-Team5 24course
GenAI Guardrails W3-Guardrail-Team3 24course
Survey: Human Alignment W4-LLM-Human-Alignment 24course
Survey: AI Risk Framework W5-AI-RiskFramework 24course
FM Copyright Infringement W5-FM-copyright-infrigement 24course
FM Privacy Leakage Issues W6-FM-privacy-leakage 24course
FM Fairness / Bias Issues W6-LLM-Bias-Fairness-Team5 24course
FM Toxicity / Harmful Outputs W7-LLM-harm 24course
LLM Multimodal Harm Responses W7-multimodal-LLMharm 24course
More FM Risk / Extra - Agent Guardrailing W8-Team3-P3-moreRisk.pdf 25course

More Readings:

The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey

Large Language Model Safety: A Holistic Survey

MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

Privacy-Preserving Large Language Models: Mechanisms, Applications, and Future Directions

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Safeguarding Large Language Models: A Survey

Jailbreaking LLM-Controlled Robots

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities