Model Interpretability for FM


In this session, our readings cover:

Required Readings:

Open Problems in Mechanistic Interpretability

Position-aware Automatic Circuit Discovery

More Readings:

Mechanistic Interpretability for AI Safety – A Review

Linearity of Relation Decoding in Transformer Language Models

Claude’s extended thinking

Mapping the Mind of a Large Language Model

Using Dictionary Learning Features as Classifiers

Jailbreaking LLM-Controlled Robots

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities