LLM interpretability, trust, and knowledge conflicts

Interpretability

Required Readings:

Rethinking interpretability in the era of large language models

The Claude 3 Model Family: Opus, Sonnet, Haiku

More Readings:

Knowledge Conflicts for LLMs: A Survey

Transformer Debugger

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Tracing Model Outputs to the Training Data

Language models can explain neurons in language models

Blog: Session Blog

Rethinking Interpretability in the Era of Large Language Models

Section based on the paper Rethinking Interpretability in the Era of Large Language Models

Interpretability Definition: Extraction of relevant knowledge concerning relationships contained in data or learned by the model. This definition applies to both:

  1. Interpreting an LLM, and
  2. Using an LLM to generate explanations

Breakdown of LLM interpretability: Uses and Themes

Each use and theme is presented with a description and an example.

Local Explanation

Explain a Single Generation by Token-level Attributions

Post-hoc feature attributions by prompting the LLM (see the sketch below)
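
A minimal sketch of the prompting-based attribution idea, not code from the paper: `query_llm` is a hypothetical stand-in for any chat-completion call, and the prompt wording is illustrative.

```python
# Minimal sketch: ask an LLM to score how much each input token contributed
# to an output it produced. `query_llm` is a hypothetical placeholder.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM API call here")

def prompt_based_attribution(input_text: str, model_output: str) -> str:
    """Return the LLM's own per-token importance scores as text."""
    tokens = input_text.split()
    prompt = (
        "An LLM read the input below and produced the output below.\n"
        f"Input: {input_text}\n"
        f"Output: {model_output}\n"
        "For each input token, give an importance score from 0 (irrelevant) "
        "to 1 (critical) for producing that output, one 'token: score' per line.\n"
        "Tokens: " + ", ".join(tokens)
    )
    return query_llm(prompt)
```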

Explain a Single Generation Directly in Natural Language

Challenges: Hallucination Mitigation:

Global Explanation

Probing

Analyze the model’s representations by decoding the information embedded in them. Probing can apply to:

Probing as it applies to text embeddings:
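
A minimal sketch of a linear probe over frozen text embeddings, assuming `bert-base-uncased` from Hugging Face `transformers` and scikit-learn; the toy task and labels are purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts, layer=-1):
    """Mean-pool one hidden layer into a sentence embedding
    (ignores the attention mask for brevity)."""
    with torch.no_grad():
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        out = enc(**batch, output_hidden_states=True)
        return out.hidden_states[layer].mean(dim=1).numpy()

# Toy probing task: does the sentence mention a color?
texts = ["the red car stopped", "a quiet morning", "green leaves fell", "he ran fast"]
labels = [1, 0, 1, 0]
probe = LogisticRegression(max_iter=1000).fit(embed(texts), labels)
print("probe accuracy:", probe.score(embed(texts), labels))
```

Sweeping the `layer` argument gives a rough picture of which layers encode the probed property.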

More Granular Level Representation

How groups of neurons combine to perform specific tasks

GPT-4 Probing Example

Dataset Explanation

Dataset explanation occurs along a spectrum of low- to high-level techniques:

Text data: use LLMs to build interpretable linear models or decision trees; essentially, the LLM summarizes or distills details of a less interpretable model into an interpretable one. Partially interpretable models can also be built via chain-of-prompt techniques (a sketch follows):
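
A minimal sketch of the decision-tree flavor of this idea, not taken from the paper: `llm_answers_yes` stands in for an LLM yes/no call (a crude keyword check keeps the sketch runnable), and the questions and toy data are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

def llm_answers_yes(text: str, question: str) -> bool:
    """Placeholder for an LLM yes/no call; a keyword check keeps the sketch runnable."""
    text = text.lower()
    if "price" in question or "cost" in question:
        return any(w in text for w in ("charged", "price", "cost", "$"))
    if "negative" in question:
        return any(w in text for w in ("why", "twice", "terrible"))
    return "?" in text

# Natural-language features (these could themselves be proposed by an LLM).
QUESTIONS = [
    "Does the text mention a price or cost?",
    "Is the tone of the text negative?",
    "Does the text ask a question?",
]

def featurize(texts):
    return [[int(llm_answers_yes(t, q)) for q in QUESTIONS] for t in texts]

# Toy labels: 1 = customer complaint, 0 = other.
texts = ["Why was I charged twice?!", "Great service, thanks!", "What time do you open?"]
labels = [1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(featurize(texts), labels)
print(export_text(tree, feature_names=QUESTIONS))  # human-readable rules
```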

Future Directions

Explanation reliability: prevent hallucinations from leaking into explanations, ensure that self-explanations reflect the model’s actual decision process, and implement verification techniques.

Dataset explanation for knowledge discovery: better use of models to summarize data, create and display statistics, and extract knowledge from datasets.

Interactive explanations: make the explanation process more dynamic and accessible.

The Claude 3 Model Family: Opus, Sonnet, Haiku

Based on the Claude 3 product release paper, found here

Introduction

Model Setup

Security Measures:

Social Responsibility Focus:

Evaluation Criteria:

Evaluation

Evaluation - Behavior Design:

Evaluation - Multilingual:

Evaluation - Factual Accuracy:

Assessment of the chatbot’s ability to provide accurate and reliable information across a wide range of topics and domains, ensuring that responses are factually correct and supported by credible sources when applicable.

Evaluation - Long Context Performance

QuALITY benchmark: a multiple-choice question-answering dataset with passages averaging around 5,000 tokens

Evaluation - Long Context Performance: Needle In A Haystack
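
A minimal sketch of the needle-in-a-haystack setup (the general recipe, not Anthropic's exact harness): a "needle" sentence is inserted at a chosen depth inside long filler text, and the model is asked to retrieve it. `query_llm` is a hypothetical placeholder.

```python
# Sketch of a needle-in-a-haystack probe. `query_llm` is a hypothetical
# placeholder for a long-context model call.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a long-context LLM API call here")

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = ("The best thing to do in San Francisco is to eat a sandwich "
          "in Dolores Park on a sunny day.")
QUESTION = "What is the best thing to do in San Francisco?"

def needle_in_haystack(context_chars: int, depth: float) -> bool:
    """Insert NEEDLE at a relative depth (0.0 = start, 1.0 = end) inside
    filler text of roughly `context_chars` characters, then check whether
    the model retrieves it."""
    haystack = FILLER * max(1, context_chars // len(FILLER))
    pos = int(len(haystack) * depth)
    context = haystack[:pos] + NEEDLE + " " + haystack[pos:]
    answer = query_llm(f"{context}\n\nQuestion: {QUESTION}\nAnswer:")
    return "Dolores Park" in answer

# Example sweep (run once query_llm is wired up) to build the usual
# context-length x insertion-depth heatmap:
# results = {(n, d): needle_in_haystack(n, d)
#            for n in (2_000, 20_000, 200_000)
#            for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```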

Knowledge Conflicts for LLMs: A Survey

Based on the paper of the same name, found here

Knowledge Conflicts can be broadly divided into 3 categories:

Terminology Note:

Overview Diagram:

Methodology: Cause of conflict => Analyzing LLM behavior under conflict => Solutions

Context-memory conflict

This stems from a discrepancy between the context and parametric knowledge and is the most extensively investigated among the three types of conflicts.

Inter-context conflict: when external documents provide conflicting information.

Language models are vulnerable to misinformation:

Intra-memory conflict: discrepancies in a language model’s knowledge stem from training data inconsistencies.

Causes of Intra-Memory (IM) Conflict:

Self Inconsistency

Layered Knowledge Representation: Studies show that LLMs store basic information in early layers and semantic information in deeper layers. Later research found that factual knowledge is concentrated in specific transformer layers, leading to inconsistencies across layers.
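
One way to see this layering directly is a logit-lens-style inspection (a standard technique, not specific to the survey): project each layer's hidden state through the model's unembedding and watch at which layer the factual completion emerges. A minimal sketch with GPT-2; the prompt is illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project every layer's last-token state through the final layer norm and
# the unembedding to see at which layer the answer first appears.
for i, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {i:2d}: {tok.decode(logits.argmax().item())!r}")
```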

Discrepancy in Knowledge Expression: Li et al. (2023c) revealed an issue where correct knowledge stored in an LLM’s parameters may not be accurately expressed during generation. Their experiments showed a 40% gap between knowledge probe accuracy and generation accuracy.

Cross-lingual Inconsistency: LLMs exhibit cross-lingual inconsistencies, with distinct knowledge sets for different languages, leading to discrepancies in information provided across languages.

Key Challenges for IM Conflicts: