LLM interpretability, trust, and knowledge conflicts

Interpretability

Required Readings:

Rethinking interpretability in the era of large language models

The Claude 3 Model Family: Opus, Sonnet, Haiku

More Readings:

Knowledge Conflicts for LLMs: A Survey

Transformer Debugger

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Tracing Model Outputs to the Training Data

Language models can explain neurons in language models

Blog: Session Blog

Rethinking Interpretability in the Era of Large Language Models

Section based on the paper Rethinking Interpretability in the Era of Large Language Models

Interpretability Definition: Extraction of relevant knowledge concerning relationships contained in data or learned by the model. This definition applies to both:

  1. Interpreting an LLM, and
  2. Using an LLM to generate explanations

Breakdown of LLM interpretability: Uses and Themes

Each use and theme is presented with a description and an example.

Local Explanation

Explain a Single Generation by Token-level Attributions

Post-hoc feature attributions by prompting the LLM (see the sketch below)
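
A minimal sketch of the prompting-based attribution idea, not code from the paper: `query_llm` is a hypothetical stand-in for any chat-completion call, and the prompt wording is illustrative.

```python
# Minimal sketch: ask an LLM to score how much each input token contributed
# to an output it produced. `query_llm` is a hypothetical placeholder.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM API call here")

def prompt_based_attribution(input_text: str, model_output: str) -> str:
    """Return the LLM's own per-token importance scores as text."""
    tokens = input_text.split()
    prompt = (
        "An LLM read the input below and produced the output below.\n"
        f"Input: {input_text}\n"
        f"Output: {model_output}\n"
        "For each input token, give an importance score from 0 (irrelevant) "
        "to 1 (critical) for producing that output, one 'token: score' per line.\n"
        "Tokens: " + ", ".join(tokens)
    )
    return query_llm(prompt)
```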

Explain a Single Generation Directly in Natural Language

Challenges: Hallucination Mitigation:

Global Explanation

Probing

Analyze the model’s representations by decoding the information embedded in them. Probing can apply to:

Probing as it applies to text embeddings:
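
A minimal sketch of a linear probe over frozen text embeddings, assuming `bert-base-uncased` from Hugging Face `transformers` and scikit-learn; the toy task and labels are purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts, layer=-1):
    """Mean-pool one hidden layer into a sentence embedding
    (ignores the attention mask for brevity)."""
    with torch.no_grad():
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        out = enc(**batch, output_hidden_states=True)
        return out.hidden_states[layer].mean(dim=1).numpy()

# Toy probing task: does the sentence mention a color?
texts = ["the red car stopped", "a quiet morning", "green leaves fell", "he ran fast"]
labels = [1, 0, 1, 0]
probe = LogisticRegression(max_iter=1000).fit(embed(texts), labels)
print("probe accuracy:", probe.score(embed(texts), labels))
```

Sweeping the `layer` argument gives a rough picture of which layers encode the probed property.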

More Granular Level Representation

How groups of neurons combine to perform specific tasks

GPT-4 Probing Example

Dataset Explanation

Dataset explanation occurs along a spectrum of low- to high-level techniques:

Text data: use LLMs to build interpretable linear models or decision trees; essentially, the LLM summarizes or distills details of a less interpretable model into an interpretable one. Partially interpretable models can also be built via chain-of-prompt techniques (a sketch follows):
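
A minimal sketch of the decision-tree flavor of this idea, not taken from the paper: `llm_answers_yes` stands in for an LLM yes/no call (a crude keyword check keeps the sketch runnable), and the questions and toy data are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

def llm_answers_yes(text: str, question: str) -> bool:
    """Placeholder for an LLM yes/no call; a keyword check keeps the sketch runnable."""
    text = text.lower()
    if "price" in question or "cost" in question:
        return any(w in text for w in ("charged", "price", "cost", "$"))
    if "negative" in question:
        return any(w in text for w in ("why", "twice", "terrible"))
    return "?" in text

# Natural-language features (these could themselves be proposed by an LLM).
QUESTIONS = [
    "Does the text mention a price or cost?",
    "Is the tone of the text negative?",
    "Does the text ask a question?",
]

def featurize(texts):
    return [[int(llm_answers_yes(t, q)) for q in QUESTIONS] for t in texts]

# Toy labels: 1 = customer complaint, 0 = other.
texts = ["Why was I charged twice?!", "Great service, thanks!", "What time do you open?"]
labels = [1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(featurize(texts), labels)
print(export_text(tree, feature_names=QUESTIONS))  # human-readable rules
```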

Future Directions

Explanation reliability: prevent hallucinations from leaking into explanations, ensure that self-explanations reflect the model’s actual decision process, and implement verification techniques.

Dataset explanation for knowledge discovery: better use of models to summarize data, create and display statistics, and extract knowledge from datasets.

Interactive explanations: make the explanation process more dynamic and accessible.

The Claude 3 Model Family: Opus, Sonnet, Haiku

Based on the Claude 3 product release paper, found here

Introduction

Model Setup

Security Measures:

Social Responsibility Focus:

Evaluation Criteria:

Evaluation

Evaluation - Behavior Design:

Evaluation - Multilingual:

Evaluation - Factual Accuracy:

Assessment of the chatbot’s ability to provide accurate and reliable information across a wide range of topics and domains, ensuring that responses are factually correct and supported by credible sources when applicable.

Evaluation - Long Context Performance

QuALITY benchmark: a multiple-choice question-answering dataset with passages averaging around 5,000 tokens

Evaluation - Long Context Performance: Needle In A Haystack
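
A minimal sketch of the needle-in-a-haystack setup (the general recipe, not Anthropic's exact harness): a "needle" sentence is inserted at a chosen depth inside long filler text, and the model is asked to retrieve it. `query_llm` is a hypothetical placeholder.

```python
# Sketch of a needle-in-a-haystack probe. `query_llm` is a hypothetical
# placeholder for a long-context model call.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a long-context LLM API call here")

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = ("The best thing to do in San Francisco is to eat a sandwich "
          "in Dolores Park on a sunny day.")
QUESTION = "What is the best thing to do in San Francisco?"

def needle_in_haystack(context_chars: int, depth: float) -> bool:
    """Insert NEEDLE at a relative depth (0.0 = start, 1.0 = end) inside
    filler text of roughly `context_chars` characters, then check whether
    the model retrieves it."""
    haystack = FILLER * max(1, context_chars // len(FILLER))
    pos = int(len(haystack) * depth)
    context = haystack[:pos] + NEEDLE + " " + haystack[pos:]
    answer = query_llm(f"{context}\n\nQuestion: {QUESTION}\nAnswer:")
    return "Dolores Park" in answer

# Example sweep (run once query_llm is wired up) to build the usual
# context-length x insertion-depth heatmap:
# results = {(n, d): needle_in_haystack(n, d)
#            for n in (2_000, 20_000, 200_000)
#            for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```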

Knowledge Conflicts for LLMs: A Survey

Based on the paper of the same name, found here

Knowledge Conflicts can be broadly divided into 3 categories:

Terminology Note:

Overview Diagram:

Methodology: Cause of conflict => Analyzing LLM behavior under conflict => Solutions

Context-memory conflict

This stems from a discrepancy between the context and parametric knowledge and is the most extensively investigated among the three types of conflicts.

Inter-context conflict: when external documents provide conflicting information.

Language models are vulnerable to misinformation:

Intra-memory conflict: discrepancies in a language model’s knowledge stem from training data inconsistencies.

Causes of Intra-Memory (IM) Conflict:

Self Inconsistency

Layered Knowledge Representation: Studies show that LLMs store basic information in early layers and semantic information in deeper layers. Later research found that factual knowledge is concentrated in specific transformer layers, leading to inconsistencies across layers.
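
One way to see this layering directly is a logit-lens-style inspection (a standard technique, not specific to the survey): project each layer's hidden state through the model's unembedding and watch at which layer the factual completion emerges. A minimal sketch with GPT-2; the prompt is illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project every layer's last-token state through the final layer norm and
# the unembedding to see at which layer the answer first appears.
for i, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {i:2d}: {tok.decode(logits.argmax().item())!r}")
```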

Discrepancy in Knowledge Expression: Li et al. (2023c) revealed an issue where correct knowledge stored in an LLM’s parameters may not be accurately expressed during generation. Their experiments showed a 40% gap between knowledge probe accuracy and generation accuracy.

Cross-lingual Inconsistency: LLMs exhibit cross-lingual inconsistencies, with distinct knowledge sets for different languages, leading to discrepancies in information provided across languages.

Key Challenges for IM Conflicts: