LLM Hallucination
- SlideDeck: W9-Team3-P4-hallucination
- Version: current
- Lead team: team-3
- Blog team: team-1
In this session, our readings cover:
Required Readings:
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
- https://arxiv.org/abs/2311.05232
- The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of LLMs in real-world scenarios, which attracts increasing attention to detect and mitigate these hallucinations. In this survey, we aim to provide a thorough and in-depth overview of recent advances in the field of LLM hallucinations. We begin with an innovative taxonomy of LLM hallucinations, then delve into the factors contributing to hallucinations. Subsequently, we present a comprehensive overview of hallucination detection methods and benchmarks. Additionally, representative approaches designed to mitigate hallucinations are introduced accordingly. Finally, we analyze the challenges that highlight the current limitations and formulate open questions, aiming to delineate pathways for future research on hallucinations in LLMs.
More Readings:
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond
- https://arxiv.org/abs/2305.14540
- With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs’ ability to reason about facts and detect inconsistencies when they occur.
Survey of Hallucination in Natural Language Generation
- https://arxiv.org/abs/2202.03629
- Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Ho Shu Chan, Wenliang Dai, Andrea Madotto, Pascale Fung
- Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation; and (3) hallucinations in large language models (LLMs). This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
Do Language Models Know When They’re Hallucinating References?
- https://arxiv.org/abs/2305.18248
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment
- https://arxiv.org/abs/2308.05374
Blog: LLM Hallucination
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
Brief introduction to LLM Hallucinations
- The current definition of hallucinations characterizes them as generated content that is nonsensical or unfaithful to the provided source content.
- These hallucinations are further categorized into intrinsic and extrinsic hallucinations, depending on whether they contradict the source content.
- In LLMs, the scope of hallucination encompasses a broader and more comprehensive concept, primarily centering on factual errors.
- In light of the evolution of the LLM era, the existing hallucination taxonomy needs to be adjusted to enhance its applicability and adaptability.
Types of Hallucinations
- Factuality Hallucination: inconsistent with real-world facts or potentially misleading
- Factual Inconsistency: facts relate to real-world information but contain contradictions
- Factual Fabrication: unverifiable against established real-world knowledge
- Faithfulness Hallucination: inconsistency with user-provided instructions or contextual information
- Instruction inconsistency: deviates from the user's instructions
- Context inconsistency: unfaithful to the provided contextual information
- Logical inconsistency: exhibits internal logical contradictions
Hallucination Causes
- Data
- Training
- Inference
1. Hallucination from Data
- Misinformation and Biases
- Imitative Falsehoods: trained on factual incorrect data
- Duplication Bias: over-prioritize the recall of duplicated data
- Social Biases: Gender, Race
- Knowledge Boundary
- Domain Knowledge Deficiency: lack of proprietary data leads to less expertise
- Outdated Factual Knowledge
- Inferior Data Utilization
- Knowledge Shortcut: overly relying on co-occurrence statistics and relevant document counts
- Knowledge Recall Failures
- Long-tail Knowledge: rare, specialized, or highly specific information not widely known or discussed.
- Complex Scenario: multi-hop reasoning and logical deduction
2. Hallucination from Training
- Hallucination from Pre-training
- Architecture Flaw
- Inadequate Unidirectional Representation: predict the subsequent token based solely on preceding tokens in a left-to-right manner
- Attention Glitches: limitations of soft attention
- attention diluted across positions as sequence length increases
- Exposure Bias: the train-test mismatch introduced by teacher forcing
- Hallucination from Alignment
- Capability Misalignment: mismatch between LLMs’ pre-trained capabilities and the expectations from fine-tuning data
- Belief Misalignment: prioritize appeasing perceived user preferences over truthfulness
3. Hallucination from Inference
- Inherent Sampling Randomness
- Stochastic Sampling: controlled randomness enhances creativity and diversity (a toy sampling sketch follows this list)
- likelihood trap: high-probability, low-quality text
- Imperfect Decoding Representation
- Insufficient Context Attention: prioritize recent or nearby words in attention (Over-Confidence Issue)
- Softmax Bottleneck: inability to represent multi-modal distributions, leading to irrelevant or inaccurate content
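As referenced in the Stochastic Sampling item above, here is a toy sketch (plain NumPy, not tied to any particular LLM stack) of temperature plus nucleus (top-p) sampling over a made-up next-token distribution. The token IDs and values are illustrative; the point is that raising the temperature or p increases diversity but also the chance of drawing low-probability tokens that can seed hallucinations.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Toy temperature + nucleus (top-p) sampling over a next-token distribution."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature scaling: higher temperature flattens the distribution,
    # increasing diversity but also the chance of low-probability tokens.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    # Nucleus filtering: keep the smallest set of tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return rng.choice(kept, p=kept_probs)

# Toy vocabulary: token 0 is the factual continuation, the rest are distractors.
logits = np.array([3.0, 1.0, 0.5, 0.1])
print(sample_next_token(logits, temperature=1.5, top_p=0.95))
```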
Hallucination Detection and Benchmarks
As LLMs have garnered substantial attention in recent times, distinguishing accurate from hallucinated content has become a pivotal concern. Two primary facets span this effort: detection methods and evaluation benchmarks.
Traditional metrics fall short in differentiating the nuanced discrepancies between plausible and hallucinated content, which highlights the necessity of more sophisticated detection methods.
1. Factuality Hallucination Detection
- Retrieve External Facts
Comparing the model-generated content against reliable knowledge sources; the survey illustrates this with an example of detecting a factuality hallucination by retrieving external facts.
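A minimal sketch of this retrieve-then-verify idea is below. `search_knowledge_base` and `llm_judge` are hypothetical stand-ins for a real retriever and an LLM verdict call, and the prompt wording is only illustrative, not the survey's method.

```python
def detect_factuality_hallucination(claim: str,
                                    search_knowledge_base,
                                    llm_judge) -> str:
    """Return 'supported', 'refuted', or 'not enough evidence' for a claim.
    Both callables are hypothetical placeholders for real components."""
    evidence = search_knowledge_base(claim, top_k=3)   # retrieve external facts
    prompt = (
        "Claim: " + claim + "\n"
        "Evidence:\n" + "\n".join(evidence) + "\n"
        "Does the evidence support, refute, or fail to cover the claim? "
        "Answer with one of: supported / refuted / not enough evidence."
    )
    return llm_judge(prompt).strip().lower()
```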
- Uncertainty Estimation
Premise: the origin of LLM hallucinations is inherently tied to the model’s uncertainty.
These methods operate in zero-resource settings and are categorized into two approaches:
- LLM Internal States: operates under the assumption that one can access the model's internal states
- LLM Behavior: leverages only the model's observable behaviors to infer its underlying uncertainty
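For the behavior-only setting, a common recipe (in the spirit of sampling-consistency methods such as SelfCheckGPT, sketched here around a hypothetical `generate` function) is to sample several answers and treat disagreement as a proxy for uncertainty:

```python
from collections import Counter

def behavioral_uncertainty(question: str, generate, n_samples: int = 10) -> float:
    """Estimate uncertainty from observable behavior only: sample several answers
    and measure how often the model disagrees with its own most common answer.
    `generate` is a hypothetical function that samples one answer per call."""
    answers = [generate(question, temperature=1.0) for _ in range(n_samples)]
    most_common_answer, count = Counter(answers).most_common(1)[0]
    # 0.0 = fully self-consistent; values near 1.0 suggest likely hallucination.
    return 1.0 - count / n_samples
```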
2. Faithfulness Hallucination Detection
Focuses on ensuring the alignment of the generated content with the given context, sidestepping the potential pitfalls of extraneous or contradictory output.
- Fact-based Metrics: assess faithfulness by measuring the overlap of facts between the generated content and the source content
- Classifier-based Metrics: utilize trained classifiers to distinguish the level of entailment between the generated content and the source content
- Question-Answering based Metrics: employ question-answering systems to validate the consistency of information between the source content and the generated content
- Uncertainty Estimation: assesses faithfulness by measuring the model's confidence in its generated outputs
- Prompting-based Metrics: LLMs are induced to serve as evaluators, assessing the faithfulness of generated content through specific prompting strategies
Figure 5 of the survey illustrates these five families of detection methods for faithfulness hallucinations.
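As a toy illustration of the fact-based family, here is a crude overlap proxy that checks whether the capitalized terms and numbers in a summary also appear in the source. Real fact-based metrics use proper information extraction, so treat this only as a sketch of the idea.

```python
import re

def fact_overlap_score(source: str, summary: str) -> float:
    """Crude fact-based faithfulness proxy: fraction of capitalized terms and
    numbers in the summary that also appear in the source document."""
    def extract(text: str) -> set:
        return set(re.findall(r"\b(?:[A-Z][a-zA-Z]+|\d[\d.,%]*)\b", text))
    summary_facts = extract(summary)
    if not summary_facts:
        return 1.0   # nothing checkable, treat as trivially faithful
    return len(summary_facts & extract(source)) / len(summary_facts)

# The fabricated revenue figure in the summary lowers the overlap score.
print(fact_overlap_score("Apple reported revenue of $94.8 billion in Q3 2022.",
                         "Apple reported $90 billion revenue in Q3 2022."))
```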
3. Benchmarks
- Hallucination Evaluation Benchmarks
Assess LLMs’ proclivity to produce hallucinations, with a particular emphasis on identifying factual inaccuracies and measuring deviations from original contexts
- Hallucination Detection Benchmarks
Evaluate the performance of existing hallucination detection methods.
Primarily concentrated on task-specific hallucinations, such as abstractive summarization, data-to-text, and machine translation.
Mitigation Strategies
4. Mitigating Data-related Hallucinations
- Mitigating Misinformation and Biases:
- Factuality Data Enhancement: gathering high-quality data, up-sampling factual data during pre-training
- Duplication Bias: eliminating both exact duplicates and near-duplicates from the training data
- Societal Biases: focusing on curated, diverse, balanced, and representative training corpora
- Mitigating Knowledge Boundary:
- Knowledge Editing: modifying model parameters (locate-then-edit methods, meta-learning methods) or preserving model parameters
- Retrieval Augmentation: one-time retrieval, iterative retrieval, post-hoc retrieval (a minimal retrieval sketch follows this list)
- Mitigating Knowledge Shortcut:
- Fine-tuning on a debiased dataset by excluding biased samples
- Mitigating Knowledge Recall Failures:
- Adding relevant information to questions to aid recall, and encouraging LLMs to reason through intermediate steps to improve recall
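As referenced in the Retrieval Augmentation item above, here is a minimal one-time-retrieval sketch. `retriever` and `llm` are hypothetical callables, and the prompt wording is illustrative rather than taken from any surveyed system.

```python
def answer_with_retrieval(question: str, retriever, llm, top_k: int = 3) -> str:
    """One-time retrieval augmentation: fetch supporting passages once, then
    condition generation on them. `retriever` and `llm` are hypothetical callables."""
    passages = retriever(question, top_k=top_k)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```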
5. Mitigating Training-related Hallucination
Mitigating Pretraining-related Hallucination
The majority of research emphasizes the exploration of novel model architectures and the improvement of pre-training objectives
- Mitigating Flawed Model Architecture:
- Mitigating Unidirectional Representation: BATGPT introduces a bidirectional autoregressive approach, enhancing context comprehension by considering both past and future contexts
- Mitigating Attention Glitches: Attention-sharpening regularizers promote sparsity in self-attention, reducing reasoning errors
- Mitigating Suboptimal Pre-training Objective:
- Training Objective: Incorporation of factual contexts as TOPIC PREFIX to ensure accurate entity associations and reduce factual errors
- Exposure Bias: Techniques like intermediate sequence supervision and Minimum Bayes Risk decoding reduce error accumulation and domain-shift hallucinations
Mitigating Misalignment Hallucination
- Improving Human Preference Judgments: Enhancing the quality of human-annotated data and preference models to reduce the propensity for reward hacking and sycophantic responses
- Modifying LLMs' Internal Activations: Fine-tuning with synthetic data by training LLMs on data with truth claims independent of user opinions to curb sycophantic tendencies
Mitigating Inference-related Hallucination
Factuality Enhanced Decoding
- On Standalone Decoding:
- Factual-Nucleus Sampling: Adjusts nucleus probability dynamically for a balance between factual accuracy and output diversity (a simplified sketch follows this list).
- Inference-Time Intervention (ITI): Utilizes activation space directionality for factually correct statements, steering LLMs towards accuracy during inference.
- Post-editing Decoding:
- Chain-of-Verification (COVE): Employs self-correction capabilities to refine generated content through a systematic verification and revision process
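As promised above, here is a simplified take on the dynamic-nucleus idea behind factual-nucleus sampling: the nucleus mass p decays as a sentence unfolds (and would reset at sentence boundaries), trading early diversity for later factual precision. The schedule and constants below are illustrative, not the paper's tuned values.

```python
def dynamic_nucleus_p(step_in_sentence: int, p: float = 0.9,
                      decay: float = 0.9, floor: float = 0.3) -> float:
    """Simplified dynamic nucleus schedule: p shrinks with each token generated
    within a sentence, bounded below by a floor; a caller would reset
    step_in_sentence to 0 at every sentence boundary."""
    return max(floor, p * decay ** step_in_sentence)

# Example: p shrinks from 0.9 toward the 0.3 floor as the sentence unfolds.
print([round(dynamic_nucleus_p(t), 2) for t in range(10)])
```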
Faithfulness Enhanced Decoding
- Context Consistency:
- Context-Aware Decoding (CAD): Adjusting the output distribution to enhance focus on contextual information, balancing between diversity and attribution (a rough sketch follows this list)
- Logical Consistency:
- Knowledge Distillation and Contrastive Decoding: Generating consistent rationale and fine-tuning with counterfactual reasoning to eliminate reasoning shortcuts, ensuring logical progression in multi-step reasoning
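And the rough sketch referenced in the CAD item: context-aware decoding contrasts the next-token logits computed with and without the context, amplifying whatever the context contributes. The formulation below follows the commonly cited (1+α)·with-context minus α·without-context contrast, with an illustrative α and toy values.

```python
import numpy as np

def context_aware_logits(logits_with_context: np.ndarray,
                         logits_without_context: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Contrast the two distributions so tokens supported by the context are
    boosted and tokens the model would have guessed anyway are damped."""
    return (1 + alpha) * logits_with_context - alpha * logits_without_context

# Toy 4-token vocabulary: the context strongly supports token 2.
with_ctx = np.array([1.0, 0.5, 3.0, 0.2])
without_ctx = np.array([1.0, 0.5, 0.8, 0.2])
print(np.argmax(context_aware_logits(with_ctx, without_ctx)))  # -> 2
```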
Challenges and Open Questions
Challenges in LLM Hallucination
- Hallucination in Long-form Text Generation
Absence of manually annotated hallucination benchmarks in the domain of long-form text generation
- Hallucination in Retrieval Augmented Generation
Irrelevant evidence can be propagated into the generation phase, possibly tainting the output
- Hallucination in Large Vision-Language Models
LVLMs sometimes mix or miss parts of the visual context, as well as fail to understand temporal or logical connections between them
Open Questions in LLM Hallucination
- Can Self-Correct Mechanisms Help in Mitigating Reasoning Hallucinations?
LLMs occasionally exhibit unfaithful reasoning, characterized by inconsistencies within the reasoning steps or conclusions that do not logically follow the reasoning chain.
- Can We Accurately Capture LLM Knowledge Boundaries?
LLMs still face challenges in recognizing their own knowledge boundaries. This shortfall leads to the occurrence of hallucinations, where LLMs confidently produce falsehoods without an awareness of their own knowledge limits.
- How Can We Strike a Balance between Creativity and Factuality?
Hallucinations can sometimes offer valuable perspectives, particularly in creative endeavors such as storytelling, brainstorming, and generating solutions that transcend conventional thinking.
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond
LLMs are used to summarize documents across different domains. The summaries must be accurate and factual.
LLMs have some issues as factual reasoners.
- Not all LLMs can generate explanations that locate factual inaccuracies
- Many mislabeled samples of factual inconsistencies are undetected by annotators
Laban et al. discuss LLMs as factual reasoners, propose a new protocol for creating inconsistency detection benchmarks, and release SummEdits, which applies their protocol across 10 domains.
Laban et al. test different LLMs on the FactCC dataset to find which LLMs are potentially factual reasoners.
In-context learning and prompt engineering can be used to steer LLMs toward the desired output.
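For concreteness, a generic zero-shot prompt for this binary classification setup might look like the sketch below; it is in the spirit of the prompts studied in the paper, not a verbatim reproduction.

```python
def consistency_prompt(document: str, summary: str) -> str:
    """A generic zero-shot prompt for binary factual-consistency classification."""
    return (
        "Decide whether the summary is factually consistent with the document.\n\n"
        f"Document:\n{document}\n\n"
        f"Summary:\n{summary}\n\n"
        "Answer with a single word: 'consistent' or 'inconsistent'."
    )
```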
The authors evaluate the factual accuracy of many LLM and non-LLM models.
Their experiment yields a few interesting findings for the binary classification test:
- The specialized non-LLM models outperform most of the LLMs.
- Few-shot prompting improves performance compared to zero-shot (except for GPT-4 and PaLM2).
- Generate-with-Evidence outperforms Chain-of-Thought.
- Persona-based prompting improves GPT3.5-turbo performance.
They also found that the models are mostly accurate when detecting positive samples, but perform poorly at detecting factual inconsistencies, particularly pronoun swaps.
Through manual analysis of the LLM outputs, they found that response explanations for challenging questions were either not given, irrelevant, or plausible but wrong.
The authors also conducted a fine-grained analysis, evaluating each document-sentence pair with respect to individual error types while ignoring other types of errors. The models achieved high recall but low precision, and were unable to distinguish between error types.
The authors also discuss the limitations of existing AggreFact and DialSumEval crowd-sourced benchmarks. The authors filtered out all models that did not achieve a balanced accuracy above 60% on FactCC and used a single Zero-Shot (ZS) prompt for all LLM models on these benchmarks.
The authors conclude there is low reliability for these crowd-sourced benchmarks. Further, the scale of these benchmarks limits their quality and interpretability.
The authors propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as they estimate inter-annotator agreement at about 0.9.
Based on the analysis of previous benchmarks, the authors set several design principles that can help create higher-quality factual consistency benchmarks:
- P1: Binary Classification Task: summary is either consistent or inconsistent
- P2: Focus on Factual Consistency: summary is flawless on attributes unrelated to consistency
- P3: Reproducibility: labels should be independent of annotator
- P4: Benchmark Diversity: inconsistencies should represent a wide range of errors in real textual domains
They introduce a protocol designed to create challenging benchmarks while ensuring the reproducibility of the labels. The protocol involves manually verifying the consistency of a small set of seed summaries and subsequently generating numerous edited versions of these summaries.
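A rough sketch of the edit-generation step of this protocol is shown below; `llm` is a hypothetical completion function and the prompt wording is illustrative. The point is that many cheap, minimally edited variants are derived from each verified seed summary and then labeled by annotators.

```python
def generate_edited_summaries(seed_summary: str, llm, n_edits: int = 30) -> list[str]:
    """Sketch of the edit-generation step: starting from a manually verified seed
    summary, an LLM proposes many minimally edited variants, which annotators then
    label as consistent / inconsistent / borderline. `llm` is hypothetical."""
    prompt = (
        "Produce a slightly edited version of the summary below. Change only a "
        "few words (e.g., swap an entity, a number, or a polarity) and keep the "
        "rest identical.\n\nSummary:\n" + seed_summary
    )
    return [llm(prompt, temperature=1.0) for _ in range(n_edits)]
```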
The procedure, together with example samples produced by the protocol, is illustrated in the figures below.
The SummEdits benchmark was created by implementing the protocol in ten diverse textual domains, including the legal, dialogue, academic, financial, and sales domains. Specifically, it contains:
- News: Articles and summaries from Google News top events from February 2023
- Podcasts: 40 transcripts from the Spotify dataset, with automatic summaries
- BillSum: 40 US bills and their summaries
- SamSum: 40 dialogues and their summaries from a dialogue summarization dataset
- Shakespeare: 40 scenes, with automatic summaries
- SciTLDR: 40 research paper abstracts and their summaries
- QMSum: 40 documents and summaries from a query-based meeting summarization dataset
- ECTSum: 40 documents from a financial earnings call dataset, with automatic summaries
- Sales Call & Email: 40 fictional sales calls & emails generated along with summaries
For the statistics of SummEdits, the authors report that
- At least 20% of each domain's samples were annotated by multiple annotators
- Cohen's Kappa varied between 0.72 and 0.90 across domains when considering the three labels, averaging 0.82
- After removing 'borderline' samples, the average Kappa rose to 0.92, indicating high agreement
- Total cost: $3,000 for 150 hours of annotator work
- Average domain cost is $300
- Using the processes of other benchmarks, in which each sample can require 30 minutes of annotator time (as in the FRANK benchmark), would have increased the cost roughly 20-fold
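For reference, the agreement numbers reported here are Cohen's Kappa, which can be computed with scikit-learn as shown below; the annotator labels in this snippet are made up purely to show the computation.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same samples
# (consistent / inconsistent / borderline), just to illustrate the computation.
annotator_a = ["consistent", "inconsistent", "borderline", "consistent", "inconsistent"]
annotator_b = ["consistent", "inconsistent", "consistent", "consistent", "inconsistent"]

print(cohen_kappa_score(annotator_a, annotator_b))
```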
The following table reports the average performance of specialized models, LLMs with a zero-shot prompt, an oracle version of the LLM that has access to additional information, and an estimate of human performance computed on the subset of the benchmark that was annotated by multiple annotators.
From the table, we can see that
- Performance is low overall: only GPT-4 comes within 10% of human performance
- Only 4 LLMs outperform the non-LLM QAFactEval, suggesting most LLMs are not capable of reasoning about the consistency of facts out of the box
- Specialized models performed best on News, probably because it is similar to their training data
- BillSum and Shakespeare are particularly challenging
- Oracle test: the model is given the document, the seed summary, and the edited summary
- This yields a large boost in performance, within 2% of human performance
- This shows that high performance is indeed attainable
To gain more specific insights into the types of edits present in SummEdits, the authors annotated each inconsistent sample in the benchmark with tags of the edit types that lead to factual inconsistency, covering the following four edit types:
- Entity Modification
- Antonym Swap
- Hallucinated Fact Insertion
- Negation Insertion
- SummEdits distribution: 78% of inconsistent summaries contain entity modification, 48% antonym swap, 22% hallucinated fact insertion, 18% negation insertion
- Distribution influenced by the LLM used to produce the edits
Table 10 presents model performance across each of the edit types. Additionally, the authors grouped inconsistent summaries by the number of distinct edit types they contain (1 to 4) and computed model performance on each group, with results summarized in Table 11.
In conclusion, the authors of this paper
- simplified the annotation process for improved reproducibility
- created the SummEdits benchmark, which spans 10 domains
- Highly reproducible and more cost-effective than previous benchmarks
- Challenging for most current LLMs
- A valuable tool for evaluating LLMs' ability to reason about facts and detect factual errors
- encouraged LLM developers to report their performance on the benchmark
Survey of Hallucination in Natural Language Generation
Link: https://arxiv.org/abs/2202.03629
Following previous works, the authors categorize different hallucinations into two main types, namely intrinsic hallucination and extrinsic hallucination:
The authors of this paper present a general overview of evaluation metrics and mitigation methods for different NLG tasks, which is summarized here:
References
- Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., … & Liu, T. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.
- Laban, P., Kryściński, W., Agarwal, D., Fabbri, A. R., Xiong, C., Joty, S., & Wu, C. S. (2023). LLMs as factual reasoners: Insights from existing benchmarks and beyond. arXiv preprint arXiv:2305.14540.
- Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., … & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1-38.