LLM Hallucination


In this session, our readings cover:

Required Readings:

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

More Readings:

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

Survey of Hallucination in Natural Language Generation

Do Language Models Know When They’re Hallucinating References?

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment

Blog: LLM Hallucination

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Brief introduction to LLM Hallucinations

Types of Hallucinations

  - Factual Inconsistency: the output involves real-world facts but contradicts them

  - Factual Fabrication: the output is unverifiable against established real-world knowledge

  - Instruction Inconsistency: the output deviates from the user's instructions

  - Context Inconsistency: the output is unfaithful to the provided contextual information

  - Logical Inconsistency: the output exhibits internal logical contradictions
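To make the taxonomy above concrete, here is a minimal illustration with invented prompt/output pairs (these examples are ours, for illustration only, and are not taken from the survey):

```python
# Hypothetical prompt/output pairs illustrating each hallucination type.
# All examples are invented for illustration; they are not from the survey.
examples = {
    "factual_inconsistency": {
        "prompt": "Who was the first person to walk on the Moon?",
        "output": "Buzz Aldrin was the first person to walk on the Moon.",  # real entities, wrong fact
    },
    "factual_fabrication": {
        "prompt": "Summarize the 1897 Treaty of Eldoria.",
        "output": "The Treaty of Eldoria ended the Eldorian War in 1897.",  # invented, unverifiable
    },
    "instruction_inconsistency": {
        "prompt": "Translate 'good morning' into French.",
        "output": "'Good morning' is a common English greeting.",  # ignores the instruction
    },
    "context_inconsistency": {
        "context": "The report states revenue fell 5% in Q3.",
        "output": "Revenue grew 5% in Q3.",  # contradicts the provided context
    },
    "logical_inconsistency": {
        "prompt": "Solve 2x + 4 = 10 step by step.",
        "output": "2x = 6, so x = 4.",  # internally contradictory reasoning
    },
}

for kind, example in examples.items():
    print(kind, "->", example["output"])
```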

Hallucination Causes

1. Hallucination from Data

  - Imitative Falsehoods: trained on factually incorrect data

  - Duplication Bias: over-prioritize the recall of duplicated data

  - Social Biases: Gender, Race 

  - Domain Knowledge Deficiency: lack of proprietary domain data leads to less expertise

  - Outdated Factual Knowledge

  - Knowledge Shortcut: overly relying on co-occurrence statistics and relevant document counts

  - Knowledge Recall Failures

    - Long-tail Knowledge: rare, specialized, or highly specific information not widely known or discussed.

    - Complex Scenario: multi-hop reasoning and logical deduction

2. Hallucination from Training

  - Architecture Flaw

    - Inadequate Unidirectional Representation: predict the subsequent token based solely on preceding tokens in a left-to-right manner

    - Attention Glitches: limitations of soft attention 

      - attention diluted across positions as sequence length increases (see the sketch after this list)

  - Exposure Bias: the train-inference mismatch introduced by teacher forcing

  - Capability Misalignment: mismatch between LLMs’ pre-trained capabilities and the expectations from fine-tuning data

  - Belief Misalignment: prioritize appeasing perceived user preferences over truthfulness
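To illustrate the attention-dilution point referenced above, here is a toy NumPy sketch: with attention logits of comparable scale, the softmax weight available to any single position shrinks and the distribution's entropy grows as the sequence gets longer. The random logits are only a stand-in for a trained model's scores, so treat this as intuition rather than a claim about any specific architecture.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

rng = np.random.default_rng(0)
for seq_len in (16, 128, 1024):
    # Random attention logits of comparable scale stand in for a trained model's scores.
    weights = softmax(rng.normal(size=seq_len))
    entropy = -(weights * np.log(weights)).sum()
    print(f"seq_len={seq_len:5d}  max weight={weights.max():.3f}  entropy={entropy:.2f}")
```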

3. Hallucination from Inference

  - Stochastic Sampling: controlled randomness enhances creativity and diversity (see the sampling sketch after this list)

  - Likelihood Trap: high-probability but low-quality text

  - Insufficient Context Attention: prioritizing recent or nearby words in attention (the over-confidence issue)

  - Softmax Bottleneck: the softmax layer's inability to represent multi-modal output distributions, which can yield irrelevant or inaccurate content
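As a rough illustration of the sampling-related causes above, the sketch below applies temperature and nucleus (top-p) sampling to a made-up next-token distribution; raising the temperature lets lower-quality alternatives steal probability mass from the "correct" token. The vocabulary and logit values are invented for the demo.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature + nucleus (top-p) sampling over a toy vocabulary."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

# Toy next-token distribution: index 0 is the "factually correct" continuation.
logits = np.array([3.0, 1.5, 1.0, 0.5, 0.0])
rng = np.random.default_rng(0)
for temperature in (0.2, 1.0, 1.5):
    draws = [sample_next_token(logits, temperature, top_p=0.9, rng=rng) for _ in range(1000)]
    print(f"T={temperature}: correct token sampled {100 * draws.count(0) / 1000:.1f}% of the time")
```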

Hallucination Detection and Benchmarks

As LLMs have garnered substantial attention, distinguishing accurate from hallucinated content has become a pivotal concern. Two primary facets encompass the broad spectrum of hallucination mitigation: detection mechanisms and evaluation benchmarks.

Traditional metrics fall short in differentiating the nuanced discrepancies between plausible and hallucinated content, which highlights the necessity of more sophisticated detection methods.

1. Factuality Hallucination Detection

Compares model-generated content against reliable knowledge sources, for example by retrieving external facts and checking the generated claims against them.

A complementary direction is uncertainty estimation, under the premise that the origin of LLM hallucinations is inherently tied to the model's uncertainty. In zero-resource settings (no external knowledge base), these methods fall into two approaches:

  1. LLM Internal States: operates under the assumption that one can access the model’s internal state

  2. LLM Behavior: leveraging solely the model’s observable behaviors to infer its underlying uncertainty
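A common behavior-based instantiation of this idea (in the spirit of sampling-based self-consistency checks such as SelfCheckGPT) is to sample several answers to the same question and flag cases where the samples disagree. A minimal sketch, where `ask_llm` is a hypothetical stand-in for whatever model client you use:

```python
from collections import Counter

def ask_llm(question: str, temperature: float = 1.0) -> str:
    """Hypothetical LLM call; replace with your actual API client."""
    raise NotImplementedError

def behavior_based_hallucination_score(question: str, n_samples: int = 5) -> float:
    """Higher score = more disagreement across samples = more likely hallucinated."""
    answers = [ask_llm(question, temperature=1.0) for _ in range(n_samples)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - most_common_count / n_samples

# A score near 0 means the model answers consistently (more likely grounded);
# a score near 1 means it answers differently every time (a hallucination signal).
```

Real systems compare samples at the claim or sentence level (for example with NLI or QA models) rather than exact string matching of whole answers; the exact-match version above is only the simplest possible sketch.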

2. Faithfulness Hallucination Detection

Focuses on ensuring the alignment of the generated content with the given context, sidestepping the potential pitfalls of extraneous or contradictory output.

Figure 5 of the survey illustrates the detection methods for faithfulness hallucinations:

  - Fact-based Metrics: assess faithfulness by measuring the overlap of facts between the generated content and the source content

  - Classifier-based Metrics: use trained classifiers to judge the level of entailment between the generated content and the source content

  - QA-based Metrics: employ question-answering systems to validate the consistency of information between the source and the generated content

  - Uncertainty Estimation: assesses faithfulness by measuring the model's confidence in its generated outputs

  - Prompting-based Metrics: LLMs are induced to serve as evaluators, assessing the faithfulness of generated content through specific prompting strategies
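As a deliberately simplified instance of the fact-based family listed above, the sketch below scores faithfulness as the fraction of the summary's entities and numbers that also appear in the source. Real fact-based metrics use information extraction or NER; the regex here is only a crude proxy for "fact units".

```python
import re

def content_tokens(text: str) -> set[str]:
    """Crude 'fact unit' proxy: capitalized words and numbers."""
    return set(re.findall(r"\b(?:[A-Z][a-zA-Z]+|\d[\d.,%]*)\b", text))

def fact_overlap_score(source: str, summary: str) -> float:
    """Fraction of the summary's entities/numbers that are supported by the source."""
    summary_facts = content_tokens(summary)
    if not summary_facts:
        return 1.0
    return len(summary_facts & content_tokens(source)) / len(summary_facts)

source = "Acme Corp reported revenue of $2.1 billion in Q3 2023, up 4% year over year."
faithful = "Acme Corp's Q3 2023 revenue was $2.1 billion."
unfaithful = "Acme Corp's Q3 2023 revenue was $3.4 billion."
print(fact_overlap_score(source, faithful))    # high: all entities/numbers are supported
print(fact_overlap_score(source, unfaithful))  # lower: 3.4 billion is unsupported
```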

3. Benchmarks

Hallucination benchmarks serve two purposes: they assess LLMs' proclivity to produce hallucinations, with particular emphasis on identifying factual inaccuracies and measuring deviations from original contexts, and they evaluate the performance of existing hallucination detection methods.

Earlier benchmarks primarily concentrated on task-specific hallucinations, such as abstractive summarization, data-to-text, and machine translation.

Mitigation Strategies

4. Mitigating Data-related Hallucinations

  - Factuality Data Enhancement: Gathering high-quality data, Up-sampling factual data during the pre-training

  - Duplication Bias: de-duplicating the training data (both exact duplicates and near-duplicates)

  - Societal Biases: Focusing on curated, diverse, balanced, and representative training corpora

  - Knowledge Editing: Modifying Model Parameters (Locate-then-edit methods, Meta-learning methods), Preserving Model Parameters

  - Retrieval Augmentation: One-time Retrieval, Iterative Retrieval, Post-hoc Retrieval (a one-time retrieval sketch follows this list)

  - Fine-tuning on a debiased dataset by excluding biased samples

  - Adding relevant information to questions to aid recall, and encouraging LLMs to reason through steps (e.g., chain-of-thought) to improve recall
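As noted in the retrieval-augmentation bullet above, one-time retrieval reduces to: fetch evidence once, prepend it to the prompt, and instruct the model to answer only from that evidence. A minimal sketch, where `search` and `ask_llm` are hypothetical stand-ins for a retriever and a model client:

```python
def search(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever (e.g., a vector store or web search); returns k passages."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your actual API client."""
    raise NotImplementedError

def answer_with_one_time_retrieval(question: str) -> str:
    passages = search(question, k=3)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the passages below. "
        'If the passages do not contain the answer, say "I don\'t know."\n\n'
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return ask_llm(prompt)
```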


5. Mitigating Training-related Hallucination

Mitigating Pretraining-related Hallucination

The majority of research emphasizes the exploration of novel model architectures and the improvement of pre-training objectives

  - Mitigating Unidirectional Representation: BATGPT introduces a bidirectional autoregressive approach, enhancing context comprehension by considering both past and future contexts

  - Mitigating Attention Glitches: Attention-sharpening regularizers promote sparsity in self-attention, reducing reasoning errors

  - Training Objective: Incorporation of factual contexts as TOPIC PREFIX to ensure accurate entity associations and reduce factual errors 

  - Exposure Bias: Techniques like intermediate sequence supervision and Minimum Bayes Risk decoding reduce error accumulation and domain-shift hallucinations 

Mitigating Misalignment Hallucination

Mitigating Inference-related Hallucination

Factuality Enhanced Decoding

  - Factual-Nucleus Sampling: Adjusts nucleus probability dynamically for a balance between factual accuracy and output diversity (see the sketch after this list).

  - Inference-Time Intervention (ITI): Utilizes activation space directionality for factually correct statements, steering LLMs towards accuracy during inference.

  - Chain-of-Verification (COVE): Employs self-correction capabilities to refine generated content through a systematic verification and revision process 
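For the factual-nucleus sampling idea referenced above, here is a sketch of the dynamic top-p schedule: the nucleus mass decays as a sentence progresses (later tokens tend to carry the factual content) and resets at sentence boundaries. The decay-and-reset form follows our reading of the cited work; the parameter names (`p`, `decay`, `floor`) and values are illustrative, not the paper's exact settings.

```python
def factual_nucleus_p(step_in_sentence: int, p: float = 0.9,
                      decay: float = 0.9, floor: float = 0.3) -> float:
    """Dynamic top-p value: starts at p, decays within a sentence, never below floor.

    step_in_sentence is 1 for the first token after a sentence boundary.
    """
    return max(floor, p * decay ** (step_in_sentence - 1))

step = 1
for token in ["The", "Eiffel", "Tower", "is", "in", "Paris", ".", "It", "was"]:
    print(f"{token!r:10s} top-p = {factual_nucleus_p(step):.3f}")
    step = 1 if token in ".!?" else step + 1  # reset at sentence boundaries
```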

Faithfulness Enhanced Decoding

  - Context-Aware Decoding (CAD): Adjusting the output distribution to enhance focus on contextual information, balancing between diversity and attribution (see the sketch after this list)

  - Knowledge Distillation and Contrastive Decoding: Generating consistent rationale and fine-tuning with counterfactual reasoning to eliminate reasoning shortcuts, ensuring logical progression in multi-step reasoning 
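The CAD adjustment referenced above can be sketched as a simple logit contrast between a run that sees the context and a run that does not; the (1 + alpha) / -alpha weighting below follows our reading of the context-aware decoding formulation, and the toy logits are invented for illustration.

```python
import numpy as np

def context_aware_logits(logits_with_context: np.ndarray,
                         logits_without_context: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Contrast the two runs so tokens supported by the context are boosted.

    Roughly: (1 + alpha) * logits(y | context, x) - alpha * logits(y | x);
    alpha = 0 recovers ordinary decoding.
    """
    return (1 + alpha) * logits_with_context - alpha * logits_without_context

# Toy 4-token vocabulary; values are made up for illustration.
with_ctx = np.array([2.0, 1.0, 0.5, 0.1])     # the context makes token 0 likely
without_ctx = np.array([0.5, 1.8, 0.5, 0.1])  # the model's prior prefers token 1
adjusted = context_aware_logits(with_ctx, without_ctx, alpha=1.0)
print(adjusted.argmax())  # token 0: the context-supported choice wins
```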

Challenges and Open Questions

Challenges in LLM Hallucination

Absence of manually annotated hallucination benchmarks in the domain of long-form text generation

Irrelevant evidence can be propagated into the generation phase, possibly tainting the output

LVLMs sometimes mix or miss parts of the visual context, as well as fail to understand temporal or logical connections between them

Open Questions in LLM Hallucination

Occasionally exhibit unfaithful reasoning characterized by inconsistencies within the reasoning steps or conclusions that do not logically follow the reasoning chain.

LLMs still face challenges in recognizing their own knowledge boundaries. This shortfall leads to the occurrence of hallucinations, where LLMs confidently produce falsehoods without an awareness of their own knowledge limits.

Hallucinations can sometimes offer valuable perspectives, particularly in creative endeavors such as storytelling, brainstorming, and generating solutions that transcend conventional thinking.

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

LLMs are used to summarize documents across different domains. The summarizations must be accurate and factual.

LLMs have some issues as factual reasoners. 

  1. Not all LLMs can generate explanations that locate factual inaccuracies

  2. Many mislabeled samples of factual inconsistencies are undetected by annotators. 

Laban et al. discuss LLMs as factual reasoners, propose a new protocol for creating inconsistency detection benchmarks, and release SummEdits, which applies their protocol across 10 domains.

Laban et al. test different LLMs on the FactCC dataset to find which LLMs are potentially factual reasoners.

In-context learning and prompt engineering can optimize the desired output of LLMs.
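For instance, a zero-shot binary-classification prompt in this spirit might look like the sketch below; the wording is illustrative, not the authors' actual prompt.

```python
ZERO_SHOT_TEMPLATE = """\
Decide if the summary is factually consistent with the document.

Document:
{document}

Summary:
{summary}

Answer with exactly one word, "consistent" or "inconsistent":"""

def build_prompt(document: str, summary: str) -> str:
    return ZERO_SHOT_TEMPLATE.format(document=document, summary=summary)

print(build_prompt("The company hired 50 engineers in 2022.",
                   "The company hired 500 engineers in 2022."))
```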

The authors evaluate the factual accuracy of many LLMs and non-LLM models.

Their experiments yield a few interesting findings for the binary classification test.

They found that the models are mostly accurate when detecting positive (consistent) samples, but are very bad at detecting factual inconsistencies, particularly pronoun swaps.

Through manual analysis of the LLM outputs, they found that response explanations for challenging questions were either not given, irrelevant, or plausible but wrong. 

The authors also conducted a fine-grained analysis, evaluating each document-sentence pair with respect to individual error types while ignoring other types of errors. They recorded low precision but high recall, and found that the models were not able to distinguish between error types.

The authors also discuss the limitations of existing AggreFact and DialSumEval crowd-sourced benchmarks. The authors filtered out all models that did not achieve a balanced accuracy above 60% on FactCC and used a single Zero-Shot (ZS) prompt for all LLM models on these benchmarks.

The authors conclude there is low reliability for these crowd-sourced benchmarks. Further, the scale of these benchmarks limits their quality and interpretability. 

The authors propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as they estimate inter-annotator agreement at about 0.9.

Based on the analysis of previous benchmarks, the authors set several design principles that can help create higher-quality factual consistency benchmarks.

They introduce a protocol designed to create challenging benchmarks while ensuring the reproducibility of the labels. The protocol involves manually verifying the consistency of a small set of seed summaries and subsequently generating numerous edited versions of these summaries. 

More details are shown as follows

The procedure is visualized below

Some example samples produced by the protocol are presented as follows

The SummEdits benchmark was created by implementing the protocol in ten diverse textual domains, including the legal, dialogue, academic, financial, and sales domains.

For the statistics of SummEdits, the authors report that

  - After removing 'borderline' samples, the average Cohen's kappa rose to 0.92, indicating high inter-annotator agreement (a minimal kappa computation is sketched after this list)

  - Average domain cost is $300

  - If each sample required 30 minutes of annotator time, as in the FRANK benchmark, the annotation cost would be far higher, underscoring the protocol's cost-effectiveness
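For reference, the kappa values mentioned above correct raw agreement for chance agreement. Here is a minimal two-annotator Cohen's kappa computation on toy labels (not the benchmark's data):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e): observed vs. chance-expected agreement."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Toy labels for two annotators over ten samples.
a = ["consistent"] * 6 + ["inconsistent"] * 4
b = ["consistent"] * 5 + ["inconsistent"] * 5
print(round(cohens_kappa(a, b), 3))  # 0.8 on this toy example
```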

The following table reports the average performance of specialized models, LLMs with a zero-shot prompt, an oracle version of the LLM in which it has access to additional information, and an estimate of human performance computed on the subset of the benchmark that was plurally annotated.

From the table, we can see that:

  - In the oracle setting, where the LLM has access to additional information, there is a large boost in performance, to within 2% of human performance

  - This shows that high performance on the benchmark is indeed attainable

To gain more specific insights into the types of edits present in SummEdits, the authors annotated each inconsistent sample in the benchmark with tags of edit types that lead to factual inconsistency, including the following four edit types:

  - SummEdits distribution: 78% of inconsistent summaries contain entity modification, 48% antonym swap, 22% hallucinated fact insertion, 18% negation insertion

    - Distribution influenced by the LLM used to produce the edits

Table 10 presents model performance across each of the edit types. Additionally, the authors grouped inconsistent summaries by the number of distinct edit types they contain (1 to 4) and computed model performance on each group, with results summarized in Table 11.

In conclusion, the authors of this paper present SummEdits, a benchmark that is:

  - Highly reproducible and more cost-effective than previous benchmarks

  - Challenging for most current LLMs

  - A valuable tool for evaluating LLMs’ ability to reason about facts and detect factual errors

Survey of Hallucination in Natural Language Generation

Link: https://arxiv.org/abs/2202.03629

Following previous works, the authors categorize different hallucinations into two main types, namely intrinsic hallucination and extrinsic hallucination:

The authors of this paper present a general overview of evaluation metrics and mitigation methods for different NLG tasks.

References