Self-exam LLM and reasoning

Reasoning

In this session, our readings cover:

Required Readings:

Augmented Language Models: a Survey

Self-Consistency Improves Chain of Thought Reasoning in Language Models

If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

More Readings:

ReAct: Synergizing Reasoning and Acting in Language Models

Towards Reasoning in Large Language Models: A Survey

Large Language Models Can Self-Improve

Orca 2: Teaching Small Language Models How to Reason

Blog: Self-Exam LLM and Reasoning

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Chain of Thought (CoT)

Chain-of-thought prompting combined with pre-trained large language models has achieved promising results on complex reasoning tasks. This paper proposes a new decoding strategy, named self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. Instead of only taking the greedy path, it first samples a diverse set of reasoning paths and then selects the most consistent answer by marginalizing out the sampled reasoning paths.

In this image, we demonstrate how greedy decoding works. However, there could be cases where multiple paths exist. In the next image, we will have a look at an example.

We can see that the scrambled string “LSTETRE” can be unscrambled into a valid English word through different combinations of characters at each stage. While options 1 and 2 form the valid word “LETTERS” in 2 steps, option 3 forms the same word in 3 steps with different combinations of characters at each stage.


Here is an example of Self-Consistency. The self-consistency method contains three steps: (1) prompt a language model using chain-of-thought (CoT) prompting; (2) replace the “greedy decode” in CoT prompting by sampling from the language model’s decoder to generate a diverse set of reasoning paths; and (3) marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.
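To make these three steps concrete, here is a minimal sketch in Python (not code from the paper): `sample_completion` is assumed to be any function that draws one completion from a language model’s decoder, e.g., a wrapper around a text-completion API.

```python
from collections import Counter
import re

def extract_answer(reasoning_path: str) -> str:
    """Pull the final number out of a sampled chain-of-thought completion."""
    numbers = re.findall(r"-?\d+\.?\d*", reasoning_path)
    return numbers[-1] if numbers else ""

def self_consistency(cot_prompt: str, question: str, sample_completion,
                     num_paths: int = 40, temperature: float = 0.7) -> str:
    """(1) Prompt with CoT exemplars, (2) sample diverse reasoning paths instead of
    greedy decoding, (3) marginalize out the paths via a majority vote on answers."""
    prompt = f"{cot_prompt}\nQ: {question}\nA:"
    answers = []
    for _ in range(num_paths):
        path = sample_completion(prompt, temperature)   # step 2: stochastic decoding
        answers.append(extract_answer(path))            # keep only the final answer
    votes = Counter(a for a in answers if a)            # step 3: aggregate over paths
    return votes.most_common(1)[0][0]
```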

This figure shows the aggregation strategy. First, a language model is prompted with a set of manually written chain-of-thought examples. Next, a set of candidate outputs are sampled from the language model’s decoder, generating a diverse set of candidate reasoning paths. Self-consistency is compatible with most existing sampling algorithms, including temperature sampling, top-k sampling, and nucleus sampling. Finally, the answers are aggregated by marginalizing out the sampled reasoning paths and choosing the answer that is the most consistent among the generated answers.
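Because self-consistency only needs some way of drawing diverse samples, any of these schemes can be plugged in. The snippet below is a generic illustration (not from the paper) of how temperature, top-k, and nucleus (top-p) filtering are typically applied to a vector of next-token logits.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.7,
                      top_k: int = 40, top_p: float = 0.95) -> int:
    """Sample one token id from next-token logits with temperature, top-k, and top-p."""
    logits = logits / temperature                       # sharpen or flatten the distribution

    # Top-k: keep only the k highest-scoring tokens.
    kth_best = np.sort(logits)[-top_k]
    logits = np.where(logits < kth_best, -np.inf, logits)

    # Nucleus (top-p): keep the smallest set of tokens whose total mass >= top_p.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return int(np.random.choice(len(probs), p=filtered))
```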

Table 1 shows the test accuracy over a set of reasoning tasks using different answer aggregation strategies. It can be observed that the unweighted sum strategy performs best across the reasoning datasets. Here are examples where self-consistency improved performance over greedy decoding.
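The strategies in Table 1 differ only in how each sampled answer is weighted before voting. A rough sketch, assuming each sampled path has been reduced to an (answer, probability) pair where the probability is the model’s normalized likelihood of that reasoning path:

```python
from collections import defaultdict

def aggregate(paths, weighted: bool = False) -> str:
    """paths: list of (answer, probability) pairs from sampled reasoning paths.
    Unweighted sum is a plain majority vote; the weighted sum scores each answer
    by the total probability of the paths that produced it."""
    scores = defaultdict(float)
    for answer, probability in paths:
        scores[answer] += probability if weighted else 1.0
    return max(scores, key=scores.get)

paths = [("18", 0.30), ("18", 0.25), ("26", 0.40)]
print(aggregate(paths))                 # '18' -- unweighted (majority) vote: 2 paths vs 1
print(aggregate(paths, weighted=True))  # '18' -- weighted sum: 0.55 vs 0.40
```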

Experimental Setup

Tasks and associated datasets. Self-consistency was evaluated on the following reasoning benchmarks.

Language models and prompts. Self-consistency was also evaluated over four transformer-based language models with varying scales:

UL2 is an encoder-decoder model with 20 billion parameters trained on a mixture of denoisers. UL2 is completely open-sourced and achieves similar or better performance than GPT-3 on zero-shot SuperGLUE with only 20B parameters, making it more compute-friendly.

Main Results

This figure shows the arithmetic reasoning accuracy of self-consistency compared to chain-of-thought prompting. Self-consistency significantly improves arithmetic reasoning performance over chain-of-thought prompting across all four language models. With self-consistency, new state-of-the-art results are achieved on almost all tasks.

Here is the commonsense and symbolic reasoning accuracy of self-consistency compared to chain-of-thought prompting. Self-consistency yields large gains across all four language models and obtains SoTA results on 5 out of 6 tasks. For symbolic reasoning, the authors test an out-of-distribution (OOD) setting where the input prompt contains 2-letter or 2-flip examples but the test examples use 4 letters or 4 flips. In this challenging OOD setting, the gain of self-consistency is still quite significant compared to CoT prompting with sufficient model sizes.

To show the effect of the number of sampled reasoning paths, the authors have plotted the accuracy (mean and standard deviation over 10 runs) with respect to varying numbers of sampled paths (1, 5, 10, 20, 40) in Figure 2. The results show that sampling a higher number (e.g., 40) of reasoning paths leads to a consistently better performance, further emphasizing the importance of introducing diversity in the reasoning paths.

Self-Consistency vs Chain of Thought

Chain-of-thought can hurt performance compared to standard prompting in few-shot in-context learning.

Self-consistency can robustly boost the performance and outperform standard prompting, making it a reliable way to add rationales in few-shot in-context learning for common NLP tasks.

Self-Consistency vs Sample-and-Rank

What is Sample-and-Rank?

Sample-and-rank samples multiple sequences from the decoder and ranks them by their log probability. The authors compared self-consistency with sample-and-rank on GPT-3 code-davinci-001. Sample-and-rank slightly improves accuracy with more samples, but not as much as self-consistency.
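For contrast with the majority vote used by self-consistency, here is a rough sketch of sample-and-rank under the same interface as the earlier snippet: it keeps only the single highest-probability sampled sequence, so agreement across reasoning paths is ignored. The (answer, log-probability) pairs are assumed inputs.

```python
def sample_and_rank(samples):
    """samples: list of (answer, sequence_log_prob) pairs from sampled outputs.
    Return the answer of the single highest-ranked sequence."""
    best_answer, _ = max(samples, key=lambda pair: pair[1])
    return best_answer

samples = [("26", -4.1), ("18", -4.6), ("18", -5.0)]
print(sample_and_rank(samples))  # '26' -- the top-ranked path wins even though most paths say '18'
```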

Accuracy is reported with the same number of beams and reasoning paths.

Self-consistency can also adopt beam search.

In self-consistency, the diversity of the reasoning paths is the key to better performance.

Self-Consistency vs Ensemble-Based Approaches

Robustness to Sampling Strategies

Robust to sampling strategies and parameters

Robustness to Scaling

Self-consistency robustly improves performance across all scales for the LaMDA-137B model series. The gain is relatively lower for smaller models because certain abilities (e.g., arithmetic) only emerge once the model reaches a sufficient scale.

Prompt Robustness

Improves robustness to imperfect prompts

Self-Consistency Robustness

Consistency is highly correlated with accuracy

Self-consistency can be used to provide an uncertainty estimate of the model
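A minimal sketch of this idea (illustrative, not from the paper): the fraction of sampled paths that agree with the majority answer can serve as a confidence score for the prediction.

```python
from collections import Counter

def answer_with_confidence(sampled_answers):
    """Return the majority answer and the fraction of sampled paths agreeing with it."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)

answer, confidence = answer_with_confidence(["5", "5", "5", "7", "5"])
print(answer, confidence)  # '5' 0.8 -- low agreement would flag an unreliable prediction
```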

Non-NL Reasoning Paths

The authors tested the generality of the self-consistency concept to alternative forms of intermediate reasoning like equations (e.g., from “There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars.” to “3 + 2 = 5”).

Compared to generating natural-language reasoning paths, the gain is smaller because the equations are much shorter, leaving less opportunity for diversity in the decoding process.

Zero-Shot Learning

Self-consistency also works for zero-shot CoT and improves the results significantly (+26.2%), as shown in Table 8.
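Zero-shot CoT needs no hand-written exemplars: a reasoning trigger such as “Let's think step by step.” is appended to the question, and self-consistency then samples and votes exactly as before. A tiny sketch, reusing the hypothetical sampling callable and the `extract_answer` helper from the earlier sketch:

```python
from collections import Counter

def zero_shot_self_consistency(question: str, sample_completion, num_paths: int = 40) -> str:
    """Zero-shot CoT + self-consistency: no exemplars, just a trigger phrase, then vote."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers = [extract_answer(sample_completion(prompt, 0.7)) for _ in range(num_paths)]
    return Counter(a for a in answers if a).most_common(1)[0][0]
```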

Language models struggle with Type 2 tasks

Re-ranking

Self-consistency more widely applicable

Discussion

Self-consistency improves task accuracy

Limitations

Use self-consistency to generate better supervised data

Language models sometimes generate nonsensical reasoning paths

Augmented Language Models: a Survey

Mialon et al., in their paper “Augmented Language Models: a Survey”, discuss how LLMs are augmented with reasoning and tools to overcome some of their inherent limitations.

More specifically, LLMs suffer from hallucinations, are optimized to perform on a limited statistical context (next token prediction), and are expensive to retrain and keep up to date due to their size and need for large amounts of data.

The authors define Reasoning, Tools, and Actions as follows:

Reasoning in LLMs can be elicited in a few ways. First, reasoning can be evoked through prompting techniques such as chain-of-thought prompting, self-ask, and self-consistency:

Reasoning can also be evoked through recursive prompting, which breaks the problem at hand down into sub-problems; this includes the least-to-most prompting and decomposed prompting techniques. Finally, LLMs can be explicitly taught to reason: for example, they can be trained to perform multi-step computations by asking them to emit intermediate computation steps into a “scratchpad”. A minimal sketch of the recursive prompting idea is given below.
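As a rough illustration (not code from the survey), here is a minimal least-to-most prompting sketch; `complete` stands in for any LLM text-completion call and is an assumption of this example.

```python
def least_to_most(question: str, complete) -> str:
    """Hypothetical sketch of least-to-most (recursive) prompting."""
    # Stage 1: ask the model to decompose the problem into simpler sub-questions.
    decompose_prompt = (
        "Break the following problem into a short list of simpler sub-questions, "
        "one per line.\n\nProblem: " + question
    )
    sub_questions = [
        line.strip() for line in complete(decompose_prompt).splitlines() if line.strip()
    ]

    # Stage 2: answer the sub-questions sequentially, feeding earlier answers
    # back into the context so later steps can build on them.
    context = "Problem: " + question
    answer = ""
    for sub_question in sub_questions:
        context += f"\nQ: {sub_question}\nA:"
        answer = complete(context).strip()
        context += " " + answer
    return answer  # the last sub-answer resolves the original problem
```

These methods can only go so far. Where the models fail at reasoning, tools followed by actions can be used to overcome these limitations. Using tools can follow 4 paradigms: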

An example of calling another model is PEER. This is an LLM trained to produce a plan of action and edit the input text at each step.

Similarly, Visual Language Models (VLMs) are trained on large-scale multimodal web corpora containing interleaved text and images, and they display few-shot learning capabilities on multimodal tasks. The other modalities are added to the model during training so that their representations are aligned with the LLM's. LLMs can also be conditioned on retrieved information; these are called retrieval-augmented LLMs.

One way LLMs can retrieve information is by querying search engines to enhance what the LLM generates.
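A rough sketch of this retrieval-augmented pattern, with `search` and `complete` as hypothetical stand-ins for a search-engine API and an LLM completion call:

```python
def retrieval_augmented_answer(question: str, search, complete, k: int = 3) -> str:
    """Query a search engine, then condition the LLM on the retrieved snippets."""
    snippets = search(question)[:k]                       # hypothetical search API
    evidence = "\n".join(f"- {snippet}" for snippet in snippets)
    prompt = (
        "Answer the question using the evidence below.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)                               # hypothetical LLM call
```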

ReAct combines information retrieval with the reasoning ability of LLMs, performing reasoning and acting in an interleaved manner.
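A minimal sketch of this interleaving (not the authors' implementation): the model alternates Thought/Action/Observation turns, and an external tool, here a hypothetical `search` function, supplies the observations.

```python
import re

def react_loop(question: str, complete, search, max_steps: int = 5) -> str:
    """Interleave model reasoning (Thought/Action) with tool observations."""
    transcript = (
        "Answer the question by alternating Thought, Action, and Observation steps.\n"
        "Available actions: Search[query] or Finish[answer].\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        step = complete(transcript)                  # model emits a Thought and an Action
        transcript += step + "\n"
        finish = re.search(r"Finish\[(.*?)\]", step)
        if finish:
            return finish.group(1)                   # final answer
        query = re.search(r"Search\[(.*?)\]", step)
        if query:
            observation = search(query.group(1))     # tool call produces the Observation
            transcript += f"Observation: {observation}\n"
    return ""                                        # gave up within the step budget
```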

The example below shows how ReAct performs on a question from HotpotQA.

Beyond vanilla information retrieval, letting LLMs search and navigate the web directly is another effective way to augment them, as demonstrated by WebGPT.

Combining LLMs with symbolic modules or code interpreters is another augmentation practice, which can equip transformer-based deep neural networks with symbolic reasoning ability.

The diagram below illustrates how Program-aided Language Models (PAL) help derive the correct answer with intermediate steps and Python code.
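A minimal PAL-style sketch (not the paper's code): the LLM is prompted to write Python for the word problem, and the Python interpreter, not the LLM, computes the final answer. `complete` is again a hypothetical completion call.

```python
def pal_answer(question: str, complete):
    """Ask the model for Python that stores its result in `answer`, then execute it."""
    prompt = (
        "Write Python code that solves the problem and stores the result in a "
        f"variable named `answer`.\n\nProblem: {question}\n\n```python\n"
    )
    code = complete(prompt).split("```")[0]   # keep only the generated code block
    namespace = {}
    exec(code, namespace)                     # caution: sandbox model-generated code in practice
    return namespace.get("answer")
```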

To sum up, through innovative integrations of external tools/modules, LMs are overcoming their limitations, showcasing remarkable versatility and improved performance in complex reasoning and computational tasks.

The augmented techniques above use tools to gather external information to improve performance of LLMs on a given task. There are also approaches that allow LLMs to act on the virtual or physical world.

The example below shows how researchers attempt to use LMs to control physical robots, which can be performed by prompting the model to write robot policy code using natural language commands.

While augmented LMs are a promising direction for future research, it is important to teach them how to reason, use tools, and act.

For prompt pre-training, here are some tips:

For bootstrapping, here are some tips:

If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

This paper explores the symbiotic relationship between LLMs and code, highlighting how integrating code into LLM training enhances their abilities. By incorporating code, LLMs gain reasoning capabilities, produce structured outputs, and leverage the feedback loop of code compilation and execution environments. This integration not only improves LLM performance in code generation but also extends their utility as intelligent agents, enabling them to understand instructions, decompose goals, plan and execute actions, and refine based on feedback, thus opening up new possibilities for complex natural language tasks.

Code Pretraining and Code Finetuning

Code Pretraining:

Code Finetuning:

  1. Strengthen LLMs’ Programming Skills
  2. Empower LLMs’ Complex Reasoning (Chain-of-Thought, Program-of-Thought)
  3. Enable LLMs to Capture Structured Knowledge

Connecting LLMs to other Functional Ends

Embedding LLMs into Code Execution Environment

Automated Feedback

Enhancing LLM’s Performance with Feedback

The feedback derived from code execution and external evaluation modules can enhance LLMs through three major approaches:
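For instance, an error message from a failing run can be appended to the prompt for another attempt. A rough sketch of such an execution-feedback loop (illustrative only; `complete` is a hypothetical LLM completion call):

```python
import subprocess
import tempfile

def generate_with_execution_feedback(task: str, complete, max_rounds: int = 3) -> str:
    """Regenerate code until it executes cleanly, feeding errors back into the prompt."""
    prompt = f"Write a Python script for the task below.\nTask: {task}\nScript:\n"
    code = ""
    for _ in range(max_rounds):
        code = complete(prompt)                               # hypothetical LLM call
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
            handle.write(code)
            script_path = handle.name
        result = subprocess.run(["python", script_path], capture_output=True, text=True)
        if result.returncode == 0:
            return code                                       # feedback: execution succeeded
        # Feedback: append the traceback so the model can repair its own output.
        prompt += f"\nThe previous attempt failed with:\n{result.stderr}\nFixed script:\n"
    return code
```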

Applications

Improvements brought about by code training in LLMs are firmly rooted in their practical operational steps. These steps include:

  1. Enhancing the IA’s (intelligent agent’s) decision-making in terms of
  2. Streamlining execution by
  3. Optimizing performance through feedback automatically derived from the code execution environment

Challenges

  1. The Causality between Code Pre-training and LLMs’ Reasoning Enhancement
  2. Acquisition of Reasoning Beyond Code
  3. Challenges of Applying the Code-centric Paradigm