Recent LLM basics

Efficiency / Basic LLM

In this session, our readings cover:

Required Readings:

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

More Readings:

Sparks of Large Audio Models: A Survey and Outlook

Blog:

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

This research work aims to address several questions about how large language models develop over the course of training and how those patterns change as models scale.

Contribution

The research aims to explore the development and evolution of large language models (LLMs) over the course of training, with a specific focus on understanding how these patterns change as the models scale. To achieve this, the study introduces Pythia, a suite consisting of 16 LLMs. These models are trained on public data in the exact same order but vary in size, ranging from 70M to 12B parameters. This diverse set of models allows for a comprehensive investigation into the impact of model size on the developmental trajectory of LLMs.

Additionally, the research contributes by providing public access to 154 checkpoints for each of the 16 models. These checkpoints serve as snapshots of the models at different stages of training, enabling researchers to examine their progression over time. Moreover, the study offers tools to download and reconstruct the exact training dataloaders used for training the models. This provision facilitates further study and analysis of the training data, offering insights into the learning process of LLMs.
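Since the 154 per-model checkpoints are publicly released, any intermediate stage of training can be inspected directly. Below is a minimal sketch of loading one such checkpoint with the Hugging Face transformers library; the model name and step revision follow the public EleutherAI release, and the specific size and step shown here are just illustrative choices.

```python
# Minimal sketch: load an intermediate Pythia checkpoint from the Hugging Face Hub.
# Model names and step revisions follow the public EleutherAI release; adjust as needed.
from transformers import AutoTokenizer, GPTNeoXForCausalLM

model_name = "EleutherAI/pythia-70m"   # sizes range from pythia-70m to pythia-12b
revision = "step3000"                  # one of the intermediate training checkpoints

model = GPTNeoXForCausalLM.from_pretrained(model_name, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)

inputs = tokenizer("The Pile is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```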

Overall, the research provides valuable resources and insights for the scientific community interested in understanding the development and behavior of large language models, shedding light on how these models evolve as they scale in size.

Models in the Pythia suite

Training data in Pythia

The Pile is a curated collection of English-language datasets designed specifically for training large language models. It offers several benefits, including public availability and coverage of a diverse range of domains.

The publication “The Pile: An 800GB dataset of diverse text for language modeling” by Gao et al. in 2020 provides further details about the dataset and its characteristics.

The authors trained two copies of the Pythia suite using identical architectures:

  1. One using the original Pile dataset consisting of 334 billion tokens.
  2. The other using a modified version of the Pile dataset, which underwent near-deduplication using MinHashLSH with a threshold of 0.87, resulting in a reduced dataset of 207 billion tokens.

This near-deduplication was carried out following the advice of Lee et al. (2021), who report that language models trained on deduplicated data perform better and memorize less of their training data.
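The paper only specifies MinHashLSH with a threshold of 0.87; the snippet below is a minimal sketch of that style of near-deduplication using the datasketch library. The shingle size, number of permutations, and overall pipeline are illustrative assumptions, not the exact procedure used to build the deduplicated Pile.

```python
# Sketch of MinHashLSH near-deduplication at threshold 0.87, in the spirit of the
# procedure described above. Shingle size and num_perm are illustrative choices.
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128, shingle=5):
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(1, len(tokens) - shingle + 1)):
        m.update(" ".join(tokens[i:i + shingle]).encode("utf-8"))
    return m

def near_deduplicate(docs, threshold=0.87, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, doc in enumerate(docs):
        m = minhash(doc, num_perm)
        if lsh.query(m):          # an approximate duplicate is already indexed
            continue
        lsh.insert(str(idx), m)
        kept.append(doc)
    return kept

docs = ["the quick brown fox jumps over the lazy dog"] * 2 + [
    "a completely different document about language models"
]
print(len(near_deduplicate(docs)))  # -> 2
```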

Model Architecture in Pythia

Model Training in Pythia

Overall, this training setup is optimized for efficiency and scalability, allowing for the effective training of large language models on powerful GPU hardware.

Evaluation of Pythia

Case Study: How Does Data Bias Influence Learned Behaviors?

Case Study: Does Training Order Influence Memorization?

The hypothesis posits that data encountered later in the training process will be memorized more by the model. To test this hypothesis, the researchers designed a method where they measured the memorization of an initial segment of each sequence in the training corpus. However, the results of their experiment contradicted the hypothesis. They found that the order in which data was encountered during training had little impact on the memorization patterns observed in the model. This unexpected result suggests that factors other than the chronological order of data presentation may play a more significant role in determining memorization behavior in large language models. Further research may be needed to explore these factors and their implications for model training and performance.
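A minimal sketch of this kind of memorization test is shown below: prompt the model with the first k tokens of a training sequence and check whether greedy decoding reproduces the next k tokens exactly. The context/continuation lengths, the model choice, and the placeholder text are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: test memorization of a training sequence via greedy decoding.
# Context/continuation lengths and the model choice are illustrative assumptions.
import torch
from transformers import AutoTokenizer, GPTNeoXForCausalLM

def is_memorized(model, token_ids, k=32):
    """Return True if greedy decoding of the first k tokens reproduces the next k."""
    prompt = token_ids[:k].unsqueeze(0)
    target = token_ids[k:2 * k]
    with torch.no_grad():
        generated = model.generate(prompt, max_new_tokens=k, do_sample=False)
    continuation = generated[0, k:2 * k]
    return torch.equal(continuation, target)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m").eval()

text = "the quick brown fox jumps over the lazy dog " * 12  # placeholder, >= 64 tokens
ids = tokenizer(text, return_tensors="pt").input_ids[0]
print(is_memorized(model, ids, k=32))
```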

Case Study: Do Pretraining Term Frequencies Influence Task Performance Throughout Training?

The correlation between average performance and term frequencies varies depending on the size of the model. Interestingly, this correlation becomes more pronounced in larger models, suggesting that it is an emergent property that becomes more prominent as the model size increases. This finding underscores the importance of considering model size when analyzing the relationship between model performance and the frequency of terms in the data. It implies that larger models may exhibit different behavior in this regard compared to smaller models, highlighting the need for careful consideration of model architecture and scale in natural language processing tasks.
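As a rough illustration of this analysis, one can correlate how often a term appears in the pre-training corpus with the model's average accuracy on instances involving that term. The counts and accuracies below are toy placeholders, and the choice of Spearman correlation is an assumption, not the paper's exact methodology.

```python
# Illustrative sketch: correlate pre-training term frequency with task accuracy.
# The counts and accuracies are toy placeholders, not results from the paper.
import numpy as np
from scipy.stats import spearmanr

# term -> (count in pre-training corpus, average accuracy on instances containing it)
stats = {
    "paris": (1_200_000, 0.82),
    "kinshasa": (45_000, 0.61),
    "reykjavik": (30_000, 0.58),
    "ouagadougou": (9_000, 0.41),
}

counts = np.array([c for c, _ in stats.values()], dtype=float)
accuracy = np.array([a for _, a in stats.values()])

rho, pvalue = spearmanr(np.log10(counts), accuracy)
print(f"Spearman correlation: {rho:.2f} (p={pvalue:.3f})")
```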

Sparks of Large Audio Models: A Survey and Outlook

Motivation

Foundational Audio Models

This model aggregates information from diverse data modalities, allowing it to capture a wide range of audio features and patterns. Once trained, it can be customized or fine-tuned to address various downstream audio tasks, such as speech recognition, speaker identification, emotion detection, and sound classification. By leveraging its ability to learn from multiple data sources and modalities, the model can adapt to different contexts and applications, making it versatile and adaptable for a variety of audio processing tasks.

Large Audio Models

Application

Speech processing:

Challenges:

Music signal processing:

Challenges:

Audio tasks

Speech Processing – AudioPalm

Music Signal Processing – WavJourney

Challenges

Data Issues (pre-training period):

Tokenization:

Computational Cost and Energy Requirements:

Limited context length:

Prompt Sensitivity:

Hallucination:

Ethics:

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Overview

Background of LLM Serving

Challenges

Taxonomy of LLM inference Advancements

Decoding Algorithm

  1. Auto-regressive vs. Non-auto-regressive Decoding
    • Auto-regressive: sequentially predict the next token in a sequence, given all previous tokens
    • Non-auto-regressive: decode output tokens in parallel by breaking or re-modelling word dependencies (generally not as reliable as auto-regressive decoding)
  2. Early Exiting
    • Utilize multi-layer architecture of existing LLMs
    • Adaptive Computation: Emit predictions based on internal classifiers instead of running the whole LLM
    • Insufficient Information: May not faithfully make accurate predictions
  3. Speculative Decoding (see the sketch after this list)
    • Uses smaller draft model
    • Allows parallel decoding
    • Verification and Fallback mechanism
  4. Cascade Inference
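Below is a minimal sketch of the speculative-decoding idea from item 3: a small draft model proposes several tokens, the larger target model verifies them in a single forward pass, and decoding falls back to the target's own prediction at the first mismatch. This greedy-verification variant is a simplification of the rejection-sampling schemes in the literature; the model pairing and proposal length are assumptions.

```python
# Sketch of greedy speculative decoding: a small draft model proposes k tokens,
# the target model verifies them in one forward pass, keeps the longest matching
# prefix, and falls back to its own token at the first mismatch.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

draft = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m").eval()
target = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m").eval()
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")  # shared tokenizer

@torch.no_grad()
def speculative_step(input_ids, k=4):
    # 1. Draft model proposes up to k tokens greedily.
    proposal = draft.generate(input_ids, max_new_tokens=k, do_sample=False)
    n_prompt = input_ids.shape[1]
    k = proposal.shape[1] - n_prompt          # may be shorter if EOS was generated
    # 2. Target model scores the whole proposed sequence in one forward pass.
    preferred = target(proposal).logits.argmax(dim=-1)  # target's greedy choice per position
    accepted = []
    for i in range(k):
        draft_token = proposal[0, n_prompt + i]
        target_token = preferred[0, n_prompt + i - 1]   # target's prediction for this position
        if draft_token != target_token:
            accepted.append(target_token)               # fallback: take the target's token
            break
        accepted.append(draft_token)                    # verified: keep the draft token
    return torch.cat([input_ids, torch.stack(accepted).unsqueeze(0)], dim=1)

ids = tokenizer("Large language models", return_tensors="pt").input_ids
for _ in range(5):
    ids = speculative_step(ids)
print(tokenizer.decode(ids[0]))
```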

Architecture Design

Model Compression

System Optimization

  1. Low-bit Quantization
    • Quantize-Aware Training (QAT)
    • Post-Training Quantization (PTQ) (see the sketch after this list)
  2. Parallel Computation
    • Model Parallelism
    • Decentralized Inference
  3. Memory Management
  4. Request Scheduling
  5. Kernel Optimization
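As promised above, here is a toy illustration of post-training quantization: a weight matrix is quantized to int8 with a single symmetric per-tensor scale, then dequantized to measure the reconstruction error. Real LLM PTQ schemes (per-channel scales, outlier handling, calibration data) are considerably more involved; this is only a sketch of the basic idea.

```python
# Toy post-training quantization (PTQ) sketch: symmetric per-tensor int8
# quantization of a weight matrix, plus the dequantization error.
import numpy as np

def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0                      # symmetric per-tensor scale
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("int8 storage:", q.nbytes, "bytes vs fp32:", w.nbytes, "bytes")
print("mean abs error:", np.abs(w - w_hat).mean())
```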

Future Direction

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Section based on the paper of the same name

Motivations

There has been great development in multi-modal large language models (MLLMs) in the past few years.

What are the best design choices when developing an MLLM?

Contributions

To answer these questions, the authors conduct a fine-grained ablation across the image encoder, the vision-language connector, and the pre-training data choices.

Based on their findings, they also create their family of MM1 models, which exhibit SOTA performance on captioning and visual question answering (VQA).

Ablation Setup

Ablation Motivations:

Ablation Testing and Results

Model Architecture Ablations: Vision-Language Connector

Data Ablations: Pre-training Data

As seen in 5.a (above):

As seen in 5.b (above):

The MM1 Model

Building the Model

Image-encoder:

Model Scaling

Initial Grid Search at Smaller Scales:

Utilized linear regression in log space, fit to the optimal learning rates found for the smaller models, to predict optimal peak learning rates at larger scales; the resulting fit is a power law in the number of parameters.
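A minimal sketch of that log-space fit is shown below, assuming a small table of (parameter count, best learning rate) pairs obtained from the small-scale grid search. The numbers here are placeholders, and the fitted constants MM1 actually reports are given in the paper.

```python
# Sketch of fitting a power law lr = a * N^b via linear regression in log space,
# using placeholder (parameter count, best peak learning rate) pairs from a
# small-scale grid search. These are not MM1's reported constants.
import numpy as np

params = np.array([9e6, 85e6, 302e6, 1.2e9])           # grid-search model sizes (placeholder)
best_lr = np.array([2.6e-3, 1.1e-3, 7.0e-4, 4.0e-4])   # best peak LR per size (placeholder)

b, log_a = np.polyfit(np.log(params), np.log(best_lr), deg=1)
predict_lr = lambda n: np.exp(log_a) * n ** b

print(f"fit: lr ~ {np.exp(log_a):.3g} * N^{b:.3f}")
print(f"predicted peak lr for a 30B-parameter model: {predict_lr(30e9):.2e}")
```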

Replaced traditional validation loss metrics with direct 8-shot task performance to optimize learning rates, focusing on real-world applicability.

Simple Scaling Rule for Weight Decay:

Introducing MoE to the scaling

Pre-Training Results

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning Data Mixture:

SFT Configuration and Evaluation:

Models are evaluated across 12 MLLM benchmarks

Scaling to Higher Image Resolutions:

Sub-image Decomposition for Even Higher Resolutions:

SFT Results

Conclusion