LLM Scaling Law and Efficiency

Efficiency

In this session, our readings cover:

Required Readings:

Scaling Laws for Neural Language Models

Efficient Large Language Models: A Survey

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

More Readings:

An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing

LIMA: Less Is More for Alignment


Paper 1: Efficient Large Language Models: A Survey

Introduction

Large Language Models (LLMs) represent a significant advancement in AI, capable of understanding and generating human languages. Prominent examples include OpenAI’s GPT-3 and GPT-4, Google’s Gemini, GLaM, PaLM, and Meta’s LLaMA-1 and LLaMA-2, among others like BLOOM, PanGu-Σ, and GLM. These models excel in various tasks such as natural language understanding, language generation, complex reasoning, and domain-specific applications like biomedicine, law, and code generation. Their remarkable performance stems from their massive scale, with billions or even trillions of parameters trained on vast and diverse datasets.

However, the impressive capabilities of LLMs come with substantial resource demands, both in terms of training time and inference costs. Larger models achieve better performance but require exponentially more GPU hours for training. Additionally, scaling up model size leads to lower inference throughput, posing challenges for wider adoption and cost-effective application deployment. To address these issues, there’s a pressing need to develop efficiency techniques for LLMs. For instance, Mistral-7B employs grouped-query attention and sliding window attention to enhance inference speed while maintaining comparable performance, demonstrating the feasibility and importance of efficiency optimizations for LLMs.

This survey aims to offer a comprehensive overview of technological advancements in efficient Large Language Models (LLMs) and summarize current research directions. The literature is categorized into three main areas: model-centric, data-centric, and framework-centric perspectives.

Model-Centric Methods:

Data-Centric Methods:

 + Demonstration selection: involves choosing representative examples for prompting, either through unsupervised methods such as nearest-neighbor selection or supervised methods that train domain-specific retrievers (see the sketch after this list). Demonstration ordering arranges the selected demonstrations into an appropriate prompt, which affects the model's performance. Template formatting focuses on designing concise prompt templates, including instruction generation to provide task context and multi-step reasoning to guide LLMs through intermediate steps.

 + Multi-step reasoning: also known as Chain-of-Thought (CoT) prompting, this guides LLMs through a sequence of intermediate steps before producing the final answer, improving generation quality. Techniques such as Auto-CoT, Self-Ask, ReAct, Least-to-Most Prompting, Tree-of-Thought, and Contrastive CoT address challenges in multi-step reasoning and help keep the reasoning process accurate. Parallel generation accelerates inference by guiding LLMs to first generate an answer template and then fill it in concurrently, improving hardware utilization and speed. Overall, few-shot prompting techniques improve inference efficiency by guiding LLMs through tasks with a small number of examples and optimized prompts.
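As a concrete illustration of unsupervised nearest-neighbor demonstration selection, here is a minimal sketch of our own (the toy hashing embedder stands in for a real sentence encoder and is not from the survey):

```python
import numpy as np

def embed(texts, dim=256):
    """Toy bag-of-words hashing embedder (stand-in for a real sentence encoder)."""
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            out[i, hash(tok) % dim] += 1.0
    return out

def select_demonstrations(query, pool, k=4):
    """Unsupervised nearest-neighbor selection: keep the k pool examples
    whose inputs are most similar to the query."""
    q = embed([query])[0]
    X = embed([ex["input"] for ex in pool])
    sims = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-8)
    order = np.argsort(-sims)[:k]
    # Demonstration ordering: least similar first, most similar closest to the query.
    return [pool[i] for i in order[::-1]]

def build_prompt(query, demos):
    """Template formatting: a simple Input/Output template around the demonstrations."""
    blocks = [f"Input: {d['input']}\nOutput: {d['output']}" for d in demos]
    return "\n\n".join(blocks + [f"Input: {query}\nOutput:"])
```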

LLM Frameworks:

The survey also gives short, bullet-point descriptions of several widely used frameworks for training and serving LLMs.

Paper 2: Scaling Laws for Neural Language Models

Introduction

The study emphasizes language as a natural domain for artificial intelligence research due to its suitability for expressing and evaluating reasoning tasks. Recent advancements in deep learning, particularly in language modeling, have led to models approaching human-level performance on various tasks, including generating coherent text samples. The study aims to empirically investigate how language modeling performance is affected by factors such as model architecture, model size, computing power, and dataset size. The analysis focuses on the Transformer architecture and observes precise power-law scalings for performance concerning training time, context length, dataset size, model size, and compute budget.

Summary of Findings :

The study finds that language modeling performance depends strongly on scale factors: the number of model parameters, the dataset size, and the compute used for training. These factors exhibit smooth power-law relationships with performance over a wide range of scales. Performance improves predictably when model size and dataset size are increased together, but returns diminish when only one factor is increased while the other is held fixed. Training curves follow predictable power laws, allowing rough predictions of performance from longer training. Transfer to different distributions incurs a constant penalty but otherwise improves roughly in line with training-set performance. Large models are more sample-efficient, and compute-efficient training involves training very large models and stopping significantly short of convergence. The optimal batch size is roughly determined by measuring the gradient noise scale. Overall, scaling up model size, data, and compute leads to smooth and predictable improvements in language modeling performance and sample efficiency.

Summary of Scaling Laws :

The test loss of a Transformer trained to autoregressively model language can be predicted with a power law when performance is limited by only one of the number of non-embedding parameters $N$, the dataset size $D$, or the optimally allocated compute budget $C_{\text{min}}$. The fitted laws are listed below, followed by a small numerical sketch.

Left: The early-stopped test loss $L(N, D)$ varies predictably with the dataset size $D$ and model size $N$ according to Equation (1.5). Right: After an initial transient period, learning curves for all model sizes $N$ can be fit with Equation (1.6), which is parameterized in terms of $S_{\text{min}}$, the number of steps when training at large batch size.

  1. For models with a limited number of parameters, trained to convergence on sufficiently large datasets: \(L(N) = \left( \frac{N_c}{N} \right)^{\alpha_N}; \quad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}\)

  2. For large models trained with a limited dataset with early stopping: \(L(D) = \left( \frac{D_c}{D} \right)^{\alpha_D}; \quad \alpha_D \approx 0.095, \quad D_c \approx 5.4 \times 10^{13}\)

  3. When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized model, and a sufficiently small batch size (making optimal use of compute): \(L(C_{\text{min}}) = \left( \frac{C_c^{\text{min}}}{C_{\text{min}}} \right)^{\alpha_C^{\text{min}}}; \quad \alpha_C^{\text{min}} \approx 0.050, \quad C_c^{\text{min}} \approx 3.1 \times 10^{8} \text{ (PF-days)}\)

  4. The critical batch size, which determines the speed/efficiency tradeoff for data parallelism: \(B_{\text{crit}}(L) = \frac{B^*}{L^{1/\alpha_B}}; \quad B^* \approx 2 \times 10^8 \text{ tokens}, \quad \alpha_B \approx 0.21\)

  5. The equation combining (1.1) and (1.2) that governs the simultaneous dependence on $N$ and $D$ and governs the degree of overfitting: \(L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}\)

  6. When training a given model for a finite number of parameter update steps $S$ in the infinite data limit: \(L(N, S) = \left( \frac{N_c}{N} \right)^{\alpha_N} + \left( \frac{S_c}{S_{\text{min}}(S)} \right)^{\alpha_S}; \quad S_c \approx 2.1 \times 10^3, \quad \alpha_S \approx 0.76\)

  7. When training within a fixed compute budget $C$, but with no other constraints: \(N \propto C_{\text{min}}^{\alpha_C^{\text{min}}/\alpha_N}\), \(B \propto C_{\text{min}}^{\alpha_C^{\text{min}}/\alpha_B}\), \(S \propto C_{\text{min}}^{\alpha_C^{\text{min}}/\alpha_S}\), \(D = B \cdot S\)

  8. The calculation for $\alpha_C^{\text{min}}$: \(\alpha_C^{\text{min}} = \frac{1}{\frac{1}{\alpha_S} + \frac{1}{\alpha_B} + \frac{1}{\alpha_N}}\)
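As a quick numerical illustration of these fits (our own sketch using only the constants quoted above, not code from the paper):

```python
# Sketch: evaluate the fitted scaling laws with the constants listed above.
# Units: N in non-embedding parameters, D in tokens, loss in nats.
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13

def loss_from_params(n):
    """Item 1: L(N) for models trained to convergence on enough data."""
    return (N_C / n) ** ALPHA_N

def loss_from_data(d):
    """Item 2: L(D) for large models trained with early stopping."""
    return (D_C / d) ** ALPHA_D

def loss_joint(n, d):
    """Item 5 / Equation (1.5): joint dependence on N and D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10):
        print(f"N={n:.0e}  L(N)={loss_from_params(n):.3f}  "
              f"L(N, D=1e10)={loss_joint(n, 1e10):.3f}")
```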

Notation:

We use the following notation:

 + $L$ – the cross-entropy loss in nats, averaged over the tokens in a context
 + $N$ – the number of non-embedding model parameters
 + $D$ – the dataset size in tokens
 + $C \approx 6NBS$ – an estimate of the total non-embedding training compute, where $B$ is the batch size and $S$ the number of training steps
 + $B_{\text{crit}}$ – the critical batch size, at which training roughly optimizes the trade-off between time and compute efficiency
 + $C_{\text{min}}$ – the minimum compute needed to reach a given loss
 + $S_{\text{min}}$ – the minimum number of training steps needed to reach a given loss
 + $\alpha_X$ – power-law exponents for the scaling of the loss, $L(X) \propto 1/X^{\alpha_X}$, where $X$ can be $N$, $D$, $C$, $S$, $B$, or $C^{\text{min}}$

Model Performance :

To characterize language model scaling, we train a wide variety of models, varying a number of factors including:

 + Model size (ranging from 768 to 1.5 billion non-embedding parameters)
 + Dataset size (ranging from 22 million to 23 billion tokens)
 + Shape (including depth, width, attention heads, and feed-forward dimension)
 + Context length (1024 for most runs, though we also experiment with shorter contexts)
 + Batch size ($2^{19}$ tokens for most runs, but we also vary it to measure the critical batch size)

Performance depends very mildly on model shape when the total number of non-embedding parameters $N$ is held fixed. The loss varies only a few percent over a wide range of shapes, and small differences in parameter counts are compensated for by using the fit to $L(N)$ as a baseline. Aspect ratio in particular can vary by a factor of 40 while only slightly impacting performance; a model with $(n_{layer}, d_{model}) = (6, 4288)$ reaches a loss within 3% of the $(48, 1600)$ model used in [RWC+19].

Transformer performance exhibits weak dependence on the architectural parameters such as the number of layers $n_{layer}$, the number of attention heads $n_{heads}$, and the feed-forward dimension $d_{ff}$, provided that the total non-embedding parameter count $N$ remains fixed. To verify this, experiments were conducted by keeping one hyperparameter fixed while varying another. For instance, when investigating $n_{heads}$, models with fixed size were trained, and only the number of attention heads was altered. Similarly, experiments involving $n_{layer}$ and $d_{ff}$ were conducted by adjusting $d_{model}$ accordingly to maintain a constant $N \approx 12 \times n_{layer} \times d_{model}^{2}$. The observed weak sensitivity to these shape parameters suggests that deeper Transformers may function akin to ensembles of shallower models, analogous to observations made with ResNets.
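To make the $N \approx 12\, n_{layer}\, d_{model}^{2}$ bookkeeping concrete, here is a small sketch of our own (not code from the paper) comparing the two shapes mentioned above:

```python
def non_embedding_params(n_layer, d_model):
    """Approximate non-embedding parameter count of a decoder-only Transformer:
    roughly 4*d_model^2 for attention plus 8*d_model^2 for the feed-forward
    block per layer, i.e. N ~ 12 * n_layer * d_model^2."""
    return 12 * n_layer * d_model ** 2

shallow_wide = non_embedding_params(6, 4288)    # ~1.3e9 parameters
deep_narrow  = non_embedding_params(48, 1600)   # ~1.5e9 parameters

print(f"(n_layer, d_model) = (6, 4288):  N ~ {shallow_wide:.2e}")
print(f"(n_layer, d_model) = (48, 1600): N ~ {deep_narrow:.2e}")
# The two shapes differ enormously in aspect ratio yet have similar N,
# and the paper reports their losses differ by only about 3%.
```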

Left: When we include embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. Right: When we exclude embedding parameters, the performance of models with different depths converges to a single trend. Only models with fewer than 2 layers or with extreme depth-to-width ratios deviate significantly from the trend.

The trend observed with the non-embedding parameter count $N$ follows a steady pattern, which can be approximated by the first term of Equation (1.5), yielding:

$L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}$

To understand these trends, it is essential to analyze performance as a function of $N$. When the total parameter count, including embedding parameters, is used instead, the trend becomes somewhat obscured. This implies that the size of the embedding matrix can potentially be reduced without affecting performance, as demonstrated in recent studies. Despite being trained on the WebText2 dataset, these models exhibit test-loss trends on various other datasets that also follow a power law in $N$ with nearly identical exponents.

Proposed Equation:

We have chosen the parameterization (1.5) (repeated here for convenience):

\[L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}}+\frac{D_c}{D}\right]^{\alpha_D}\]

using three principles:

  1. Changes in vocabulary size or tokenization are expected to rescale the loss by an overall factor. The parameterization of $L(N, D)$ (and all models of the loss) must naturally allow for such a rescaling.

  2. Fixing $D$ and sending $N \rightarrow \infty$, the overall loss should approach $L(D)$. Conversely, fixing $N$ and sending $D \rightarrow \infty$, the loss must approach $L(N)$.

  3. $L(N, D)$ should be analytic at $D = \infty$, so that it has a series expansion in $\frac{1}{D}$ with integer powers. Theoretical support for this principle is significantly weaker than for the first two.

The early-stopped test loss $L(N, D)$ depends predictably on the dataset size $D$ and model size $N$ according to Equation (1.5).

Left: For large $D$, performance is a straight power law in $N$. For a smaller fixed $D$, performance stops improving as $N$ increases and the model begins to overfit. (The reverse is also true, see Figure 4.) Right: The extent of overfitting depends predominantly on the ratio $N^{\alpha_N/\alpha_D}/D$, as predicted in Equation (4.3). The line is our fit to that equation.

Our choice of $L(N, D)$ satisfies the first requirement because we can rescale $N_c$, $D_c$ with changes in the vocabulary. This also implies that the values of $N_c$, $D_c$ have no fundamental meaning.

Since we stop training early when the test loss ceases to improve and optimize all models in the same way, we expect that larger models should always perform better than smaller models. But with fixed finite $D$, we also do not expect any model to be capable of approaching the best possible loss (i.e., the entropy of text). Similarly, a model with fixed size will be capacity-limited. These considerations motivate our second principle. Note that knowledge of $L(N)$ at infinite $D$ and $L(D)$ at infinite $N$ fully determines all the parameters in $L(N, D)$.

The third principle is more speculative. There is a simple and general reason one might expect overfitting to scale $\propto \frac{1}{D}$ at very large \(D\). Overfitting should be related to the variance or the signal-to-noise ratio of the dataset, and this scales as $\frac{1}{D}$. This expectation should hold for any smooth loss function since we expect to be able to expand the loss about the $D \rightarrow \infty$ limit. However, this argument assumes that $\frac{1}{D}$ corrections dominate over other sources of variance, such as the finite batch size and other limits on the efficacy of optimization. Without empirical confirmation, we would not be very confident of its applicability. Our third principle explains the asymmetry between the roles of $N$ and $D$ in Equation (1.5). Very similar symmetric expressions are possible, but they would not have a $\frac{1}{D}$ expansion with integer powers and would require the introduction of an additional parameter.
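As a quick check of the integer-power expansion (our own algebra, not taken from the paper), write $A = (N_c/N)^{\alpha_N/\alpha_D}$ and expand Equation (1.5) around $D \rightarrow \infty$:

\[L(N, D) = A^{\alpha_D}\left(1 + \frac{D_c}{A D}\right)^{\alpha_D} = L(N) + \alpha_D D_c\, A^{\alpha_D - 1}\cdot\frac{1}{D} + O\!\left(\frac{1}{D^2}\right),\]

so the leading overfitting correction is indeed proportional to $1/D$, with only integer powers of $1/D$ appearing in the expansion.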

In any case, we will see that our equation for $L(N, D)$ fits the data well, which is the most important justification for our $L(N, D)$ ansatz.

Optimal Allocation of the Compute Budget :

We displayed the empirical trend of performance as a function of the computation used during training in the top-right of Figure 1. However, this result involved training at a fixed batch size $B$, whereas we know that we could in fact train more efficiently by training at the batch size $B_{\text{crit}}$ discussed in Section 5.1. Large and small values of the loss could have been achieved with fewer samples or fewer steps, respectively, and correcting for this inefficiency by standardizing to the critical batch size results in cleaner and more predictable trends. In this section, we adjust for this oversight. More importantly, we use the results of Section 5 to determine the optimal allocation of compute between model size $N$ and the quantity of data processed during training, namely $2B_{\text{crit}}S_{\text{min}}$. We determine this allocation both empirically and theoretically, using the equation for $L(N, S_{\text{min}})$, and we demonstrate that these methods agree.

Left: Given a fixed compute budget, a particular model size is optimal, though somewhat larger or smaller models can be trained with minimal additional compute. Right: Models larger than the compute-efficient size require fewer steps to train, allowing for potentially faster training if sufficient additional parallelism is possible. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve, after initial transient effects.
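A small sketch of our own, derived only from items 7 and 8 of the scaling-law summary and the exponents quoted there, shows what these theoretical compute-allocation exponents work out to numerically:

```python
# Theoretical compute-allocation exponents implied by items 7 and 8 above.
ALPHA_N, ALPHA_B, ALPHA_S = 0.076, 0.21, 0.76

alpha_c_min = 1.0 / (1.0 / ALPHA_S + 1.0 / ALPHA_B + 1.0 / ALPHA_N)

print(f"alpha_C^min ~ {alpha_c_min:.3f}")                # about 0.05
print(f"N grows as C_min^{alpha_c_min / ALPHA_N:.2f}")   # most compute -> bigger models
print(f"B grows as C_min^{alpha_c_min / ALPHA_B:.2f}")   # batch size grows modestly
print(f"S grows as C_min^{alpha_c_min / ALPHA_S:.2f}")   # serial steps grow very slowly
```

Under these fits, most of an increased compute budget goes into model size, with batch size growing modestly and the number of serial steps growing very slowly, consistent with the discussion above.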

Conclusion:

We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count $N$, dataset size $D$, and optimized training computation $C_{\text{min}}$, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with $N$, $D$, $C_{\text{min}}$ are power-laws, there are diminishing returns with increasing scale.

(Defining words using the wc utility, the WebText2 dataset has 1.4 tokens per word and 4.3 characters per token. After this work was completed, [RRBS19a] also appeared, making similar predictions for the dependence of loss on both model and dataset size.)

We were able to precisely model the dependence of the loss on $N$ and $D$, and alternatively on $N$ and $S$, when these parameters are varied simultaneously. We used these relations to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models. So our scaling relations go beyond mere observation to provide a predictive framework. One might interpret these relations as analogues of the ideal gas law, which relates the macroscopic properties of a gas in a universal way, independent of most of the details of its microscopic constituents.

It is natural to conjecture that the scaling relations will apply to other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings as well. To this purpose, it will be interesting to test these relations on other domains, such as images, audio, and video models, and perhaps also for random network distillation. At this point we do not know which of our results depend on the structure of natural language data, and which are universal. It would also be exciting to find a theoretical framework from which the scaling relations can be derived: a ‘statistical mechanics’ underlying the ‘thermodynamics’ we have observed. Such a theory might make it possible to derive other more precise predictions and provide a systematic understanding of the limitations of the scaling laws.

In the domain of natural language, it will be important to investigate whether continued improvement on the loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major qualitative improvements: “more is different”. For example, the smooth aggregate growth of the economy provides no indication of the specific technological developments that underwrite it. Similarly, the smooth improvements in language model loss may hide seemingly qualitative changes in capability.

Our results strongly suggest that larger models will continue to perform better and will also be much more sample efficient than has been previously appreciated. Big models may be more important than big data. In this context, further investigation into model parallelism is warranted. Deep models can be trained using pipelining [HCC+18], which splits parameters depth-wise between devices, but eventually requires increased batch sizes as more devices are used. Wide networks, on the other hand, are more amenable to parallelization [SCP+18], since large layers can be split between multiple workers with less serial dependency. Sparsity [CGRS19, GRK17] or branching (e.g., [KSH12]) may allow for even faster training of large networks through increased model parallelism. And using methods like [WRH17, WYL19], which grow networks as they train, it might be possible to remain on the compute-efficient frontier for an entire training run.

Paper 3: LIMA: Less Is More for Alignment

Introduction

This paper discusses the use of language models trained at a large scale for various language understanding and generation tasks. Alignment methods have been proposed to fine-tune these models for specific tasks, typically requiring significant compute and specialized datasets. However, the authors demonstrate that strong performance can be achieved with just 1,000 carefully curated training examples. The hypothesis is that alignment mostly teaches the model how to interact with users in terms of style or format, leveraging knowledge acquired during pretraining. To test this, the authors curate examples approximating real user prompts and responses, including questions and answers from community forums and manually written prompts and responses. They then fine-tune a large pretrained model (LLaMa 65B) on this dataset, producing LIMA.

Comparative evaluations against other models show that LIMA outperforms some state-of-the-art models in human preference studies, achieving equal or preferable responses in a significant percentage of cases. A detailed analysis of LIMA’s responses reveals high adherence to prompt requirements and a considerable proportion of excellent responses. Further experiments highlight the importance of prompt diversity and data quality over sheer quantity in improving model performance. Additionally, despite lacking dialogue examples, LIMA shows competence in conducting coherent multi-turn dialogues, which can be significantly enhanced with a small number of hand-crafted dialogue chains added to the training data.

Overall, these findings underscore the effectiveness of pretraining compared to other approaches such as large-scale instruction tuning and reinforcement learning, showcasing the potential of pretrained models even with limited fine-tuning data.

Alignment: Superficial Alignment Hypothesis

The Superficial Alignment Hypothesis posits that a model’s knowledge and capabilities are predominantly acquired during pretraining, while alignment primarily teaches the model which distribution of formats to utilize when interacting with users. If this hypothesis holds true, and alignment is mainly about learning style, then a corollary suggests that a pretrained language model could be adequately fine-tuned with a relatively small set of examples. To explore this, the authors gather a dataset comprising 1,000 prompts and responses. The responses exhibit stylistic alignment with each other, while the prompts are diverse. These examples are sourced from various platforms, primarily community Q&A forums and manually created instances. Additionally, a test set of 300 prompts and a development set of 50 are collected for evaluation purposes. Table 1 provides an overview of the different data sources along with some statistics.

Alignment Data Types

To build this dataset, data from Stack Exchange, wikiHow, and Reddit are mined. Stack Exchange and wikiHow provide well-aligned content suitable for automated extraction, while Reddit requires manual curation due to its more casual nature. Manually authored examples add further diversity, including natural language generation tasks and prompts with varying tones; this manual curation aims to enhance diversity and quality, in contrast to automatic methods that prioritize quantity.

Training LIMA

LIMA (Less Is More for Alignment) is trained using the LLaMa 65B model and fine-tuned on a 1,000-example alignment training set. A special end-of-turn token (EOT) is introduced to differentiate between speakers. Standard fine-tuning hyperparameters are followed, including 15 epochs of training using AdamW optimizer, with batch size set to 32 examples. Residual dropout is applied, starting at 0.0 at the bottom layer and linearly increasing to 0.3 at the last layer. Manual selection of checkpoints is performed based on development set evaluation. LIMA is evaluated against state-of-the-art models and demonstrates superior performance compared to RLHF-based DaVinci003 and a 65B-parameter reproduction of Alpaca trained on 52,000 examples, often producing better or equal responses to GPT-4. The fact that simple fine-tuning on a small dataset can compete with state-of-the-art models supports the Superficial Alignment Hypothesis, emphasizing the importance of pretraining over large-scale instruction tuning and reinforcement learning approaches.
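A minimal sketch of this fine-tuning setup follows (our own illustration; `model.transformer_blocks` and the learning rate are placeholders, since only the epochs, batch size, optimizer, dropout schedule, and EOT token are stated here):

```python
import torch

EOT_TOKEN = "<EOT>"     # special end-of-turn token separating speakers
EPOCHS = 15
BATCH_SIZE = 32         # examples per batch

def residual_dropout(layer_idx, n_layers, p_bottom=0.0, p_top=0.3):
    """Residual dropout rises linearly from 0.0 at the bottom layer to 0.3 at the top."""
    return p_bottom + (p_top - p_bottom) * layer_idx / max(n_layers - 1, 1)

def configure_lima_finetuning(model, lr):
    """Set the per-layer dropout schedule and build the AdamW optimizer.
    `model.transformer_blocks` is a hypothetical attribute of the wrapped LLaMa
    model; the learning rate is not specified in this summary."""
    for i, block in enumerate(model.transformer_blocks):
        block.residual_dropout.p = residual_dropout(i, len(model.transformer_blocks))
    return torch.optim.AdamW(model.parameters(), lr=lr)
```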

Human evaluation is done by comparing LIMA with state-of-the-art models. LIMA surpasses OpenAI’s DaVinci003 and a 65B-parameter Alpaca reproduction and frequently matches or exceeds GPT-4’s performance. Analysis of LIMA’s outputs reveals that 50% are deemed excellent. This success with minimal fine-tuning supports the Superficial Alignment Hypothesis, emphasizing the dominance of pretraining over extensive instruction tuning and reinforcement learning methods.

Experimental Setup and Results

To evaluate LIMA against other models, we compare single responses generated for each test prompt and ask both crowd workers and GPT-4 to assess preference. Five baselines are compared: Alpaca 65B, OpenAI’s DaVinci003, Google’s Bard, Anthropic’s Claude, and OpenAI’s GPT-4, all sampled from April 2023. For generation, nucleus sampling with a probability of 0.9 and temperature of 0.7 is employed, with a repetition penalty and a maximum token length of 2048.
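For concreteness, the decoding settings above could be realized as follows (an illustrative sketch using the Hugging Face transformers API; the checkpoint path and the repetition-penalty value are assumptions, since neither is specified in this summary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/lima-checkpoint"          # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("How do I start a vegetable garden?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,               # nucleus sampling probability
    temperature=0.7,
    repetition_penalty=1.2,  # placeholder: exact value not stated in this summary
    max_length=2048,         # maximum token length
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```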

Annotators are presented with a prompt and two responses from different models, tasked with labeling the better response or indicating no significant difference. This methodology is mirrored with GPT-4 for comparison. Inter-annotator agreement is computed using tie-discounted accuracy, with high agreement observed among human annotators: 82% for crowd-crowd, 81% for crowd-author, and 78% for author-author. Agreement between GPT-4 and humans is also notable, with scores of 78% for crowd-GPT and 79% for author-GPT.

Despite the subjectivity inherent in the task, there is strong agreement among human annotators. GPT-4’s agreement with human annotations suggests its performance is on par with human judgment, essentially passing the Turking Test for this evaluation task.

The figure on the left presents results from the human preference study, while the figure on the right shows GPT-4 preferences, with similar trends observed in both. Despite Alpaca 65B being trained on significantly more data, LIMA tends to produce more preferable outputs. DaVinci003, trained with RLHF, also falls short of LIMA’s performance. Bard occasionally outperforms LIMA, but LIMA matches or exceeds Bard’s performance 58% of the time. Although Claude and GPT-4 generally perform better, LIMA occasionally produces superior responses, with even GPT-4 preferring LIMA outputs in 19% of cases.

Why Less is More?

We investigate the effects of training data diversity, quality, and quantity through ablation experiments. We observe that, for the purpose of alignment, scaling up input diversity and output quality has measurable positive effects, while scaling up quantity alone might not.

Experiment Setup: We fine-tune a 7B-parameter LLaMa model on various datasets, maintaining consistent hyperparameters. Responses are graded for helpfulness on a Likert scale by ChatGPT (GPT-3.5 Turbo), with results reported alongside confidence intervals.

Diversity: We compare Stack Exchange and wikiHow data to examine prompt diversity’s impact on response quality. Stack Exchange offers diverse prompts with excellent responses, while wikiHow presents homogeneous prompts. Training on Stack Exchange significantly outperforms wikiHow, indicating the importance of diversity.

Quality: We assess response quality by comparing models trained on Stack Exchange data with and without quality filters. The filtered dataset yields significantly better performance, highlighting the importance of response quality.

Quantity: Despite common wisdom suggesting increased training data improves performance, doubling the training set size from Stack Exchange does not enhance response quality. This finding suggests alignment’s scaling laws prioritize prompt diversity and high-quality responses over sheer quantity.

Task Generalization Capability

LIMA’s performance in multi-turn dialogue with just 1,000 single-turn interactions is assessed across 10 live conversations, with responses categorized as Fail, Pass, or Excellent. While responses demonstrate surprising coherence for a zero-shot chatbot, LIMA struggles to follow the prompt within 3 interactions in 6 out of 10 conversations.

To enhance LIMA’s conversational abilities, 30 multi-turn dialogue chains are collected, with 10 authored by the researchers and 20 derived from Stack Exchange comments. A new version of LIMA is fine-tuned using these examples, leading to significant improvements in generation quality. The proportion of excellent responses increases from 45.2% to 76.1%, while the failure rate drops noticeably. In addition, the fine-tuned model outperforms the zero-shot model in 7 out of 10 conversations, suggesting that limited supervision can invoke learned capabilities from pretraining.

The figure above shows an example multi-turn dialogue used to check task generalization capability.

Limitations and Conclusion

Fine-tuning a strong pretrained language model on 1,000 carefully curated examples can produce remarkable, competitive results on a wide range of prompts.

Limitations: The mental effort required to construct such examples is significant and difficult to scale up. LIMA is also not as robust as product-grade models: while it typically generates good responses, an unlucky sample during decoding or an adversarial prompt can lead to a weak response.

Paper 4: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Introduction

Large Language Models (LLMs) have seen significant growth in size and capabilities, leading to challenges in deployment and environmental concerns due to high energy consumption. Post-training quantization, reducing precision to create low-bit models for inference, has emerged as a solution. However, recent work on 1-bit model architectures like BitNet presents a promising direction for reducing LLM costs while maintaining performance. BitNet’s matrix multiplication involves only integer addition, saving energy costs and enabling faster computation.

In addition to computation, transferring model parameters during inference can be expensive. BitNet and its variant, BitNet b1.58, significantly reduce memory footprint and loading time from DRAM, leading to more efficient inference. BitNet b1.58 introduces an additional value of 0, offering stronger modeling capability by supporting feature filtering and matching full precision baselines in perplexity and end-task performance from a smaller size.

What is 1.58-bit?

BitNet b1.58 is based on the BitNet architecture, a Transformer that replaces nn.Linear with BitLinear. It is trained from scratch, with 1.58-bit weights and 8-bit activations; each weight takes one of the three values $\{-1, 0, +1\}$, carrying $\log_2 3 \approx 1.58$ bits of information, which gives the model its name. Compared to the original BitNet, it introduces some modifications that we summarize below.

Quantization Function. To constrain the weights to -1, 0, or +1, we adopt an absmean quantization function. It first scales the weight matrix by its average absolute value, and then rounds each value to the nearest integer in $\{-1, 0, +1\}$:

\[W_f = \text{RoundClip}\left( \frac{W}{\gamma + \epsilon}, -1, 1\right), \\ \text{RoundClip}(x, a, b) = \max(a, \min(b,\text{round}(x))), \\ \gamma = \frac{1}{nm} \sum_{i,j} |W_{ij}|.\]

The quantization function for activations follows the same implementation as in BitNet, except that we do not scale the activations before the non-linear functions to the range $[0, Q_b]$. Instead, the activations are all scaled to $[-Q_b, Q_b]$ per token to get rid of zero-point quantization. This is more convenient and simpler for both implementation and system-level optimization, while introducing negligible effects on performance in our experiments.
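A minimal numpy sketch of the two quantizers described above (our own illustration of the formulas, not the authors' implementation; the scaling factors are kept so the matmul output can be rescaled):

```python
import numpy as np

def absmean_weight_quant(W, eps=1e-6):
    """Ternary (1.58-bit) weight quantization: scale by the mean absolute
    value, then round-and-clip each entry to {-1, 0, +1}."""
    gamma = np.mean(np.abs(W))
    W_q = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_q, gamma                      # gamma is kept for rescaling

def per_token_activation_quant(X, bits=8, eps=1e-6):
    """8-bit activation quantization: scale each token (row) into
    [-Q_b, Q_b] with Q_b = 2^(bits-1), with no zero point."""
    Qb = 2 ** (bits - 1)
    scale = Qb / (np.max(np.abs(X), axis=-1, keepdims=True) + eps)
    X_q = np.clip(np.round(X * scale), -Qb, Qb)
    return X_q, scale

# Toy usage: with ternary weights the matmul reduces to additions/subtractions.
W = np.random.randn(4, 8)
X = np.random.randn(2, 8)
W_q, gamma = absmean_weight_quant(W)
X_q, s = per_token_activation_quant(X)
Y = (X_q @ W_q.T) * gamma / s              # approximate dequantized output
```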

Results

We compared BitNet b1.58 to our reproduced FP16 LLaMA LLM across various sizes, pre-training them on the RedPajama dataset for 100 billion tokens. Zero-shot performance was evaluated on tasks like ARC-Easy, ARC-Challenge, Hellaswag, Winogrande, PIQA, OpenBookQA, and BoolQ, alongside reporting validation perplexity on the WikiText2 and C4 datasets. The runtime GPU memory and latency were also compared using FasterTransformer, optimized for LLM inference latency on GPU devices. BitNet b1.58 matched the full-precision LLaMA LLM at 3B model size in perplexity, while being 2.71 times faster and using 3.55 times less GPU memory. At 3.9B model size, BitNet b1.58 was 2.4 times faster, consumed 3.32 times less memory, and performed significantly better than LLaMA LLM 3B.

Decoding latency (Left) and memory consumption (Right) of BitNet b1.58 varying the model size.

We scaled up the model size to 7B, 13B, and 70B and evaluated the cost. The figure above illustrates the trends of latency and memory, showing that the speed-up increases as the model size scales. In particular, BitNet b1.58 70B is 4.1 times faster than the LLaMA LLM baseline. This is due to the growing time cost for nn.Linear with the model size. The memory consumption follows a similar trend, as the embedding remains full precision and its memory proportion is smaller for larger models. Both latency and memory were measured with a 2-bit kernel, indicating potential for further optimization to reduce the cost.

We estimate the energy consumption of arithmetic operations for both BitNet b1.58 and LLaMA LLM, focusing on matrix multiplication, which contributes most to LLM costs. Figure above shows the energy cost composition. BitNet b1.58 primarily involves INT8 addition calculations, while LLaMA LLM includes both FP16 addition and multiplication. Based on [Hor14, ZZL22], BitNet b1.58 saves 71.4 times energy consumption for matrix multiplication on 7nm chips. Additionally, we report end-to-end energy costs for models with 512 tokens. Our findings reveal that as model size increases, BitNet b1.58 becomes increasingly more energy-efficient compared to the FP16 LLaMA LLM baseline. This is attributed to nn.Linear’s growing percentage with model size, while costs from other components decrease for larger models.

Discussion and Conclusion

1-bit Mixture-of-Experts (MoE) LLMs offer a cost-effective solution by reducing computation FLOPs, addressing challenges of high memory consumption and inter-chip communication. The reduced memory footprint of 1.58-bit LLMs enables easier deployment of MoE models, reducing overhead in transferring activations across networks. BitNet b1.58 facilitates native support for long sequences by reducing activations from 16 bits to 8 bits, potentially doubling the context length with the same resources. This advancement can further be compressed to 4 bits or lower, enhancing long sequence handling capabilities. The use of 1.58-bit LLMs holds promise for improving language model performance on edge and mobile devices, overcoming limitations in memory and computational power. These devices can now deploy LLMs, expanding their application scope and enabling new functionalities. Additionally, 1.58-bit LLMs are well-suited for CPU devices commonly used in edge and mobile devices, enhancing their performance. Recent advancements in hardware like Groq demonstrate potential for building specific hardware for LLMs. There’s a call for designing new hardware optimized for 1-bit LLMs, leveraging the unique computation paradigm enabled by BitNet.