Open Source LLM - Mistral Data preparation

BasicLLM

In this session, our readings cover:

Required Readings:

Mistral 7B

More Readings:

OLMo: Accelerating the Science of Language Models

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model and its framework to build and study the science of language modeling. Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation.

Mixtral of Experts

- Llama 2: Open Foundation and Fine-Tuned Chat Models

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Blog: Section 1: The Pile

In this section, we are going to introduce a paper: The Pile, an open-source dataset of diverse text for language modeling.

Motivation

Their work is driven by several key considerations. As the size of Large Language Models (LLMs) continues to expand rapidly, so does the need for vast amounts of data to effectively train these models. However, major players in the tech industry, such as Google and OpenAI, tend to keep their models and data closely guarded due to their commercial interests. Inspired by the principles of open-source software, they advocate for a similar ethos in the realm of LLMs. Open-sourcing data offers numerous advantages, including enhanced accessibility, opportunities for community collaboration, and the establishment of robust benchmarking and evaluation standards.

In line with this philosophy, various open-source datasets already exist on the internet, including The Common Crawl, RefinedWeb, Starcoder Data, and C4. However, in this section, they introduce a new and unique addition: The Pile. Their primary objective with The Pile is to enhance data diversity, thereby enriching the dataset’s capabilities for modeling and training.

The Pile Components

The Pile comprises an 800GB dataset curated from 22 diverse datasets, covering a wide range of domains such as Academic, Internet, Prose, Dialogue, and Miscellaneous. The composition of The Pile by category is illustrated in Figure 1, with a more detailed breakdown provided in Figure 2. This comprehensive coverage ensures that The Pile encompasses a broad spectrum of datasets.

Furthermore, let’s examine the structural statistics of the data. Firstly, the majority of documents in The Pile remain short, typically less than 10k bytes. However, there is also a long tail, indicating a small number of documents with lengths extending up to 60k bytes. Secondly, from a linguistic perspective, 97.4% of The Pile’s dataset is in English. While The Pile aims to be multilingual-friendly, future expansion efforts will be necessary to achieve this goal.

Benchmark Models with The Pile

In this study, bits per UTF-8 encoded byte (BPB), a perplexity-style metric that measures how well a model predicts the next token, is used for evaluation. GPT-2/3 models are employed to assess The Pile. Remarkably, as illustrated in the figure, performance improves progressively as the number of model parameters grows, even though the GPT-2/3 models were not trained on The Pile. This finding, observed as early as 2020, underscores the significance of the study's results at the time of its publication.
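To make the metric concrete, here is a minimal sketch (not code from the paper) of converting an average per-token cross-entropy loss into bits per UTF-8 byte; the token count and loss value below are purely illustrative assumptions.

```python
import math

def bits_per_byte(loss_nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
    """Convert average cross-entropy (nats per token) into bits per UTF-8 byte.

    Rescaling by the token-to-byte ratio lets models with different
    tokenizers be compared on the same footing.
    """
    return (num_tokens / num_bytes) * loss_nats_per_token / math.log(2)

# Hypothetical numbers for illustration only.
text = "The Pile is an 800GB dataset of diverse text for language modeling."
num_bytes = len(text.encode("utf-8"))
num_tokens = 15        # assumed tokenizer output length
avg_loss = 2.9         # assumed mean cross-entropy in nats per token
print(f"BPB = {bits_per_byte(avg_loss, num_tokens, num_bytes):.3f}")
```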

Benchmark on Different Components

To further confirm how diversity improves the dataset's capability, we need to evaluate how the diverse dataset enhances performance on its individual components. Unfortunately, due to resource limitations, the authors could not train GPT-3 from scratch on The Pile. Instead, they opted for a proxy approach using the formula below:

∆set = Lset − Lowt2

The parameter ∆set represents the difference between the performance of the GPT-3 model when evaluated on a given Pile component (Lset) and its performance when evaluated on the OWT2 dataset (Lowt2).

The term ∆set allows researchers to assess how much harder The Pile dataset is for GPT-3 compared to OWT2, while also considering the relative difficulty of tasks and the potential performance improvement achievable by training models specifically on The Pile dataset.

Observing the dotted line in the figure, which represents the average performance improvement, we notice significant enhancements in certain fields, such as DM Mathematics, Enron Emails, and others. This suggests that if GPT-3 were trained from scratch on The Pile dataset, its performance could potentially surpass the baseline model. Through these insights, we gain valuable understanding of the potential benefits of training language models on diverse datasets like The Pile.

Evaluation

To evaluate how the diversity from The Pile improves model training effectiveness, GPT-2 was trained on three different datasets, and the Bits per UTF-8 encoded byte (BPB) metric was employed for evaluation across the datasets. Refer to the table below for details.

From these observations, The Pile outperforms every other dataset, with CC-100 showing only minimal improvement over the baseline dataset, Raw CC. Notably, certain fields, such as GitHub, Stack Exchange, and DM Mathematics, exhibit significant improvements. This underscores the effectiveness of training datasets with diverse content in enhancing model training quality.

More about the Pile

Another goal of this work is to address ethical and bias concerns in AI research, while also promoting and standardizing the practice of engaging with AI ethics. The paper’s analysis delves into various perspectives, including topic distribution, inappropriate content, sensitive content (gender, religion, race), and data authority. Readers interested in these aspects can explore the paper to find topics of interest.

Conclusion

In conclusion, this work introduces a new open-source dataset that has been widely adopted in the research community since its release. The study demonstrates the dataset’s capability enhancement by incorporating diverse categories of data through the evaluation process. Moreover, the work endeavors to address ethical and bias concerns in AI research, reflecting a commitment to responsible AI development.

Section 2: Mistral 7B

Why Mistral 7B

Here are the essential components of Mistral 7B (see Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer):

Grouped-query attention

Advantages: accelerates inference speed and reduces the memory requirement during decoding, allowing for larger batch sizes and hence higher throughput.

Sliding Window Attention

Each layer attends only to a fixed window of the previous W tokens, but by stacking layers information propagates beyond the window: a hidden state at layer k can access information from tokens up to roughly k × W positions back.
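As a concrete illustration (a minimal sketch, not Mistral's actual implementation), the boolean mask below lets each position attend only to itself and the previous `window - 1` tokens:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean attention mask: query position i may attend to key positions j
    with i - window < j <= i (causal attention restricted to a sliding window)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row)
    return (j <= i) & (j > i - window)

print(sliding_window_mask(seq_len=8, window=3).int())
# Although a single layer only sees `window` tokens, stacking k such layers
# lets information propagate roughly k * window tokens back.
```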

Rolling Buffer Cache
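Because attention is limited to a window of size W, the KV cache can be kept at a fixed size: the keys and values for timestep i are written to slot i mod W, overwriting entries that have fallen out of the window. The sketch below is a simplified, single-head illustration of that idea, not the actual Mistral code.

```python
class RollingBufferCache:
    """Toy rolling KV cache of fixed size `window` (single layer, single head)."""

    def __init__(self, window: int):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window

    def update(self, position: int, key, value):
        slot = position % self.window   # overwrite the oldest entry
        self.keys[slot] = key
        self.values[slot] = value

cache = RollingBufferCache(window=4)
for pos, tok in enumerate("the cat sat on the mat".split()):
    cache.update(pos, key=f"k({tok})", value=f"v({tok})")
print(cache.keys)   # only keys for the last 4 positions are retained
```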

Pre-fill and chunking

Result:

Here is Mistral 7B's performance on different tasks, compared to other open-source LLMs:

Mistral 7B performs on par with a Llama 2 model that is more than 3x its size, which translates directly into memory savings and throughput gains.

Fine-tuning Mistral 7B for Chat: Mistral 7B - Instruct

Guardrails

Section 3: Mixtral of Experts

1. Motivation

  1. The scale of a model is one of the most important metrics for better model quality.
  2. How can the model size be scaled up under a limited compute budget?

2. Contribution

The main contributions of this paper are:

  1. They proposed Mixtral 8x7B, which achieves competitive performance in terms of accuracy, model size, and efficiency.
  2. They fine-tuned Mixtral 8x7B - Instruct and released it under the Apache 2.0 license, which means the open-sourced model can be used for both academic and commercial purposes.

2.1 Mixtral 8x7B

Basically, Mixtral is made up of two components, shown in the figure below:

2.2 Mixtral 8x7B - Instruct

3. History of MoE

  1. Adaptive Mixture of Local Experts (1991)

    The roots of MoEs come from the 1991 paper Adaptive Mixture of Local Experts. The idea, akin to ensemble methods, was to have a supervised procedure for a system composed of separate networks, each handling a different subset of the training cases. Each separate network, or expert, specializes in a different region of the input space. A gating network determines the weights for each expert. During training, both the expert and the gating are trained.

  2. Learning Factored Representations in a Deep Mixture of Experts (2013)

    In the traditional MoE setup, the whole system comprises a gating network and multiple experts. MoEs as whole models have been explored in SVMs, Gaussian Processes, and other methods. The work by Eigen, Ranzato, and Sutskever explored MoEs as components of deeper networks. This allows having MoEs as layers in a multilayer network, making it possible for the model to be both large and efficient simultaneously.

  3. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017)

    This work explored mixtures of experts in the context of NLP, scaling the idea of MoE to a 137B-parameter LSTM (the de facto NLP architecture at the time, introduced by Hochreiter and Schmidhuber) by introducing sparsity, which preserves very fast inference even at high scale.

  4. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (2021)

    This work proposed and developed a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost compared to dense variants. In their work, they integrate the MoE layer into the transformer architecture as shown in the figure.

  5. Switch Transformer (2022)

    The Switch Transformer improved the design of the MoE layer in the Transformer architecture, and this design is now the most popular Transformer-based MoE architecture used in large language models.

4. Mixtral 8x7B

Mixtral is based on a transformer architecture and uses the same modifications as described in Mistral 7B.

4.1 Model Architecture

The overall parameter architecture of Mixtral is similar to that of Mistral.

4.2 MoE Layer

Formulation of each MoE Layer

The output of each MoE layer for an input token x can be formulated as:

y = Σ_i G(x)_i · E_i(x)

where G(x)_i denotes the gating weight assigned to the i-th expert and E_i(x) is the output of the i-th expert network.

In practice, only a few experts are activated; in the example below, only 2 experts are activated and take part in the inference.

Sparsity

To activate only a few experts, the gating vector G(x) should be sparse. This is achieved by taking the softmax over the top-K logits of a linear layer, which can be formulated as:

G(x) := Softmax(TopK(x · W_g))

The formulation of TopK(ℓ) is:

TopK(ℓ)_i := ℓ_i if ℓ_i is among the top-K coordinates of the logits ℓ, and −∞ otherwise (so the corresponding softmax weight becomes 0).

Mixtral

In Mixtral, the MoE layer is applied independently per token and replaces the feed-forward (FFN) sub-block of the transformer block. They use the same SwiGLU architecture as the expert function E_i(x) and set K = 2. This means each token is routed to two SwiGLU sub-blocks with different sets of weights. Putting this all together, the output y for an input token x is computed as:

y = Σ_{i=0}^{n−1} Softmax(Top2(x · W_g))_i · SwiGLU_i(x)
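To make the routing concrete, here is a small, self-contained sketch of a top-K gated MoE layer in the spirit of the formulas above. It is an illustration under simplifying assumptions (tiny dimensions, a naive loop over experts), not the released Mixtral implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative sparse MoE layer: top-k routing over SwiGLU-style experts."""

    def __init__(self, dim=32, hidden=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router weights W_g
        self.w1 = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(n_experts)])
        self.w3 = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(n_experts)])
        self.w2 = nn.ModuleList([nn.Linear(hidden, dim) for _ in range(n_experts)])

    def expert(self, i, x):
        # SwiGLU feed-forward block: w2(silu(w1(x)) * w3(x))
        return self.w2[i](F.silu(self.w1[i](x)) * self.w3[i](x))

    def forward(self, x):                              # x: (n_tokens, dim)
        logits = self.gate(x)                          # (n_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)          # softmax over the top-k logits only
        y = torch.zeros_like(x)
        for slot in range(self.k):                     # each token mixes its k selected experts
            for e in top_idx[:, slot].unique():
                rows = top_idx[:, slot] == e
                y[rows] += weights[rows, slot:slot + 1] * self.expert(int(e), x[rows])
        return y

tokens = torch.randn(5, 32)
print(SparseMoE()(tokens).shape)   # torch.Size([5, 32])
```

With k = 2 and eight experts, this mirrors the idea of routing each token to two SwiGLU experts while leaving the other experts untouched.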

5. Experiments

5.1 Setup

Mixtral is mainly compared with Llama2 because they are both open-sourced LLMs. They are compared on 6 tasks.

5.2 Accuracy Comparison

The figure below compares the performance of Mixtral with the Llama models across different categories. Mixtral surpasses Llama 2 70B on most metrics. In particular, Mixtral displays superior performance on code and mathematics benchmarks.

5.3 Size and Efficiency Comparison

As a sparse Mixture-of-Experts model, Mixtral only uses 13B active parameters for each token. With roughly 5x fewer active parameters, Mixtral is able to outperform Llama 2 70B across most categories.

5.4 Comparison with Llama2 70B and GPT-3.5

They also report the performance of Mixtral 8x7B compared to Llama 2 70B and GPT-3.5. We observe that Mixtral performs similarly to or above the two other models. On MMLU, Mixtral obtains better performance, despite its significantly smaller capacity (47B parameters compared to 70B).

5.5 Multilingual Benchmarks

The extra capacity allows Mixtral to perform well on multilingual benchmarks while maintaining a high accuracy in English. In particular, Mixtral significantly outperforms Llama 2 70B in French, German, Spanish, and Italian as shown below.

5.6 Long Range Performance

They test its long-range performance on the passkey retrieval task, which measures the ability of the model to retrieve a passkey inserted at a random position in a long prompt.

The left figure below shows that Mixtral achieves 100% retrieval accuracy regardless of the context length or the position of the passkey in the sequence.

The right figure below shows that the perplexity of Mixtral on a subset of the proof-pile dataset decreases monotonically as the size of the context increases.

5.7 Bias Benchmarks

To identify possible flaws to be corrected by fine-tuning / preference modeling, they also measure the base model's performance on the Bias Benchmark for QA (BBQ) and the Bias in Open-Ended Language Generation Dataset (BOLD).

5.8 Instruction Fine-tuning

Fine-tuning techniques they used:

5.9 Routing Analysis

This experiment aims to explore whether experts are specialized to specific domains.

Setup

Result

According to the outputs of the selected layers, they do not observe obvious patterns in the assignment of experts based on topic. For instance, at all layers, the distribution of expert assignments is very similar for arXiv papers (written in LaTeX), for biology (PubMed Abstracts), and for philosophy (PhilPapers) documents.

The picture below shows examples of text from different domains, where each token is highlighted with a background color corresponding to its selected expert.

Section 5: Llama 2: Open Foundation and Fine-Tuned Chat Models

From the following figure, we can see the development of large language models. The Llama 2 model was released in July 2023 and is open-sourced.

The training process of the Llama 2 model includes the Pre-training Methodology and the Fine-tuning Methodology.

(1) Pre-training Methodology

To create the new family of Llama 2 models, the authors used an optimized auto-regressive transformer, but made several changes to improve performance. Specifically, they performed more robust data cleaning, updated data mixes, trained on 40% more total tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability for larger models.

For the training details, Llama 2 adopts most of the pretraining settings and model architecture from Llama 1:

- use the standard transformer architecture
- apply pre-normalization using RMSNorm
- use the SwiGLU activation function
- use rotary positional embeddings (RoPE)

The primary architectural differences between the two models are that Llama 2 increased the context length and used grouped-query attention (GQA).

There are some problems with prior methods: (1) absolute positional encoding is simple, but may not generalize well to longer sequences; (2) relative positional bias (as in T5) is not efficient. To solve these problems, the authors apply a rotation to the word vectors to encode position, capturing absolute position while letting attention depend on relative position within an input sentence, so no additional positional parameters need to be trained.

This figure illustrates the implementation of Rotary Position Embedding, or RoPE, which is an enhancement to the traditional position encoding used in transformer models. Unlike standard encoding that applies a fixed pattern to each element, RoPE dynamically encodes the position information by rotating the query and key vectors in the attention mechanism. In the top-left, you see a 2D representation of a query or key vector, marked as (X1, X2). RoPE applies a rotation matrix based on the position m — which rotates the vector to a new position, as shown by (X’1, X’2). This rotation embeds the positional information directly into the query/key, making it position-aware. Below, you see multiple layers of a transformer model with RoPE applied. The different colored blocks represent different dimensions of the query or key vectors. The numbers 1 through 6 indicate different positions in the sequence. The rotation matrix is unique for each position, thus rotating each dimension differently, as indicated by the various θ values. By integrating the position into the computation of attention, RoPE allows for more precise and context-aware interpretations of sequences, which is especially beneficial for tasks where the order and position of elements are crucial.
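Below is a minimal numerical sketch of the rotation itself (interleaved-pair convention, illustrative dimensions); it is meant to show the mechanics rather than reproduce Llama 2's exact implementation.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Minimal rotary position embedding sketch.

    x has shape (seq_len, dim) with an even dim; consecutive feature pairs are
    treated as 2-D points and rotated by an angle that depends on the position m
    and a per-pair frequency theta_i.
    """
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-torch.arange(half, dtype=torch.float32) / half)  # per-pair frequencies
    m = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # positions
    angles = m * theta                                                 # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                    # split into (x1, x2) pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                                 # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(6, 8)      # 6 positions, 8-dimensional queries
print(rope(q).shape)       # torch.Size([6, 8])
# Dot products between rotated queries and keys depend only on relative offsets,
# which is why RoPE captures relative position without extra learned parameters.
```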

Among the different versions of Llama 2, the 34B and 70B models used GQA for improved inference scalability.

In the above figure, we’re comparing three attention mechanisms used in neural networks: Multi-head, Grouped-query, and Multi-query attention. Multi-head attention uses multiple sets of keys, queries, and values to capture different features from the input data. Grouped-query attention simplifies this by having groups of queries share the same key and value, reducing computational load while still maintaining some multi-head benefits. Multi-query attention further simplifies by using a single key and value for all queries, which is efficient but less expressive.

After pretraining, the results are not as good as those of proprietary, closed-source models such as GPT-4 and PaLM-2-L, but Llama 2 is still very competitive for a model that has only been pre-trained.

(2) Fine-tuning Methodology

The fine-tuning methodology includes iterative fine-tuning: sample K outputs from the model and select the best candidate based on the reward model; this can also be combined with PPO. Generating multiple samples in this manner can drastically increase the maximum reward of a sample. The procedure explores the output space randomly and then performs SFT or PPO using the samples with the highest reward.
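As a sketch of the sampling-and-selection step (the `generate` and `reward` callables below are hypothetical stand-ins for the chat model and the trained reward model, not Meta's actual APIs):

```python
import random

def best_of_k(prompt, generate, reward, k=8):
    """Rejection-sampling sketch: draw k candidates, keep the highest-reward one.

    The selected samples would then be used for SFT, or to seed further PPO training.
    """
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=reward)

# Toy stand-ins for illustration only.
generate = lambda p: p + " -> answer#" + str(random.randint(0, 999))
reward = lambda text: len(text)   # pretend longer answers score higher
print(best_of_k("How to go from Paris to NY?", generate, reward, k=4))
```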

The fine-tuning methodology also includes a novel concept called Ghost Attention, or GAtt for short. Look at the comparison in the following figure. On the left, we have a typical scenario where a chatbot is tasked to always answer with emojis. However, it struggles to maintain the context over multiple turns of conversation. For instance, when asked ‘How to go from Paris to NY?’, it provides a detailed text response, which is not what it is supposed to do according to the ‘always answer with emojis’ rule. On the right, we introduce Ghost Attention. GAtt is an improved attention mechanism that addresses the pitfalls of multi-turn memory. It helps the model remember the ‘emoji-only’ rule across different interactions. So, when posed the same question ‘How to go from Paris to NY?’, the GAtt-enhanced chatbot successfully responds with relevant emojis illustrating the travel and the destination. This visual contrast highlights the effectiveness of Ghost Attention in maintaining consistency and context in chatbot interactions, a crucial advancement in conversational AI.

The following figure reports the progress of the different SFT and then RLHF versions on both the Safety and Helpfulness axes, measured by Meta's in-house safety and helpfulness reward models. On this set of evaluations, the authors outperform ChatGPT on both axes after RLHF-V3 (harmlessness and helpfulness >50%). Despite the aforementioned relevance of using the reward as a point-wise metric, it can arguably be biased in favor of Llama 2-Chat. Therefore, for a fair comparison, they additionally compute the final results using GPT-4 to assess which generation is preferred. The order in which the ChatGPT and Llama 2-Chat outputs appear in the GPT-4 prompt is randomly swapped to avoid any bias. As expected, the win-rate in favor of Llama 2-Chat is less pronounced, although the latest Llama 2-Chat still obtains more than a 60% win-rate.

The following table shows evaluation results on TruthfulQA, assessing the accuracy of different language models in generating responses that are both true and informative. For the Llama 2 model, as the model size increases from 7 billion to 70 billion parameters, there is a trend of improvement in producing true and informative responses in the TruthfulQA evaluation. The 70B variant of the pre-trained Llama 2 model exhibits over 50% combined true and informative responses, with a substantial increase in the percentage of purely true responses as well. The table also shows that the Llama 2-Chat model achieves even higher accuracy, indicating the effectiveness of fine-tuning in enhancing the model's ability to generate truthful information.

For model safety, we can focus on safety in fine-tuning, safety in RLHF, and safety evaluation.

During the fine-tuning process, the authors gather adversarial prompts and safe demonstrations for the SFT training set, essentially probing for edge cases. The annotator writes both the prompt and the response for adversarial samples.

This image showcases how Llama 2, when fine-tuned for safety, responds to a prompt requesting a roast that includes brutal and offensive content. The model's response demonstrates a refusal to engage in harmful behavior, highlighting the successful implementation of safety measures in fine-tuning. It emphasizes the importance of maintaining respectful interaction and suggests focusing on positive and constructive feedback instead. This illustrates the model's ability to handle adversarial samples by promoting positive discourse and rejecting requests for negative output.

After gathering only a few thousand supervised demonstrations, the authors switched entirely to RLHF to teach the model how to write more nuanced responses. As shown in the following Figure 15, the authors use the mean reward model scores as proxies of model performance on safety and helpfulness. We can observe that when they increase the proportion of safety data, the model’s performance on handling risky and adversarial prompts improves dramatically, and we see a lighter tail in the safety reward model score distribution. Meanwhile, the mean helpfulness score remains constant. They hypothesize that this is because they already have a sufficiently large amount of helpfulness training data. Appendix A.4.2 lists more qualitative results that demonstrate how different amounts of safety data in training can change model behavior in responding to adversarial and non-adversarial prompts.

The following image presents results from a safety evaluation of the Llama 2 model, specifically the percentages of toxic generations produced by the model across different demographic groups. It shows that pre-trained models generate a higher percentage of toxic outputs, which varies across demographic categories. However, after fine-tuning, the Llama 2-Chat model shows a dramatic reduction in toxicity, with zero or near-zero percentages across all groups. This indicates the effectiveness of fine-tuning in reducing the model's generation of toxic content and improving its safety with respect to different demographics.

Section 6: OLMo: Accelerating the Science of Language Models

Introduction

The success of ChatGPT has demonstrated that large language models have commercial value. The flip side of this commercial success, however, is that the model weights and training procedure become proprietary and protected by OpenAI. Therefore, ChatGPT and GPT-4 are also referred to as “closed-source models”.

LLaMA is one of the many “open-source models” treated as a foundation by many developers, who build AI applications by fine-tuning its open-sourced model weights. However, for researchers who aim to replicate and improve the foundation large language model or study the science behind it, many aspects of training LLaMA, such as the complete dataset or the model checkpoints, are still not open to the public. Open Language Model (OLMo) addresses this issue by open-sourcing the entire training and evaluation framework necessary for training a large language model with performance on par with LLaMA.

As shown in the table below, previous research that open-sourced language models either left some key aspect of the training/evaluation pipeline undisclosed (for example, Falcon's language model) or did not reach performance comparable to LLaMA (as in the case of LLM360). Open Language Model (OLMo) is the first to open source the whole training/evaluation framework while achieving state-of-the-art performance.

Model Architecture

OLMo open-sourced three sizes of models: 1 billion, 7 billion, and 65 billion parameters. The 65B model was still in training at the time the paper was written. The exact architectures are shown in the table below:

A more detailed model architecture for the OLMo-7B model, along with the architectures of other 7-8B models, is shown in the table below:

Pipeline for Creating the Dataset Dolma

One key aspect that is open-sourced by OLMo is the complete dataset for pre-training the large language model. The released dataset is named Dolma and was preprocessed by the following steps:

Distributed Training: Hardware

Researchers for OLMo trained the same model twice on two different supercomputers, named LUMI and MosaicML. Training OLMo on LUMI utilized 1024 AMD MI250X GPUs, and training the same model on MosaicML utilized 216 NVIDIA A100 GPUs. More details on the supercomputer setup are listed below:

Optimizer

To ensure better memory efficiency when training OLMo, a ZeRO-style optimizer strategy is employed via PyTorch's FSDP (Fully Sharded Data Parallel) framework. The specific optimizer settings at the 7B scale are shown in the table below.
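For reference, here is a minimal sketch of wrapping a model with PyTorch's FSDP, which provides the ZeRO-style sharding of parameters, gradients, and optimizer state mentioned above; the model and hyperparameters are placeholders, not OLMo's actual training configuration.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun, which sets RANK/WORLD_SIZE/MASTER_ADDR, etc.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; OLMo's decoder-only transformer would go here.
model = torch.nn.Transformer(d_model=512, num_encoder_layers=2, num_decoder_layers=2).cuda()
model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
```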

Evaluation

As demonstrated in the table and figure below, OLMo achieves performance comparable to other state-of-the-art language models both in terms of common sense reasoning and intrinsic evaluation by Paloma.

Finally, the paper also reports the carbon emissions of training OLMo, with slightly higher GPU power consumption compared to training Llama 2. Since the LUMI supercomputer runs on clean energy, its carbon emissions are counted as zero.
