Dr. Yanjun Qi

Recent GenAI Papers I Read.

Reviews Indexed

Toggle Menu

Index
Recent Posts By GenAI Category
- FM Basic
- FM Adapt
- FM Risk
- FM Reasoning
- FM Agent
- FM Platform
- FM Efficiency

FMEfficient

Recent Readings for Efficiency of Foundation Models (since 2022) (Index of Posts):

No.	Read Date	Title and Information	We Read @
1	2025, Apr, 20	Model serving - Efficiency Inference	2025-S4
2	2025, Feb, 10	Platform - Prompting Engineering tools / Prompt Compression	2025-S4
3	2024, Apr, 30	KV Cache and Tooling	2024-S27
4	2024, Apr, 25	Advanced Transformer Architectures	2024-S26
5	2024, Apr, 18	Recent LLM basics	2024-S24
6	2024, Apr, 2	LLM Scaling law and Efficiency	2024-S19

Here is a detailed list of posts!

[1]: Model serving - Efficiency Inference

read on: - 20 Apr 2025
Serving

Model Serve Readings:

Splitwise: https://arxiv.org/pdf/2311.18677

+[Submitted on 30 Nov 2023 (v1), last revised 20 May 2024 (this version, v2)]

Splitwise: Efficient generative LLM inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini
Recent innovations in generative large language models (LLMs) have made their applications and use-cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation, and a memory-intensive token generation, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike compute-intensive prompt computation phases, token generation phases do not require the compute capability of the latest GPUs, and can be run with lower power and cost. With Splitwise, we propose splitting the two phases of a LLM inference request on to separate machines. This allows us to use hardware that is well-suited for each phase, and provision resources independently per phase. However, splitting an inference request across machines requires state transfer from the machine running prompt computation over to the machine generating tokens. We implement and optimize this state transfer using the fast back-plane interconnects available in today’s GPU clusters. We use the Splitwise technique to design LLM inference clusters using the same or different types of machines for the prompt computation and token generation phases. Our clusters are optimized for three key objectives: throughput, cost, and power. In particular, we show that we can achieve 1.4x higher throughput at 20% lower cost than current designs. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.

DistServe: https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf

[Submitted on 18 Jan 2024 (v1), last revised 6 Jun 2024 (this version, v3)]
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang
DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application’s TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster’s bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests. Comments: OSDI 2024

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
[Submitted on 27 May 2022 (v1), last revised 23 Jun 2022 (this version, v2)]
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware – accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao
[Submitted on 17 Jul 2023]
Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40\% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2× speedup compared to FlashAttention, reaching 50-73\% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72\% model FLOPs utilization).

[2]: Platform - Prompting Engineering tools / Prompt Compression

read on: - 10 Feb 2025
Prompting

In this session, our readings cover:

Required Readings:

The Prompt Report: A Systematic Survey of Prompting Techniques

URL
[Submitted on 6 Jun 2024 (v1), last revised 30 Dec 2024 (this version, v5)]
Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, Hevander Da Costa, Saloni Gupta, Megan L. Rogers, Inna Goncearenco, Giuseppe Sarli, Igor Galynker, Denis Peskoff, Marine Carpuat, Jules White, Shyamal Anadkat, Alexander Hoyle, Philip Resnik
Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, it suffers from conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt due to its relatively recent emergence. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a detailed vocabulary of 33 vocabulary terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering, including advice for prompting state-of-the-art (SOTA) LLMs such as ChatGPT. We further present a meta-analysis of the entire literature on natural language prefix-prompting. As a culmination of these efforts, this paper presents the most comprehensive survey on prompt engineering to date.

Prompt Compression for Large Language Models: A Survey

[Submitted on 16 Oct 2024 (v1), last revised 17 Oct 2024 (this version, v2)]
URL
Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier
Leveraging large language models (LLMs) for complex natural language tasks typically requires long-form prompts to convey detailed requirements and information, which results in increased memory usage and inference costs. To mitigate these challenges, multiple efficient methods have been proposed, with prompt compression gaining significant research interest. This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods. First, the technical approaches of these methods are compared, followed by an exploration of various ways to understand their mechanisms, including the perspectives of attention optimization, Parameter-Efficient Fine-Tuning (PEFT), modality integration, and new synthetic language. We also examine the downstream adaptations of various prompt compression techniques. Finally, the limitations of current prompt compression methods are analyzed, and several future directions are outlined, such as optimizing the compression encoder, combining hard and soft prompts methods, and leveraging insights from multimodality.

[3]: KV Cache and Tooling

read on: - 30 Apr 2024
Efficiency

KV Caching in LLM:

grouped query attention: https://arxiv.org/pdf/2305.13245.pdf
Paged attention https://arxiv.org/pdf/2309.06180.pdf https://openreview.net/pdf?id=uNrFpDPMyo

Must know tools for training/finetuning/serving LLM’s -

Torchtune - Build on top of Pytorch, for training and finetuning LLM’s. Uses yaml based configs for easily running experiments. Github -
axolotl - Built on top on Huggigface peft and transformer library, supports fine-tuning a large number for models like Mistral, LLama etc. Provides support for techniques like RLHF, DPO, LORA, qLORA etc. Github
LitGPT - Build on nanoGPT and Megatron, support pre-training and fine-tuning, has examples like Starcoder, TinyLlama etc. Github -
Maxtext - Jax based library for training LLM’s on Google TPU’s with configs for models like Gemma, Mistral and LLama2 etc. Github
Langchain- https://python.langchain.com/docs/get_started/introduction
haystack.deepset.ai
- https://github.com/deepset-ai/haystack
- LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it’s best suited for building RAG, question answering, semantic search or conversational agent chatbots.
LlamaIndex
- https://docs.llamaindex.ai/en/stable/ LlamaIndex supports Retrieval-Augmented Generation (RAG). Instead of asking LLM to generate an answer immediately, LlamaIndex: retrieves information from your data sources first, / adds it to your question as context, and / asks the LLM to answer based on the enriched prompt.
Making Retrieval Augmented Generation Fast
- https://www.pinecone.io/learn/fast-retrieval-augmented-generation/
OpenMoE
- https://github.com/XueFuzhao/OpenMoE

[4]: Advanced Transformer Architectures

read on: - 25 Apr 2024
Efficiency

In this session, our readings cover:

Required Readings:

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

https://arxiv.org/abs/2311.12351
Transformer-based Large Language Models (LLMs) have been applied in diverse areas such as knowledge bases, human interfaces, and dynamic agents, and marking a stride towards achieving Artificial General Intelligence (AGI). However, current LLMs are predominantly pretrained on short text snippets, which compromises their effectiveness in processing the long-context prompts that are frequently encountered in practical scenarios. This article offers a comprehensive survey of the recent advancement in Transformer-based LLM architectures aimed at enhancing the long-context capabilities of LLMs throughout the entire model lifecycle, from pre-training through to inference. We first delineate and analyze the problems of handling long-context input and output with the current Transformer-based models. We then provide a taxonomy and the landscape of upgrades on Transformer architecture to solve these problems. Afterwards, we provide an investigation on wildly used evaluation necessities tailored for long-context LLMs, including datasets, metrics, and baseline models, as well as optimization toolkits such as libraries, frameworks, and compilers to boost the efficacy of LLMs across different stages in runtime. Finally, we discuss the challenges and potential avenues for future research. A curated repository of relevant literature, continuously updated, is available at this https URL.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Paper: https://arxiv.org/abs/2205.14135
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware – accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
Related: blogpost FlashAttention — Techniques for Efficient Inference of LLMs (III/IV)

JAMBA

Introducing Jamba: AI21’s Groundbreaking SSM-Transformer Model Debuting the first production-grade Mamba-based model delivering best-in-class quality and performance.
March 28, 2024
https://www.ai21.com/blog/announcing-jamba
We are thrilled to announce Jamba, the world’s first production-grade Mamba based model. By enhancing Mamba Structured State Space model (SSM) technology with elements of the traditional Transformer architecture, Jamba compensates for the inherent limitations of a pure SSM model. Offering a 256K context window, it is already demonstrating remarkable gains in throughput and efficiency—just the beginning of what can be possible with this innovative hybrid architecture. Notably, Jamba outperforms or matches other state-of-the-art models in its size class on a wide range of benchmarks.

[5]: Recent LLM basics

read on: - 18 Apr 2024
Efficiency BasicLLM

In this session, our readings cover:

Require Readings:

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

https://arxiv.org/abs/2312.15234
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

https://arxiv.org/abs/2304.01373
How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce \textit{Pythia}, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend \textit{Pythia} to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at \url{this https URL}.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

https://arxiv.org/abs/2403.09611
Multimodal LLM Pre-training - provides a comprehensive overview of methods, analysis, and insights into multimodal LLM pre-training; studies different architecture components and finds that carefully mixing image-caption, interleaved image-text, and text-only data is key for state-of-the-art performance; it also proposes a family of multimodal models up to 30B parameters that achieve SOTA in pre-training metrics and include properties such as enhanced in-context learning, multi-image reasoning, enabling few-shot chain-of-thought prompting.

[6]: LLM Scaling law and Efficiency

read on: - 02 Apr 2024
Efficiency

In this session, our readings cover:

Required Readings:

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
https://github.com/RUCAIBox/LLMSurvey

Efficient Large Language Models: A Survey

https://arxiv.org/abs/2312.03863
https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey
Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding, language generation, and complex reasoning and have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency this http URL this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspective, respectively. We have also created a GitHub repository where we compile the papers featured in this survey at this https URL, and will actively maintain this repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Recent research, such as BitNet [23], is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

Here is a name list of posts!

BackTop

Model serving - Efficiency Inference

10 minute read

Model Serve Readings:

Platform - Prompting Engineering tools / Prompt Compression

2 minute read

In this session, our readings cover:

KV Cache and Tooling

3 minute read

KV Caching in LLM:

Advanced Transformer Architectures

24 minute read

In this session, our readings cover:

Recent LLM basics

21 minute read

In this session, our readings cover:

LLM Scaling law and Efficiency

39 minute read

In this session, our readings cover:

Dr. Yanjun Qi

FMEfficient

Recent Readings for Efficiency of Foundation Models (since 2022) (Index of Posts):

Here is a detailed list of posts!

[1]: Model serving - Efficiency Inference

Model Serve Readings:

Splitwise: https://arxiv.org/pdf/2311.18677

DistServe: https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

more readings

Generative AI on the Edge: Architecture and Performance Evaluation

Efficient Transformers: A Survey

A Survey on Model Compression for Large Language Models

Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications

Keep the Cost Down: A Review on Methods to Optimize LLM’ s KV-Cache Consumption

[2]: Platform - Prompting Engineering tools / Prompt Compression

Required Readings:

The Prompt Report: A Systematic Survey of Prompting Techniques

Prompt Compression for Large Language Models: A Survey

More Readings:

Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing

[3]: KV Cache and Tooling

KV Caching in LLM:

Must know tools for training/finetuning/serving LLM’s -

More readings

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Retentive Network: A Successor to Transformer for Large Language Models

RWKV: Reinventing RNNs for the Transformer Era

[4]: Advanced Transformer Architectures

Required Readings:

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

JAMBA

More readings:

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Efficient Memory Management for Large Language Model Serving with PagedAttention

Attention Mechanisms in Computer Vision: A Survey

[5]: Recent LLM basics

Require Readings:

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

More Readings:

Sparks of Large Audio Models: A Survey and Outlook

[6]: LLM Scaling law and Efficiency

Required Readings:

Scaling Laws for Neural Language Models

Efficient Large Language Models: A Survey

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

More Readings:

An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing

LIMA: Less Is More for Alignment /

Here is a name list of posts!

Model serving - Efficiency Inference

Platform - Prompting Engineering tools / Prompt Compression

KV Cache and Tooling

Advanced Transformer Architectures

Recent LLM basics

LLM Scaling law and Efficiency