FMBasic


Recent Readings for Basic Topics of Foundation Models (since 2022) (Index of Posts):

No. | Read Date | Title and Information | We Read @
1 | 2025, Apr, 25 | LLM Post-training | 2025-S4
2 | 2025, Apr, 18 | Inference test time scaling law | 2025-S4
3 | 2025, Apr, 13 | Model Interpretability for FM | 2025-S4
4 | 2025, Apr, 9 | multimodal FMs - video / audio | 2025-S4
5 | 2025, Apr, 4 | More Model Serving Readings - SGLang + Chunked Prefill | 2025-S4
6 | 2025, Apr, 2 | LLM Alignment - PPO | 2025-S4
7 | 2025, Jan, 20 | more LLM basics - a survey | 2025-S2
8 | 2025, Jan, 15 | LLM basics - emergent ability and GenAI platform | 2025-S1
9 | 2025, Jan, 13 | Introduction | 2025-S0
10 | 2024, Feb, 8 | Open Source LLM - Mistral Data preparation | 2024-S6
11 | 2024, Feb, 6 | Survey human alignment | 2024-S5
12 | 2024, Jan, 30 | LLM evaluating framework | 2024-S3
13 | 2024, Jan, 23 | LLM basics | 2024-S1
14 | 2022, Dec, 3 | RLHF + InstructGPT | 2022-W6
15 | 2022, Dec, 1 | Stable Diffusion + DreamBooth + LoRA | 2022-W5
16 | 2022, Oct, 1 | Emergent Abilities of LLM | 2022-W4
17 | 2022, May, 3 | A Generalist Agent + offline RL + UniMask | 2022-W1


Here is a detailed list of posts!



[1]: LLM Post-training


Customization

In this session, our readings cover:

Required Readings:

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

  • Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu
  • [Submitted on 16 Apr 2024 (v1), last revised 10 Oct 2024 (this version, v3)]
  • Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions. Our code is publicly available at this https URL.

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

  • URL
  • [Submitted on 22 Nov 2024 (v1), last revised 29 Jan 2025 (this version, v3)]
  • Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi
  • Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce Tulu 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. Tulu 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With Tulu 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the Tulu 3 model weights and demo, we release the complete recipe – including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the Tulu 3 approach to more domains.
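
RLVR replaces a learned reward model with a programmatic check. Below is a minimal sketch of that idea in Python, assuming a math-style task whose final answer can be string-matched; the helper names and the answer-extraction heuristic are illustrative, not taken from the Tulu 3 codebase.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Pull the last number-like span out of a model completion (illustrative heuristic)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else ""

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """RLVR-style reward: 1.0 if the verifiable check passes, else 0.0.
    No learned reward model is involved; the check itself defines the reward."""
    return 1.0 if extract_final_answer(completion) == gold_answer else 0.0

# Usage: rewards for a batch of sampled completions, fed to any policy-gradient trainer.
completions = ["The answer is 42.", "I think it's 41."]
rewards = [verifiable_reward(c, gold_answer="42") for c in completions]  # [1.0, 0.0]
```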

More Readings:


[2]: Inference test time scaling law


Scaling

In this session, our readings cover:

Required Readings:

What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

  • [Submitted on 31 Mar 2025]
  • Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, Chen Ma
  • As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as “test-time computing”, has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions.

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

  • [Submitted on 6 Aug 2024]
  • URL
  • Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
  • Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model’s distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a “compute-optimal” scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.

s1: Simple test-time scaling

  • https://arxiv.org/abs/2501.19393
  • Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
  • Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at this https URL
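
Budget forcing is simple enough to express as a decoding-loop wrapper. Below is a hedged sketch assuming a hypothetical `generate(text, stop, max_new_tokens)` callable that returns newly generated text; token counts are approximated with a whitespace split, whereas the real s1 implementation counts tokenizer tokens and uses the model's actual end-of-thinking delimiter.

```python
def budget_forced_generate(generate, prompt: str,
                           min_thinking_tokens: int = 200,
                           max_thinking_tokens: int = 2000,
                           end_of_thinking: str = "</think>") -> str:
    """Sketch of s1-style budget forcing around a hypothetical `generate` callable."""
    thinking = ""
    while True:
        budget_left = max_thinking_tokens - len(thinking.split())
        thinking += generate(prompt + thinking, stop=end_of_thinking,
                             max_new_tokens=budget_left)
        if len(thinking.split()) >= max_thinking_tokens:
            break                      # hard cap: forcefully end the thinking phase
        if len(thinking.split()) < min_thinking_tokens:
            thinking += "\nWait"       # suppress termination: nudge the model to keep reasoning
            continue
        break
    # Close the thinking block and ask for the final answer.
    return generate(prompt + thinking + end_of_thinking + "\nFinal answer:",
                    stop=None, max_new_tokens=256)
```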

More Readings:

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

  • [Submitted on 1 Aug 2024 (v1), last revised 14 Oct 2024 (this version, v2)]
  • Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
  • While the scaling laws of large language models (LLMs) training have been extensively studied, optimal inference configurations of LLMs remain underexplored. We study inference scaling laws and compute-optimal inference, focusing on the trade-offs between model sizes and generating additional tokens with different inference strategies. As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies such as greedy search, majority voting, best-of-n, weighted voting, and two different tree search algorithms, using different model sizes and compute budgets. Our findings indicate smaller models (e.g., Llemma-7B) can outperform larger models given the same computation budgets, and that smaller models paired with advanced inference algorithms yield Pareto-optimal cost-performance trade-offs. For instance, the Llemma-7B model, equipped with our novel tree search algorithm, consistently outperforms Llemma-34B with standard majority voting on the MATH benchmark across all FLOPs budgets. We hope these findings contribute to a broader understanding of inference scaling laws for LLMs.
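
The sampling-based strategies named above are easy to state concretely. A small sketch of majority voting, best-of-N, and weighted voting over N sampled answers, assuming a per-sample verifier or reward-model score is already available (names and numbers are illustrative).

```python
from collections import defaultdict

def majority_vote(answers):
    """Majority voting: most frequent final answer among N samples."""
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)

def best_of_n(answers, scores):
    """Best-of-N: pick the single answer with the highest verifier / reward-model score."""
    return max(zip(answers, scores), key=lambda pair: pair[1])[0]

def weighted_vote(answers, scores):
    """Weighted voting: sum verifier scores per distinct answer, return the argmax."""
    totals = defaultdict(float)
    for a, s in zip(answers, scores):
        totals[a] += s
    return max(totals, key=totals.get)

# Usage with hypothetical sampled answers and verifier scores:
answers = ["12", "12", "15", "12", "15"]
scores  = [0.2, 0.3, 0.9, 0.1, 0.8]
print(majority_vote(answers), best_of_n(answers, scores), weighted_vote(answers, scores))
# -> 12 15 15
```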

[3]: Model Interpretability for FM


Safety

In this session, our readings cover:

Required Readings:

Open Problems in Mechanistic Interpretability

  • [Submitted on 27 Jan 2025]
  • Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath
  • Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

Position-aware Automatic Circuit Discovery

  • Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
  • A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model’s computation graph that executes a specific task. We identify a gap in existing circuit discovery methods: they assume circuits are position-invariant, treating model components as equally relevant across input positions. This limits their ability to capture cross-positional interactions or mechanisms that vary across positions. To address this gap, we propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. First, we extend edge attribution patching, a gradient-based method for circuit discovery, to differentiate between token positions. Second, we introduce the concept of a dataset schema, which defines token spans with similar semantics across examples, enabling position-aware circuit discovery in datasets with variable length examples. We additionally develop an automated pipeline for schema generation and application using large language models. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.

More Readings:

Mechanistic Interpretability for AI Safety – A Review

  • [Submitted on 22 Apr 2024 (v1), last revised 23 Aug 2024 (this version, v3)]
  • Leonard Bereska, Efstratios Gavves
  • Understanding AI systems’ inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We examine benefits in understanding, control, alignment, and risks such as capability gains and dual-use concerns. We investigate challenges surrounding scalability, automation, and comprehensive interpretation. We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors and expand to domains such as vision and reinforcement learning. Mechanistic interpretability could help prevent catastrophic outcomes as AI systems become more powerful and inscrutable. (Accepted to TMLR.)

Linearity of Relation Decoding in Transformer Language Models

  • Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau
  • Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. However, we also identify many cases in which LM predictions capture relational knowledge accurately, but this knowledge is not linearly encoded in their representations. Our results thus reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs.
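
The core claim, that relation decoding is well approximated by a single affine map on the subject representation, can be illustrated with a toy fit. The sketch below uses least squares on synthetic stand-in vectors; the paper itself obtains the map from a first-order (Jacobian-based) approximation of the LM computed from a single prompt, so this is only an illustration of the affine-map idea.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 64, 32
S = rng.normal(size=(n_pairs, d))                                         # subject representations (stand-ins)
O = S @ rng.normal(size=(d, d)) + 0.01 * rng.normal(size=(n_pairs, d))    # "object" states (stand-ins)

# Fit a single affine map o ≈ W s + b by least squares on [S, 1] -> O.
S_aug = np.hstack([S, np.ones((n_pairs, 1))])
W_aug, *_ = np.linalg.lstsq(S_aug, O, rcond=None)
W, b = W_aug[:-1], W_aug[-1]

# Faithfulness of the affine approximation on the fitting pairs.
pred = S @ W + b
print("relative error:", np.linalg.norm(pred - O) / np.linalg.norm(O))
```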

Claude’s extended thinking

  • https://www.anthropic.com/news/visible-extended-thinking

Mapping the Mind of a Large Language Model

Using Dictionary Learning Features as Classifiers

Jailbreaking LLM-Controlled Robots

  • [Submitted on 17 Oct 2024 (v1), last revised 9 Nov 2024 (this version, v2)]
  • Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, George J. Pappas
  • The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a stand-alone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce RoboPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing, textual attacks on LLM chatbots, RoboPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that RoboPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: this https URL

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

  • Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
  • Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model’s worst-possible-case behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations than input-space attacks alone. We release models at this https URL

[4]: multimodal FMs - video / audio


Multimodal

In this session, our readings cover:

Required Readings:

A Survey on Speech Large Language Models

  • Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu
  • [Submitted on 24 Oct 2024 (v1), last revised 25 Oct 2024 (this version, v2)]
  • Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs in the broad sense of Spoken Language Understanding (SLU) field. Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition (ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference (Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs’ advancements in Rich Audio Transcription and its potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as LLM applications in multimodal contexts.

NVLM: Open Frontier-Class Multimodal LLMs

  • [Submitted on 17 Sep 2024 (v1), last revised 22 Oct 2024 (this version, v2)]
  • Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
  • We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we release the model weights at this https URL and will open-source the training code for the community soon.

  • There are two main approaches to building multimodal LLMs (Method A is sketched in code after the key-findings list below):
    • Method A: Unified Embedding Decoder Architecture approach;
    • Method B: Cross-modality Attention Architecture approach.
  • Directly compared:
    • Method A: The Unified Embedding Decoder Architecture (“decoder-only architecture,” NVLM-D),
    • Method B: The Cross-Modality Attention Architecture (“cross-attention-based architecture,” NVLM-X),
    • A hybrid approach (NVLM-H).
  • Key findings are as follows:
    • NVLM-X: Offers superior computational efficiency for high-resolution images.
    • NVLM-D: Delivers higher accuracy for OCR-related tasks.
    • NVLM-H: Combines the strengths of both approaches for optimal performance.
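
A minimal sketch of Method A, the unified embedding decoder approach: vision-encoder features are projected into the LLM's token-embedding space and simply concatenated with the text token embeddings before the decoder. Dimensions and the MLP connector below are illustrative toy choices, not the NVLM-D configuration.

```python
import torch
import torch.nn as nn

class UnifiedEmbeddingFusion(nn.Module):
    """Sketch of the 'unified embedding decoder' (Method A / decoder-only) approach:
    project vision patch features into the LLM token-embedding space and concatenate
    them with the text token embeddings. Dimensions are illustrative."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.projector = nn.Sequential(          # the usual MLP connector
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_feats, text_embeds):
        # patch_feats: (batch, n_patches, vision_dim); text_embeds: (batch, n_text, llm_dim)
        image_tokens = self.projector(patch_feats)
        return torch.cat([image_tokens, text_embeds], dim=1)  # fed to the decoder-only LLM

fusion = UnifiedEmbeddingFusion()
out = fusion(torch.randn(2, 256, 1024), torch.randn(2, 32, 4096))
print(out.shape)  # torch.Size([2, 288, 4096])
```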

LLMs Meet Multimodal Generation and Editing: A Survey

  • [Submitted on 29 May 2024 (v1), last revised 9 Jun 2024 (this version, v2)]
  • Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
  • With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at this https URL

More Readings:

Video Understanding with Large Language Models: A Survey

  • Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
  • [Submitted on 29 Dec 2023 (v1), last revised 24 Jul 2024 (this version, v4)]
  • With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at this https URL.

Beta Release of Zonos-v0.1

FEBRUARY 10, 2025, PALO ALTO, CALIFORNIA: We are excited to announce the release of Zonos-v0.1 beta, featuring two expressive and real-time text-to-speech (TTS) models with high-fidelity voice cloning. We are releasing our 1.6B transformer and 1.6B hybrid under an Apache 2.0 license.


[5]: More Model Serving Readings - SGLang + Chunked Prefill


In this session, our readings cover:

Readings on Efficient Model Serving:

SGLang: Efficient Execution of Structured Language Model Programs

  • [Submitted on 12 Dec 2023 (v1), last revised 6 Jun 2024 (this version, v2)]
  • SGLang: https://arxiv.org/pdf/2312.07104
  • Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng
  • Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at this https URL

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

  • Chunked Prefill: https://www.usenix.org/system/files/osdi24-agrawal.pdf
  • [Submitted on 4 Mar 2024 (v1), last revised 17 Jun 2024 (this version, v3)]
  • Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
  • Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that adds new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at this https URL.
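
The chunked-prefill idea can be sketched as a per-iteration batch builder: admit every ongoing decode first, then spend the leftover token budget on a slice of a pending prefill rather than the whole prompt. This is a toy sketch of the scheduling policy only, with invented names and numbers, not the Sarathi-Serve implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_len: int
    prefilled: int = 0          # prompt tokens already prefilled
    done_prefill: bool = False

def build_batch(ongoing_decodes, pending_prefills, token_budget=512):
    """Stall-free batching sketch: decodes are never paused; prefills are admitted in chunks
    that fit the remaining per-iteration token budget."""
    batch, used = [], 0
    for r in ongoing_decodes:                 # each decode contributes one token per iteration
        batch.append((r.req_id, "decode", 1))
        used += 1
    for r in pending_prefills:
        if used >= token_budget:
            break
        chunk = min(token_budget - used, r.prompt_len - r.prefilled)
        batch.append((r.req_id, "prefill", chunk))
        r.prefilled += chunk
        r.done_prefill = r.prefilled >= r.prompt_len
        used += chunk
    return batch

# Usage: one scheduling iteration with two active decodes and one long new prompt.
decodes = [Request(0, 100, 100, True), Request(1, 80, 80, True)]
new = [Request(2, 2000)]
print(build_batch(decodes, new))  # [(0, 'decode', 1), (1, 'decode', 1), (2, 'prefill', 510)]
```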

More Readings:

Orca: A Distributed Serving System for Transformer-Based Generative Models

  • Continuous Batching: https://www.usenix.org/system/files/osdi22-yu.pdf
  • Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University
  • Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have recently attracted huge interest, emphasizing the need for system support for serving models in this family. Since these models generate a next token in an autoregressive manner, one has to run the model multiple times to process an inference request where each iteration of the model generates a single output token for the request. However, existing systems for inference serving do not perform well on this type of workload that has a multi-iteration characteristic, due to their inflexible scheduling mechanism that cannot change the current batch of requests being processed; requests that have finished earlier than other requests in a batch cannot return to the client, while newly arrived requests have to wait until the current batch completely finishes. In this paper, we propose iteration-level scheduling, a new scheduling mechanism that schedules execution at the granularity of iteration (instead of request) where the scheduler invokes the execution engine to run only a single iteration of the model on the batch. In addition, to apply batching and iteration-level scheduling to a Transformer model at the same time, we suggest selective batching, which applies batching only to a selected set of operations. Based on these two techniques, we have implemented a distributed serving system called ORCA, with additional designs for scalability to models with hundreds of billions of parameters. Our evaluation on a GPT-3 175B model shows that ORCA can significantly outperform NVIDIA FasterTransformer in terms of both latency and throughput: 36.9× throughput improvement at the same level of latency.
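
Iteration-level (continuous) batching is easiest to see as a loop that re-forms the batch at every model step. A toy sketch with a stubbed model step follows; the per-iteration admission, immediate eviction of finished requests, and one-token-per-iteration structure are the point, everything else is invented for illustration.

```python
import random

def iteration_level_scheduler(waiting, max_batch_size=8, eos_prob=0.1):
    """Orca-style iteration-level scheduling sketch: finished requests leave the batch
    immediately and newly arrived requests join without waiting for the batch to drain.
    The model step is stubbed by a random end-of-sequence draw."""
    running = []
    while waiting or running:
        # Admit new requests up to the batch-size cap at every iteration.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.pop(0))
        # One model iteration: each running request emits exactly one token (stubbed).
        finished = [r for r in running if random.random() < eos_prob]
        for r in finished:
            running.remove(r)      # return to the client immediately
            yield r

# Usage:
for done in iteration_level_scheduler(list(range(20))):
    print("finished request", done)
```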

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

  • FlexGen: https://arxiv.org/pdf/2303.06865
  • [Submitted on 13 Mar 2023 (v1), last revised 12 Jun 2023 (this version, v2)]
  • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
  • The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at this https URL

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

  • NEO: https://arxiv.org/pdf/2411.01142
  • [Submitted on 2 Nov 2024]
  • Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu
  • Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLM models (i.e., 7B, 8B, 70B). NEO achieves up to 7.5×, 26%, and 14% higher throughput compared to GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to 79.3% throughput gain on A10G GPU.

Efficient LLM Scheduling by Learning to Rank

  • Shortest Job First: https://arxiv.org/pdf/2408.15792
  • [Submitted on 28 Aug 2024]
  • Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
  • In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption – we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at this https URL
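
The scheduling idea reduces to sorting the waiting queue by a learned proxy for output length rather than arrival time. A toy sketch, with a trivial length heuristic standing in for the learned ranker described in the paper.

```python
def rank_based_schedule(requests, predict_rank):
    """Approximate shortest-job-first: order the waiting queue by a predicted output-length
    rank (lower predicted generation length first) instead of first-come-first-serve.
    `predict_rank` is a stand-in for the learning-to-rank model."""
    return sorted(requests, key=predict_rank)

# Usage with a toy ranker that scores prompts by a crude length heuristic
# (the paper instead trains a model to predict relative output-length ranks).
queue = ["summarize this book ...", "2+2=?", "write a long essay about ..."]
print(rank_based_schedule(queue, predict_rank=lambda p: len(p)))
```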

[6]: LLM Alignment - PPO


Customization

In this session, our readings cover:

Required Readings:

PPO Readings:

a simple blogpost: Preference Tuning LLMs: PPO, DPO, GRPO — A Simple Guide
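
For reference alongside the guide above, a minimal sketch of the PPO clipped surrogate as applied per token in RLHF fine-tuning. Real recipes additionally add a per-token KL penalty against the frozen reference model, a value-function loss, and GAE for the advantages; the tensors below are dummies.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Per-token PPO clipped surrogate.
    logp_new / logp_old: log-probs of the sampled tokens under the current / rollout policy.
    advantages: per-token advantage estimates (e.g., from a value head + GAE)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Usage on dummy tensors (in practice these come from the policy model and rollout buffer):
logp_new = torch.randn(4, 16)
logp_old = logp_new.detach() + 0.05 * torch.randn(4, 16)
adv = torch.randn(4, 16)
print(ppo_clip_loss(logp_new, logp_old, adv))
```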

A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

  • [Submitted on 23 Jul 2024]
  • Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng
  • With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

  • [Submitted on 20 May 2024 (v1), last revised 24 Nov 2024 (this version, v4)]
  • Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, Yu Cao
  • As large language models (LLMs) continue to grow by scaling laws, reinforcement learning from human feedback (RLHF) has gained significant attention due to its outstanding performance. However, unlike pretraining or fine-tuning a single model, scaling reinforcement learning from human feedback (RLHF) for training large language models poses coordination challenges across four models. We present OpenRLHF, an open-source framework enabling efficient RLHF scaling. Unlike existing RLHF frameworks that co-locate four models on the same GPUs, OpenRLHF re-designs scheduling for the models beyond 70B parameters using Ray, vLLM, and DeepSpeed, leveraging improved resource utilization and diverse training approaches. Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts, which ensures user-friendliness. OpenRLHF implements RLHF, DPO, rejection sampling, and other alignment techniques. Empowering state-of-the-art LLM development, OpenRLHF’s code is available at \url{this https URL}.

Towards a Unified View of Preference Learning for Large Language Models: A Survey

  • [Submitted on 4 Sep 2024 (v1), last revised 31 Oct 2024 (this version, v5)]
  • Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, Shanghaoran Quan, Wen Xiao, Ge Zhang, Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng Liu, Zeyu Cui, Jian Yang, Lei Sha, Houfeng Wang, Zhifang Sui, Peiyi Wang, Tianyu Liu, Baobao Chang
  • Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM’s output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM’s performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

  • [Submitted on 23 Apr 2024 (v1), last revised 8 Feb 2025 (this version, v2)]
  • Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, Chitta Baral
  • This study evaluates Direct Preference Optimization (DPO) and its variants for aligning Large Language Models (LLMs) with human preferences, testing three configurations: (1) with Supervised Fine Tuning (SFT), (2) without SFT, and (3) without SFT but using an instruction tuned model. We further investigate how training set size influences model performance. Our evaluation spans 13 benchmarks covering dialogue, reasoning, mathematical problem-solving, question answering, truthfulness, MT-Bench, Big Bench, and the Open LLM Leaderboard. We find that: (1) alignment methods often achieve near optimal performance even with smaller subsets of training data; (2) although they offer limited improvements on complex reasoning tasks, they enhance mathematical problem-solving; and (3) using an instruction tuned model improves truthfulness. These insights highlight the conditions under which alignment methods excel, as well as their limitations.

More Readings:

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  • [Submitted on 22 Jan 2025]
  • DeepSeek-AI, (100 additional authors not shown)
  • We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
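
The DeepSeek reports describe training R1 with GRPO (group relative policy optimization), which drops the critic and instead normalizes each completion's reward within a group sampled for the same prompt. Under that assumption, a minimal sketch of the group-relative advantage computation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: for each prompt, sample a group of completions, then normalize
    each completion's scalar reward by the group mean and std. No critic / value model is needed.
    rewards: (num_prompts, group_size) tensor of rule-based or reward-model scores."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Usage: 2 prompts, 4 sampled completions each, 0/1 verifiable rewards.
rewards = torch.tensor([[1., 0., 0., 1.], [0., 0., 1., 0.]])
print(group_relative_advantages(rewards))
```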

[7]: more LLM basics - a survey


BasicLLM

In this session, our readings cover:

Required Readings:

Large Language Models: A Survey

  • Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao
  • [Submitted on 9 Feb 2024 (v1), last revised 20 Feb 2024 (this version, v2)]
  • Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs’ ability of general-purpose language understanding and generation is acquired by training billions of model’s parameters on massive amounts of text data, as predicted by scaling laws \cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.

Extra Readings:


[8]: LLM basics - emergent ability and GenAI platform


BasicLLM

Readings:

Emergent Abilities of Large Language Models

  • URL
  • “an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.”

Language Models are Few-Shot Learners

  • URL
  • “GPT-3, 175B autoregressive LLM; show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.”

GenAI Platform

Extra Readings:

A survey of Generative AI Applications

  • https://arxiv.org/abs/2306.02781
  • Generative AI has experienced remarkable growth in recent years, leading to a wide array of applications across diverse domains. In this paper, we present a comprehensive survey of more than 350 generative AI applications, providing a structured taxonomy and concise descriptions of various unimodal and even multimodal generative AIs. The survey is organized into sections, covering a wide range of unimodal generative AI applications such as text, images, video, gaming and brain information. Our survey aims to serve as a valuable resource for researchers and practitioners to navigate the rapidly expanding landscape of generative AI, facilitating a better understanding of the current state-of-the-art and fostering further innovation in the field.

[9]: Introduction


BasicLLM

Readings:

Basics of ML and DL:

Basics of NLP

  • URL
  • Typical NLP tasks / Challenges / Pipeline
  • f() on natural language
    • Before Deep NLP (pre-2012): BOW / LSI / Topic Modeling (LDA)
    • Word2Vec (2013-2016): GloVe / FastText
    • Recurrent NN (2014-2016): LSTM
    • Seq2Seq
    • Attention
    • Self-Attention (2016 – now)
    • Transformer (attention-only Seq2Seq); a minimal self-attention sketch follows this list
    • BERT / RoBERTa / XLNet / GPT / …
  • A good code walk through on transformer at URL
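
Since the timeline above ends at self-attention and the Transformer, here is a minimal single-head scaled dot-product self-attention in PyTorch (toy sizes, no masking or multi-head machinery), just to make the core operation concrete.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of shape (seq, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)      # (seq, seq) similarity of every token pair
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ v                           # weighted mix of value vectors

d = 16
x = torch.randn(10, d)
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([10, 16])
```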


[10]: Open Source LLM - Mistral Data preparation


BasicLLM

In this session, our readings cover:

Required Readings:

Mistral 7B

  • https://mistral.ai/news/announcing-mistral-7b/
  • We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses the Llama 2 13B – Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
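
Sliding window attention is essentially a restriction of the causal mask: each token attends to at most the previous `window` positions, which bounds KV-cache cost for long sequences. A minimal sketch of that mask; the window size and sequence length here are toy values, not the released Mistral configuration.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean attention mask for sliding window attention: token i may attend only to
    tokens j with i - window < j <= i (causal + bounded lookback). True = attend."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(6, 3).int())
# Each row has at most 3 ones: the token itself plus the 2 previous positions.
```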

More Readings:

OLMo: Accelerating the Science of Language Models

  • https://arxiv.org/abs/2402.00838

  • Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model and its framework to build and study the science of language modeling. Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation.

Mixtral of Experts

  • https://arxiv.org/abs/2401.04088
  • We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
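
The routing described above is compact to sketch: a linear router scores 8 expert MLPs per token, the top 2 are evaluated, and their outputs are mixed with the renormalized router weights. Toy dimensions and a plain SiLU MLP stand in for the real expert blocks; this illustrates the routing pattern, not Mixtral's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Sketch of a Mixtral-style sparse MoE feed-forward layer with top-2 routing."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = self.router(x)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):  # loop form for clarity, not efficiency
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

print(Top2MoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```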

Llama 2: Open Foundation and Fine-Tuned Chat Models

  • In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

  • https://arxiv.org/abs/2101.00027
  • Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present \textit{the Pile}: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets – both existing and newly constructed – many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

[11]: Survey human alignment


Alignment

In this session, our readings cover:

Required Readings:

Aligning Large Language Models with Human: A Survey

  • https://arxiv.org/abs/2307.12966
  • https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo
  • https://huggingface.co/blog/stackllama

More readings

Github Awesome-RLHF

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

  • https://arxiv.org/abs/2301.13688
  • We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at this https URL.

DPO Direct Preference Optimization: Your Language Model is Secretly a Reward Model

  • https://arxiv.org/abs/2305.18290
  • https://huggingface.co/blog/dpo-trl
  • While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
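
The closed-form objective is short enough to write out. A minimal sketch of the DPO loss over a batch of preference pairs, taking summed response log-probabilities under the policy and the frozen reference model as inputs (beta is the usual temperature on the implicit reward); the numbers are dummies.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective for a batch of preference pairs: each argument is the summed log-prob
    of the chosen / rejected response under the policy or the frozen reference model.
    Loss = -log sigmoid(beta * [(policy - reference) margin of chosen minus that of rejected])."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Usage on dummy per-response log-probabilities:
lp_c, lp_r = torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0])
ref_c, ref_r = torch.tensor([-12.5, -9.7]), torch.tensor([-13.5, -9.4])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```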

Training language models to follow instructions with human feedback

  • https://arxiv.org/abs/2203.02155
  • “further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT.”
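
A common way to implement the RL stage referenced in this quote (and discussed in the RLHF-with-PPO blog post linked above) is to optimize a per-token reward that combines the reward model's score with a KL penalty toward the supervised reference policy. The sketch below is an assumption-laden illustration of that reward shaping, not the InstructGPT implementation; the coefficient and function names are made up.

```python
import torch

def rlhf_token_rewards(reward_model_score, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Per-token rewards for the PPO stage of an RLHF pipeline.

    policy_logprobs, ref_logprobs: (T,) log-probs of the sampled response
    tokens, computed at rollout time (treated as constants here).
    reward_model_score: scalar score for the whole (prompt, response) pair."""
    kl_per_token = (policy_logprobs - ref_logprobs).detach()  # per-token KL estimate
    rewards = -kl_coef * kl_per_token                          # penalize drift from reference
    rewards[-1] = rewards[-1] + reward_model_score             # add RM score on the final token
    return rewards
```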

Deep reinforcement learning from human preferences

  • https://openreview.net/forum?id=GisHNaleWiA
  • “explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function”
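
The core of this approach is a Bradley-Terry style likelihood: the probability that a human prefers segment A over segment B is a softmax over the summed predicted rewards of the two segments. The sketch below assumes a `reward_model` callable that maps a segment to per-step rewards; that interface and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, segment_a, segment_b, human_prefers_a):
    """Loss for learning a reward model from one human preference judgment.

    reward_model: callable mapping a (T, feature_dim) segment to (T,) rewards.
    human_prefers_a: 1.0 if the annotator preferred segment A, else 0.0."""
    r_a = reward_model(segment_a).sum()
    r_b = reward_model(segment_b).sum()
    log_probs = F.log_softmax(torch.stack([r_a, r_b]), dim=0)
    target = torch.tensor(human_prefers_a)
    # Cross-entropy against the (possibly soft) human preference label.
    return -(target * log_probs[0] + (1.0 - target) * log_probs[1])
```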

[12]: LLM evaluating framework


LLMEvaluate

In this session, our readings cover:

Required Readings:

Holistic Evaluation of Text-To-Image Models

  • https://arxiv.org/abs/2311.04287
  • The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at this https URL and the code at this https URL, which is integrated with the HELM codebase.

Holistic Evaluation of Language Models

  • https://arxiv.org/abs/2211.09110

More Readings:

Challenges in evaluating AI systems

  • https://www.anthropic.com/news/evaluating-ai-systems

Evaluating Large Language Models: A Comprehensive Survey

  • https://arxiv.org/abs/2310.19736
  • This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation and safety evaluation. In addition to the comprehensive review on the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs’ performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability.

Evaluating Large Language Models Trained on Code

  • https://arxiv.org/abs/2107.03374

chatbot-arena-leaderboard

  • https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

Leveraging Large Language Models for NLG Evaluation: A Survey

  • https://arxiv.org/abs/2401.07103

[13]: LLM basics


BasicLLM

Required Readings:

Emergent Abilities of Large Language Models

  • URL
  • “an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.”

Language Models are Few-Shot Learners

  • URL
  • “GPT-3, 175B autoregerssive LLM; show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.”

Extra Readings:

A survey of Generative AI Applications

  • https://arxiv.org/abs/2306.02781
  • Generative AI has experienced remarkable growth in recent years, leading to a wide array of applications across diverse domains. In this paper, we present a comprehensive survey of more than 350 generative AI applications, providing a structured taxonomy and concise descriptions of various unimodal and even multimodal generative AIs. The survey is organized into sections, covering a wide range of unimodal generative AI applications such as text, images, video, gaming and brain information. Our survey aims to serve as a valuable resource for researchers and practitioners to navigate the rapidly expanding landscape of generative AI, facilitating a better understanding of the current state-of-the-art and fostering further innovation in the field.

Generative AI: Perspectives from Stanford HAI

  • https://hai.stanford.edu/generative-ai-perspectives-stanford-hai

[14]: RLHF + InstructGPT


RL AGI language model Human Alignment

Training language models to follow instructions with human feedback

  • URL
  • “further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT.”

Deep reinforcement learning from human preferences

  • URL
  • “explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function”

[15]: Stable diffusion + DreamBooth + LoRA


Diffusion Image synthesis Efficiency

Stable diffusion

  • URL
  • “High-Resolution Image Synthesis with Latent Diffusion Models”

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

  • URL
  • “personalization” of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject.”

LoRA: Low-Rank Adaptation of Large Language Models

  • URL
  • “propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.”
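
The quoted description maps directly onto a small module: freeze the pretrained weight and add a trainable low-rank update B·A in parallel. This is a minimal PyTorch sketch; the rank, scaling, and initialization choices are illustrative defaults, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear: its weights are frozen and the update is
    parameterized as a low-rank product B @ A, so only r*(in+out) parameters train."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # up-projection, zero init
        self.scale = alpha / r

    def forward(self, x):
        # y = frozen_linear(x) + scale * x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```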

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

  • https://arxiv.org/abs/2208.01618
  • Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or
  • Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new “words” in the embedding space of a frozen text-to-image model. These “words” can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.
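
A minimal sketch of the idea above: only a single new embedding vector for the pseudo-word is optimized, while the text encoder and diffusion model stay frozen. The embedding size, learning rate, and helper name below are assumptions for illustration, not the paper's exact values.

```python
import torch

embed_dim = 768                                       # assumed text-embedding size
s_star = torch.nn.Parameter(torch.randn(embed_dim) * 0.02)
optimizer = torch.optim.AdamW([s_star], lr=5e-3)      # only the new "word" is optimized

def inject_pseudo_word(token_embeddings: torch.Tensor, placeholder_idx: int):
    """Substitute S* for the placeholder token's embedding before the frozen
    text encoder runs; the diffusion denoising loss backpropagates only into s_star."""
    out = token_embeddings.clone()
    out[placeholder_idx] = s_star
    return out
```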

[16]: Emergent Abilities of LLM


language model

Emergent Abilities of Large Language Models

  • URL
  • “an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.”

Language Models are Few-Shot Learners

  • URL
  • “GPT-3, 175B autoregerssive LLM; show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.”

On the Opportunities and Risks of Foundation Models

  • URL
  • “a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations).”

The Power of Scale for Parameter-Efficient Prompt Tuning

  • https://arxiv.org/abs/2104.08691
  • Brian Lester, Rami Al-Rfou, Noah Constant
  • In this work, we explore “prompt tuning”, a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3’s “few-shot” learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method “closes the gap” and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed “prefix tuning” of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.
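
A minimal sketch of prompt tuning as described above: a frozen language model is conditioned on a small number of learned "soft prompt" embeddings prepended to the embedded input, and only those embeddings receive gradients. The sketch assumes the model accepts an `inputs_embeds` argument (as Hugging Face transformers models do); names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Condition a frozen LM on n_tokens trainable prompt embeddings."""
    def __init__(self, frozen_lm, embedding: nn.Embedding, n_tokens: int = 20):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad_(False)                       # only the soft prompt trains
        self.embedding = embedding
        self.soft_prompt = nn.Parameter(
            torch.randn(n_tokens, embedding.embedding_dim) * 0.02)

    def forward(self, input_ids, **kwargs):
        tok = self.embedding(input_ids)                                   # (B, T, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        # Attention masks / labels would need the same prefix padding; omitted here.
        return self.lm(inputs_embeds=torch.cat([prompt, tok], dim=1), **kwargs)
```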

[17]: A Generalist Agent + offline RL + UniMask


RL AGI

A Generalist Agent

  • URL
  • Gato works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.

Why should we prefer offline reinforcement learning over behavioral cloning? (ICLR 2022)

  • URL
  • It is natural to ask: when can an offline RL method outperform BC with an equal amount of expert data, even when BC is a natural choice?

Uni[MASK]: Unified Inference in Sequential Decision Problems

  • URL
  • Shows how sequential decision making tasks can be thought of in terms of corresponding input maskings, enabling the training of a single model to perform all tasks at once. This applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns (see the sketch below).
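
A minimal sketch of the masking view described above, assuming a trajectory laid out as interleaved state and action tokens; the task names and the simplification to a single visibility mask (the paper also uses causal patterns and return tokens) are assumptions for illustration.

```python
import numpy as np

def visibility_mask(task: str, T: int) -> np.ndarray:
    """Per-token visibility mask over 2*T interleaved (state, action) tokens;
    1 = visible to the model, 0 = to be predicted."""
    mask = np.ones(2 * T, dtype=np.int64)
    states, actions = np.arange(0, 2 * T, 2), np.arange(1, 2 * T, 2)
    if task == "behavior_cloning":
        mask[actions] = 0            # predict actions from states
    elif task == "forward_dynamics":
        mask[states[1:]] = 0         # predict future states from s_0 and actions
    elif task == "waypoint_conditioning":
        mask[actions] = 0            # predict actions given only the first
        mask[states[1:-1]] = 0       # state and the final (goal) state
    return mask
```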


