Reviews Indexed

Toggle Menu

Index
Recent Posts By GenAI Category
- FM Basic
- FM Adapt
- FM Risk
- FM Reasoning
- FM Agent
- FM Platform
- FM Efficiency

Platform

Recent Readings for Platform Topics of Foundation Models (since 2022) (Index of Posts):

No.	Read Date	Title and Information	We Read @
1	2025, Mar, 5	Platform - Model Serving	2025-S4
2	2025, Mar, 3	Platform - Model Customization - instruction tuning / LoRA	2025-S4
3	2025, Feb, 26	Platform - VLM Jailbreaking / Probing	2025-S4
4	2025, Feb, 24	Platform - Model Jailbreaking / Safeguarding	2025-S4
5	2025, Feb, 19	Platform - More agent related	2025-S4
6	2025, Feb, 17	Platform - Agent Tooling	2025-S4
7	2025, Feb, 12	Platform - Context construction via RAG and Agent	2025-S4
8	2025, Feb, 10	Platform - Prompting Engineering tools / Prompt Compression	2025-S4

Here is a detailed list of posts!

[1]: Platform - Model Serving

read on: - 05 Mar 2025
Serving

In this session, our readings cover:

Required Readings:

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
[Submitted on 12 Sep 2023]
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM’s source code is publicly available at this https URL

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

covering: vLLM, continuous batching, chunked prefill, fair scheduling, KV cache management, and disaggregated serving,
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.
https://arxiv.org/pdf/2312.15234

A Survey on Large Language Model Acceleration based on KV Cache Management

URL
[Submitted on 27 Dec 2024 (v1), last revised 2 Jan 2025 (this version, v2)]
Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen
Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments. Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications.

[2]: Platform - Model Customization - instruction tuning / LoRA

read on: - 03 Mar 2025
Customization

In this session, our readings cover:

Required Readings:

Low-Rank Adaptation for Foundation Models: A Comprehensive Review

Menglin Yang, Jialin Chen, Yifei Zhang, Jiahong Liu, Jiasheng Zhang, Qiyao Ma, Harshit Verma, Qianru Zhang, Min Zhou, Irwin King, Rex Ying
[Submitted on 31 Dec 2024]
The rapid advancement of foundation modelslarge-scale neural networks trained on diverse, extensive datasetshas revolutionized artificial intelligence, enabling unprecedented advancements across domains such as natural language processing, computer vision, and scientific discovery. However, the substantial parameter count of these models, often reaching billions or trillions, poses significant challenges in adapting them to specific downstream tasks. Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges, offering a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead. This survey provides the first comprehensive review of LoRA techniques beyond large Language Models to general foundation models, including recent techniques foundations, emerging frontiers and applications of low-rank adaptation across multiple domains. Finally, this survey discusses key challenges and future research directions in theoretical understanding, scalability, and robustness. This survey serves as a valuable resource for researchers and practitioners working with efficient foundation model adaptation.

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

[Submitted on 28 Jan 2025]
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize out-of-distribution scenarios. Further analysis reveals that RL improves the model’s underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL’s superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model’s output format, enabling subsequent RL to achieve its performance gains. These findings demonstrates the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks. Comments: Website at this https URL

DoRA: Weight-Decomposed Low-Rank Adaptation

[Submitted on 14 Feb 2024 (v1), last revised 9 Jul 2024 (this version, v6)]
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen
Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing \ours, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. \ours~consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. Code is available at this https URL.

[3]: Platform - VLM Jailbreaking / Probing

read on: - 26 Feb 2025
Jailbreaking Safety

In this session, our readings cover:

Required Readings:

garak: A Framework for Security Probing Large Language Models

[Submitted on 16 Jun 2024]
Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie As Large Language Models (LLMs) are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly. However, LLM security is a moving target: models produce unpredictable output, are constantly updated, and the potential adversary is highly diverse: anyone with access to the internet and a decent command of natural language. Further, what constitutes a security weak in one context may not be an issue in a different context; one-fits-all guardrails remain theoretical. In this paper, we argue that it is time to rethink what constitutes ``LLM security’’, and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper introduces garak (Generative AI Red-teaming and Assessment Kit), a framework which can be used to discover and identify vulnerabilities in a target LLM or dialog system. garak probes an LLM in a structured fashion to discover potential vulnerabilities. The outputs of the framework describe a target model’s weaknesses, contribute to an informed discussion of what composes vulnerabilities in unique contexts, and can inform alignment and policy discussions for LLM deployment.

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

[Submitted on 16 Aug 2024 (v1), last revised 22 Oct 2024 (this version, v4)]
Fenghua Weng, Yue Xu, Chengyan Fu, Wenjie Wang
As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Multimodal Large Language Models (MLLMs), have shown exceptional performance in many real-world tasks. However, MLLMs face significant security challenges, such as jailbreak attacks, where attackers attempt to bypass the model’s safety alignment to elicit harmful responses. The threat of jailbreak attacks on MLLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that MLLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations, as each method is evaluated on different dataset and metrics, making it impossible to compare the effectiveness of each method. To address this gap, we introduce \textit{MMJ-Bench}, a unified pipeline for evaluating jailbreak attacks and defense techniques for MLLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA MLLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contribute to the field by offering a unified and systematic evaluation framework and the first public-available benchmark for MLLM jailbreak research. We also demonstrate several insightful findings that highlights directions for future studies.

[4]: Platform - Model Jailbreaking / Safeguarding

read on: - 24 Feb 2025
Jailbreaking Safety

In this session, our readings cover:

Required Readings:

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O’Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
[Submitted on 31 Jan 2025]
Large language models (LLMs) are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek
[Submitted on 21 Feb 2024 (v1), last revised 17 May 2024 (this version, v2)]
Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of “jailbreaking”, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain. Comments: 18 pages, 9 figures, Accepted in ACL 2024

[5]: Platform - More agent related

read on: - 19 Feb 2025
Agent

In this session, our readings cover:

Required Readings:

Agent Tools/Libraries:

Introduces a cohesive AutoGen ecosystem that includes the framework, developer tools, and applications. The framework’s layered architecture clearly defines each layer’s functionality. It supports both first-party and third-party applications and extensions. Microsoft Research announces AutoGen v0.4, a major update to their multi-agent AI framework. The new version introduces a complete redesign with an asynchronous, event-driven architecture that improves code quality, robustness, and scalability. Key features include modular components, built-in debugging tools, cross-language support, and enhanced observability through OpenTelemetry integration.

The update brings a new three-layered framework architecture consisting of core building blocks, AgentChat API, and extensions. It also introduces improved developer tools including AutoGen Bench for performance testing, an upgraded AutoGen Studio with real-time agent updates and visual team building, and Magentic-One, a new generalist multi-agent application for handling web and file-based tasks. The release maintains backward compatibility through the AgentChat API, making the migration from v0.2 straightforward while adding new capabilities like streaming messages and improved task progress management.

https://docs.ag2.ai/docs/blog/2025-02-13-DeepResearchAgent/index

one Survey Blogpost on agent2024 …

https://open.substack.com/pub/victordibia/p/ai-agents-2024-rewind-a-year-of-building?r=ya7nu&utm_medium=ios

OpenAI Operator

https://cdn.openai.com/operator_system_card.pdf
“OpenAI introduces Operator, a research preview of a browser-controlling agent available to Pro users in the U.S. Powered by the Computer-Using Agent (CUA) model, Operator can perform web-based tasks like filling forms, ordering groceries, and creating memes by interacting with graphical interfaces through typing, clicking, and scrolling. The agent leverages GPT-4o’s vision capabilities and reinforcement learning to navigate websites without requiring API integrations.”
“multiple safety features, including user takeover mode for sensitive information, task limitations, and defenses against malicious websites. OpenAI is partnering with companies like DoorDash, Instacart, and Uber to refine the technology, while also exploring public sector applications. “

[6]: Platform - Agent Tooling

read on: - 17 Feb 2025
Agent

In this session, our readings cover:

Required Readings:

eBook: Mastering AI Agents

Learn how to create powerful, reliable AI agents with Galileo’s in-depth eBook
URL
The book is divided into five chapters:
Chapter 1 introduces AI agents, their optimal applications, and scenarios where they might be excessive. It covers various agent types and includes three real-world use cases to illustrate their potential.
Chapter 2 details three frameworks—LangGraph, Autogen, and CrewAI—with evaluation criteria to help choose the best fit. It ends with case studies of companies using these frameworks for specific AI tasks.
Chapter 3 explores the evaluation of an AI agent through a step-by-step example of a finance research agent.
Chapter 4 explores how to measure agent performance across systems, task completion, quality control, and tool interaction, supported by five detailed use cases.
Chapter 5 addresses why many AI agents fail and offers practical solutions for successful AI deployment.

[7]: Platform - Context construction via RAG and Agent

read on: - 12 Feb 2025
RAG Agent

In this session, our readings cover:

Required Readings:

ReAct: Synergizing Reasoning and Acting in Language Models

https://arxiv.org/abs/2210.03629
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
[Submitted on 6 Oct 2022 (v1), last revised 10 Mar 2023 (this version, v3)]
While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: this https URL

Agent white paper

https://www.kaggle.com/whitepaper-agents
Google recently published a whitepaper on AI Agents that everyone should read. It covers everything you need to know about this new wave.
- Introduction to AI Agents
- The role of tools in Agents
- Enhancing model performance with targeted learning
- Quick start to Agents with LangChain
- Production applications with Vertex AI Agents

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

[Submitted on 15 Jan 2025]
Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei
Large Language Models (LLMs) have revolutionized artificial intelligence (AI) by enabling human like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real time queries, resulting in outdated or inaccurate outputs. Retrieval Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real time data retrieval to provide contextually relevant and up-to-date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multistep reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns reflection, planning, tool use, and multiagent collaboration to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows to meet complex task requirements. This integration enables Agentic RAG systems to deliver unparalleled flexibility, scalability, and context awareness across diverse applications. This survey provides a comprehensive exploration of Agentic RAG, beginning with its foundational principles and the evolution of RAG paradigms. It presents a detailed taxonomy of Agentic RAG architectures, highlights key applications in industries such as healthcare, finance, and education, and examines practical implementation strategies. Additionally, it addresses challenges in scaling these systems, ensuring ethical decision making, and optimizing performance for real-world applications, while providing detailed insights into frameworks and tools for implementing Agentic RAG

[8]: Platform - Prompting Engineering tools / Prompt Compression

read on: - 10 Feb 2025
Prompting

In this session, our readings cover:

Required Readings:

The Prompt Report: A Systematic Survey of Prompting Techniques

URL
[Submitted on 6 Jun 2024 (v1), last revised 30 Dec 2024 (this version, v5)]
Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, Hevander Da Costa, Saloni Gupta, Megan L. Rogers, Inna Goncearenco, Giuseppe Sarli, Igor Galynker, Denis Peskoff, Marine Carpuat, Jules White, Shyamal Anadkat, Alexander Hoyle, Philip Resnik
Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, it suffers from conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt due to its relatively recent emergence. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a detailed vocabulary of 33 vocabulary terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering, including advice for prompting state-of-the-art (SOTA) LLMs such as ChatGPT. We further present a meta-analysis of the entire literature on natural language prefix-prompting. As a culmination of these efforts, this paper presents the most comprehensive survey on prompt engineering to date.

Prompt Compression for Large Language Models: A Survey

[Submitted on 16 Oct 2024 (v1), last revised 17 Oct 2024 (this version, v2)]
URL
Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier
Leveraging large language models (LLMs) for complex natural language tasks typically requires long-form prompts to convey detailed requirements and information, which results in increased memory usage and inference costs. To mitigate these challenges, multiple efficient methods have been proposed, with prompt compression gaining significant research interest. This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods. First, the technical approaches of these methods are compared, followed by an exploration of various ways to understand their mechanisms, including the perspectives of attention optimization, Parameter-Efficient Fine-Tuning (PEFT), modality integration, and new synthetic language. We also examine the downstream adaptations of various prompt compression techniques. Finally, the limitations of current prompt compression methods are analyzed, and several future directions are outlined, such as optimizing the compression encoder, combining hard and soft prompts methods, and leveraging insights from multimodality.

Here is a name list of posts!

BackTop

Platform

Recent Readings for Platform Topics of Foundation Models (since 2022) (Index of Posts):

Here is a detailed list of posts!

[1]: Platform - Model Serving

Required Readings:

Efficient Memory Management for Large Language Model Serving with PagedAttention

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

A Survey on Large Language Model Acceleration based on KV Cache Management

More reading:

Multiple system ML readings

[2]: Platform - Model Customization - instruction tuning / LoRA

Required Readings:

Low-Rank Adaptation for Foundation Models: A Comprehensive Review

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

DoRA: Weight-Decomposed Low-Rank Adaptation

More Readings:

[3]: Platform - VLM Jailbreaking / Probing

Required Readings:

garak: A Framework for Security Probing Large Language Models

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

More Readings:

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Safeguarding Large Language Models: A Survey

[4]: Platform - Model Jailbreaking / Safeguarding

Required Readings:

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

More Readings:

Auditing Prompt Caching in Language Model APIs

New GenAI simulation and evaluation tools in Azure AI Studio

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Beyond Benchmarks: On The False Promise of AI Regulation

[5]: Platform - More agent related

Required Readings:

Agent Tools/Libraries:

one Survey Blogpost on agent2024 …

OpenAI Operator

More Readings:

[6]: Platform - Agent Tooling

Required Readings:

eBook: Mastering AI Agents

More Readings:

[7]: Platform - Context construction via RAG and Agent

Required Readings:

ReAct: Synergizing Reasoning and Acting in Language Models

Agent white paper

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

More Readings:

[8]: Platform - Prompting Engineering tools / Prompt Compression

Required Readings:

The Prompt Report: A Systematic Survey of Prompting Techniques

Prompt Compression for Large Language Models: A Survey

More Readings:

Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing

Here is a name list of posts!