2025 Spring UVa CS Generative AI Seminar Lectures Organized by Given Order

No. Title
1 Introduction
2 LLM basics - Preference Alignment
3 Survey - LLM Agents
4 Survey - LLM Agents
5 Tooling for LLM agents
6 Tooling / Action for LLM agents
7 Survey - Agents Applications
8 Survey - Agents Applications
9 Agent - in Healthcare
10 Agent - Perception
11 Agent Brain - Reasoning
12 Agent Brain - Reasoning
13 Agent - Memory
14 Agent - Memory
15 Model Serving for Agents
16 Model Serving for Agents
17 Agent Evaluation
18 Agent Safety
19 Agent - Planning / Test test scaling
20 Agent - Planning
21 Agent - World model
22 Agent - World model
23 Agent - Multiagent collaboration
24 Agent - Multiagent collaboration
25 Agents Optimization
26 Agents Optimization
27 buffer
---- ----

1.Introduction

  • Team: deep learning basics
BasicLLM

Summary of Post :

Background Readings:

Basics of ML and DL:

Basics of deep NLP

  • URL
  • Typical NLP tasks / Challenges / Pipeline
  • f() on natural language
    • Before Deep NLP (Pre 2012) • (BOW / LSI / Topic Modeling LDA )
    • Word2Vec (2013-2016) • (GloVe/ FastText)
    • Recurrent NN (2014-2016) • LSTM
    • Seq2Seq
    • Attention
    • Self-Attention (2016 – now )
    • Transformer (attention only Seq2Seq)
    • BERT / RoBERTa/ XLNet/ GPT / …
  • A good code walk through on transformer at URL

Please click each post's URL shown below to check out its full contents.

2.LLM basics - Preference Alignment

  • Team: Basic Preference Optimization
BasicLLM

Summary of Post :

In this session, our readings cover:

Reading

A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

  • [Submitted on 23 Jul 2024]
  • Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James)Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire)Cheng
  • With advancements in self-supervised learning, the availability of trillions tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectation. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.

    Extra Readings:


Please click each post's URL shown below to check out its full contents.

3.Survey - LLM Agents

Agent Components

Summary of Post :

In this session, our readings cover:

Reading on: FOUNDATIONS - The Agent “Basics” Components

Review the core components of LLM agent architectures: Brain (Reasoning Engine), Perception (Input Processing), Memory Systems, Action & Tools, Planning & Orchestration, Multi-Agent Collaboration, and Safety & Evaluation.

┌─────────────────────────────────────────────────────────────┐
│                     AGENT ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  BRAIN (Reasoning Engine) ────────────────────┐            │
│   ↓                                            │            │
│  PERCEPTION (Input Processing) ←───────────────┤            │
│   ↓                                            │            │
│  MEMORY (Context & Knowledge) ←────────────────┤            │
│   ↓                    ↓                       │            │
│  WORLD MODEL (Environment Understanding) ←─────┤            │
│   ↓                                            │            │
│  PLANNING (Task Decomposition) ←───────────────┤            │
│   ↓                                            │            │
│  ACTION (Tool Use & Execution) ←───────────────┤            │
│   ↓                                            │            │
│  MULTI-AGENT (Collaboration) ←─────────────────┤            │
│   ↓                                            │            │
│  SAFETY & EVALUATION ──────────────────────────┘            │
│   ↓                                                          │
│  DEPLOYMENT & SERVING                                        │
│   ↓                                                          │
│  APPLICATIONS                                               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Core Component: LLM as the Central Reasoning Engine

Understanding the foundation model that serves as the “brain” of agentic systems - the core reasoning, language understanding, and decision-making capabilities.

Key Concepts: Deep neural networks, transformer architecture, emergent abilities, multimodal capabilities, recent architectural advances

Topic Slide Deck Previous Semester
Introduction to Deep NLP Basics W1.1-deepNNtext 25course
LLM Basics - Emergent Ability and GenAI Platform W1.2-IntroLLMv3 25course
More LLM Basics - A Survey W2.1-moreLLM 25course
LLM Basics Foundation S0-Intro 24course
Survey: LLMs and Multimodal FMs S1-LLM 24course
Recent LLM Basics W13-RecentLLMbasics 24course
Advanced Transformer Architectures W14_LLM_advanced_arch 24course

2025 HIGH-IMPACT PAPERS on this topic

  • a. Large Language Model Agent: A Survey on Methodology, Applications and Challenges (March 2025)
    • Link: https://arxiv.org/abs/2503.21460
    • GitHub: https://github.com/luo-junyu/Awesome-Agent-Papers
    • Framework Coverage: Brain-Perception-Action model, memory systems, planning mechanisms, multi-agent coordination, evolutionary pathways, evaluation methodologies
  • b. A Survey on Large Language Model based Autonomous Agents** (Updated March 2025)
    • arXiv: https://arxiv.org/abs/2308.11432
    • Unified framework: Brain (profiling, memory, planning, action)
    • Extensive application coverage: single-agent, multi-agent, human-agent cooperation
    • Agent societies analysis: behavior, personality, social phenomena
  • c. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? (September 2025)
    • arXiv: https://arxiv.org/abs/2509.16941
    • Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public
    • 1,865 problems from 41 actively maintained repositories
    • Enterprise-level complexity: Tasks requiring hours to days for professional engineers
    • Multi-file modifications: Substantial code changes across repositories
    • Three datasets: Public (11 repos), held-out (12 repos), commercial (18 proprietary repos)
    • Contamination-resistant: GPL-licensed and commercial codebases
  • d. From LLMs to LLM-based Agents for Software Engineering: A Survey (August 2024, Updated 2025)
    • Link: https://arxiv.org/html/2408.02479v2
    • Six key topics: Requirement engineering, code generation, autonomous decision-making, software design, test generation, software maintenance
  • e. LLM-Powered AI Agent Systems and Their Applications in Industry (May 2025)
    • Link: https://arxiv.org/html/2505.16120v1
  • f. A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools (2025)
    • Referenced in: https://github.com/luo-junyu/Awesome-Agent-Papers
    • Comprehensive taxonomy of FMs in materials science
    • Reviews advances, resources, and future directions
    • Integration of agents in materials discovery workflows

More Readings:

A Survey on Large Language Model based Autonomous Agents

  • [Submitted on 22 Aug 2023 (v1), last revised 15 Dec 2024 (this version, v6)]
  • URL
  • Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen
  • Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at this https URL. Comments: change several 35 pages, 5 figures, 3 tables

Deploying Foundation Model Powered Agent Services: A Survey

  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen
  • [Submitted on 18 Dec 2024]
  • Foundation model (FM) powered agent services are regarded as a promising solution to develop intelligent and personalized applications for advancing toward Artificial General Intelligence (AGI). To achieve high reliability and scalability in deploying these agent services, it is essential to collaboratively optimize computational and communication resources, thereby ensuring effective resource allocation and seamless service delivery. In pursuit of this vision, this paper proposes a unified framework aimed at providing a comprehensive survey on deploying FM-based agent services across heterogeneous devices, with the emphasis on the integration of model and resource optimization to establish a robust infrastructure for these services. Particularly, this paper begins with exploring various low-level optimization strategies during inference and studies approaches that enhance system scalability, such as parallelism techniques and resource scaling methods. The paper then discusses several prominent FMs and investigates research efforts focused on inference acceleration, including techniques such as model compression and token reduction. Moreover, the paper also investigates critical components for constructing agent services and highlights notable intelligent applications. Finally, the paper presents potential research directions for developing real-time agent services with high Quality of Service (QoS).

Please click each post's URL shown below to check out its full contents.

4.Survey - LLM Agents

Agent Components

Summary of Post :

In this session, our readings cover:

Required Readings:

Extra Readings:


Please click each post's URL shown below to check out its full contents.

5.Tooling for LLM agents

  • Team: connect agents to tools
MCP

Summary of Post :

In this session, our readings cover:

Required Readings: ACTION & TOOL USE

Understanding how agents tooling frameworks and how they execute actions through external tools, APIs, and interfaces.

Core Component: Agent-Computer Interface (ACI) - How Agents Interact with Tools and Systems Key Concepts: Prompt engineering, tool calling, function APIs, agent tooling frameworks, efficient tool use

Topic Slide Deck Previous Semester
Platform - Prompting Engineering Tools / Compression W5.1.Team5-Prompt 25course
Platform - Agent Tooling W6.1-team2-master-ai-agent-book-review 25course
Platform - More Agent Related W6.2-team2-agent24-full 25course
Prompt Engineering W11-team-2-prompt-engineering-2 24course
Bonus Session: KV Cache, Tooling and WMDP W15-KVcahe-WMDP-Tools 24course

2025 HIGH-IMPACT PAPERS on this topic

  • a. AgentGym-RL: Training Agents for Long-Horizon Decision Making (September 2025)
    • https://github.com/WooooDyy/LLM-Agent-Paper-List
    • RL version of AgentGym for learning from interactive environments
    • Interactive frontend for trajectory visualization, multi-turn RL

More Readings:


Please click each post's URL shown below to check out its full contents.

6.Tooling / Action for LLM agents

  • Team: connect agents to tools
MCP

Summary of Post :

In this session, our readings cover:

Required Readings:


Please click each post's URL shown below to check out its full contents.

7.Survey - Agents Applications

Application

Summary of Post :

In this session, our readings cover:

Required Readings: AGENT APPLICATIONS

Core Component: Translating Agent Architectures into Real-World Systems

Focus on how agent capabilities are adapted to specific domains and product workflows, including user experience, operational constraints, and measurable impact.

Key Concepts: Domain adaptation, workflow integration, human-in-the-loop design, reliability in production, evaluation in context, compliance and governance, and case studies (software, education, healthcare, finance, science, and robotics)

Topic Slide Deck Previous Semester
Survey: LLMs and Multimodal FMs S1-LLM 24course
Agent - In Healthcare W9.1-HealthAI-agenticHealth 25course
LLM Agents W12-Team2-LLMAgents 24course
  • a. Deep Research: A Survey of Autonomous Research Agents (August 2025)
    • Link: https://arxiv.org/html/2508.12752v1

    Research Agent Architecture:

    • Planning strategies: World model simulation, modular design search, human-like reasoning synthesis, self-refinement
    • World models: LLMs as implicit world models, graph-based structured knowledge
    • Meta-learning: MPO (Meta-Plan Optimization) - adaptive tuning across environments
    • Architecture search: AgentSquare for automatic pipeline assembly

    DeepResearchBench: Evaluates report fidelity, citation accuracy, comprehensive coverage

    Key Challenge: Plan brittleness, lack of robustness to ambiguous queries, evaluation coarseness

  • b. Towards Scientific Intelligence: LLM-based Scientific Agents (2025)
    • Roadmap for scientific discovery with LLM agents
  • c. A Survey of Data Science Agents (Published October 2025)
    • Venue: Journal of the American Statistical Association
    • Link: https://www.tandfonline.com/doi/full/10.1080/00031305.2025.2561140
    • Comprehensive review of LLM agents for data analysis, visualization, ML workflows
  • d. CitySim: Modeling Urban Behaviors with LLM-Driven Agents (2025)
    • Urban simulation using recursive value-driven approach
    • Scalable agent-based modeling for city dynamics
  • e. From Single-Agent to Multi-Agent: Legal Agents Review (November 2025)
    • Venue: AI Agent Journal 2025
    • Link: https://www.oaepublish.com/articles/aiagent.2025.06
    • Core tasks: Legal information retrieval, QA, judgment prediction, text generation
    • Evaluation benchmarks: LAiW (Chinese practical), UCL-Bench (user-centric), JuDGE (judgment documents)
    • Single-agent challenges: Trustworthiness, explainability, factuality
    • Multi-agent systems: Collaborative reasoning, specialized roles (researcher, analyst, writer)
    • Future directions: Cross-jurisdictional interoperability via legal knowledge graphs, ethical governance
  • f. LitMOF: LLM-Driven Multi-Agent Curation of Materials Database (December 2025)
    • Link: https://arxiv.org/abs/2512.01693
    • Problem: Nearly half of Metal-Organic Framework (MOF) database entries contain structural errors
    • Solution: Multi-agent framework validating crystallographic information from literature
    • Results:
      • Curated LitMOF-DB: 118,464 computation-ready structures
      • Corrected 69% (6,161 MOFs) of invalid entries in CoRE MOF database
      • Discovered 12,646 experimentally reported MOFs absent from existing resources
    • Paradigm: Self-correcting scientific databases through LLM-driven curation
  • g. LongVideoAgent: Multi-Agent Reasoning with Long Videos (December 2025)
    • Link: https://arxiv.org/abs/2512.20618 Architecture:
    • Master agent: Coordinates with step limit, trained via RL
    • Grounding agent: Localizes question-relevant segments
    • Vision agent: Extracts targeted textual observations from video
    • Training: Reinforcement learning to encourage concise, correct, efficient cooperation
    • Benchmark: LongTVQA and LongTVQA+ (episode-level datasets from TVQA/TVQA+)
    • Results: Significantly outperforms non-agent baselines on hour-long video reasoning
  • h. CitySim: Modeling Urban Behaviors with LLM-Driven Agents (2025)
    • Urban simulation using recursive value-driven approach
    • Scalable agent-based modeling for city dynamics
    • Applications in urban planning and policy analysis

Please click each post's URL shown below to check out its full contents.

8.Survey - Agents Applications

Application

Summary of Post :


Please click each post's URL shown below to check out its full contents.

9.Agent - in Healthcare

  • Team: Special Agent
Perception Healthcare

Summary of Post :

In this session, our readings cover:

Required Readings: Adapting Agents to Healthcare

Understanding how agents perceive and process actions in a special domain like healthcare

Key Concepts: Domain-specific perception, multimodal input processing, specialized domain understanding (bio, healthcare, robotics)

Topic Slide Deck Previous Semester
Survey - BioScience LLMs W2.2-bioLM 25course
Survey - FMs in Healthcare W3.1-GenAI-healthcare 25course
Survey - FMs in Robotics W3.2-GenAI-Robotics 25course
Multimodal FMs - Video/Audio W12.1.25-multimodalGenAI 25course
Domain Centered FMs W9-T2-domain-LLM 24course
Agent - In Healthcare W9.1-HealthAI-agenticHealth 25course

2025 HIGH-IMPACT PAPERS on this topic

The rise of agentic AI teammates in medicine

  • Perspectives Digital medicine Volume 405, Issue 10477 vp457 February 08, 2025
  • James Zou jamesz@stanford.edu ∙ Eric J Topol
  • Medicine is in the dawn of a fundamental shift from using artificial intelligence (AI) as tools to deploying AI as agents. When used as a tool, AI is passive and reactive. Even powerful medical AI foundation models today remain tools that depend on human users to provide input and context, interpret their output, and take follow-up steps. To fully unlock AI’s potential in medicine, clinicians need to make the key conceptual shift from using AI as sophisticated calculators to embracing AI as health-care teammates.

Lab-in-the-loop therapeutic antibody design with deep learning

  • https://doi.org/10.1101/2025.02.19.639050
  • Therapeutic antibody design is a complex multi-property optimization problem that traditionally relies on expensive search through sequence space. Here, we introduce “Lab-in-the-loop,” a paradigm shift for antibody design that orchestrates generative machine learning models, multi-task property predictors, active learning ranking and selection, and in vitro experimentation in a semiautonomous, iterative optimization loop. By automating the design of antibody variants, property prediction, ranking and selection of designs to assay in the lab, and ingestion of in vitro data, we enable a holistic, end-to-end approach to antibody optimization. We apply lab-in-the-loop to four clinically relevant antigen targets: EGFR, IL-6, HER2, and OSM. Over 1,800 unique antibody variants are designed and tested, derived from lead molecule candidates obtained via animal immunization and state-of-the-art immune repertoire mining techniques. Four lead candidate and four design crystal structures are solved to reveal mechanistic insights into the effects of mutations. We perform four rounds of iterative optimization and report 3–100× better binding variants for every target and ten candidate lead molecules, with the best binders in a therapeutically relevant 100 pM range.
  • All authors are or were employees of Genentech Inc. (a member of the Roche Group) or Roche, and may hold Roche stock or related interests.

More Readings:

Genome modeling and design across all domains of life with Evo 2

  • Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, View ORCID ProfileAditi T. Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Etowah Adams, Stephen A. Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, View ORCID ProfileMohammad R.K. Mofrad, Madelena Y. Ng, Jaspreet Pannu, Christopher Ré, Jonathan C. Schmok, John St. John, Jeremy Sullivan, Kevin Zhu, Greg Zynda, Daniel Balsam, Patrick Collison, Anthony B. Costa, Tina Hernandez-Boussard, Eric Ho, Ming-Yu Liu, Thomas McGrath, Kimberly Powell, Dave P. Burke, View ORCID ProfileHani Goodarzi, View ORCID ProfilePatrick D. Hsu, View ORCID ProfileBrian L. Hie
  • doi: https://doi.org/10.1101/2025.02.18.638918
  • All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling

  • Jesse Zhang, Airol A Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, Aidan Winters, Umair Khan, Matthew G. Jones, Vuong Tran, Joseph Pangallo, Efthymia Papalexi, Ajay Sapre, Hoai Nguyen, Oliver Sanderson, Maria Nigos, Olivia Kaplan, Sarah Schroeder, Bryan Hariadi, Simone Marrujo, Crina Curca Alec Salvino, Guillermo Gallareta Olivares, Ryan Koehler, Gary Geiss, Alexander Rosenberg, Charles Roco, Daniele Merico, Nima Alidoust, View ORCID ProfileHani Goodarzi, View ORCID ProfileJohnny Yu
  • doi: https://doi.org/10.1101/2025.02.20.639398
  • Building predictive models of the cell requires systematically mapping how perturbations reshape each cell’s state, function, and behavior. Here, we present Tahoe-100M, a giga-scale single-cell atlas of 100 million transcriptomic profiles measuring how each of 1,100 small-molecule perturbations impact cells across 50 cancer cell lines. Our high-throughput Mosaic platform, composed of a highly diverse and optimally balanced “cell village”, reduces batch effects and enables parallel profiling of thousands of conditions at single-cell resolution at an unprecedented scale. As the largest single-cell dataset to date, Tahoe-100M enables artificial-intelligence (AI)-driven models to learn context-dependent functions, capturing fundamental principles of gene regulation and network dynamics. Although we leverage cancer models and pharmacological compounds to create this resource, Tahoe-100M is fundamentally designed as a broadly applicable perturbation atlas and supports deeper insights into cell biology across multiple tissues and contexts. By publicly releasing this atlas, we aim to accelerate the creation and development of robust AI frameworks for systems biology, ultimately improving our ability to predict and manipulate cellular behaviors across a wide range of applications.

Structure-based drug design with geometric deep learning

  • https://doi.org/10.1016/j.sbi.2023.102548
  • Structure-based drug design uses three-dimensional geometric information of macromolecules, such as proteins or nucleic acids, to identify suitable ligands. Geometric deep learning, an emerging concept of neural-network-based machine learning, has been applied to macromolecular structures. This review provides an overview of the recent applications of geometric deep learning in bioorganic and medicinal chemistry, highlighting its potential for structure-based drug discovery and design. Emphasis is placed on molecular property prediction, ligand binding site and pose prediction, and structure-based de novo molecular design. The current challenges and opportunities are highlighted, and a forecast of the future of geometric deep learning for drug discovery is presented. Questions answered in this article

  • Structure-based drug design is based on methods that leverage three-dimensional (3D) structures of macromolecular targets, such as proteins and nucleic acids, for decision-making in medicinal chemistry [1,2]. Structure-based modeling is well established throughout the drug discovery process, aiming to rationalize non-covalent interactions between ligands and their target macromolecule(s) [3]. The questions addressed with structure-based approaches include molecular property prediction, ligand binding site recognition, binding pose estimation, as well as de novo design [4, 5, 6, 7]. For such tasks, detailed knowledge of the 3D structure of the investigated macromolecular surfaces and ligand–receptor interfaces is essential. Recently, an emerging concept of neural-network-based “artificial intelligence”, geometric deep learning, has been introduced to solve numerous problems in the molecular sciences, including structure-based drug discovery and design [8].

Generative models for molecular discovery: Recent advances and challenges

  • Camille Bilodeau, Wengong Jin, Tommi Jaakkola, Regina Barzilay, Klavs F. Jensen
  • 05 March 2022 https://doi.org/10.1002/wcms.1608Citations
  • Development of new products often relies on the discovery of novel molecules. While conventional molecular design involves using human expertise to propose, synthesize, and test new molecules, this process can be cost and time intensive, limiting the number of molecules that can be reasonably tested. Generative modeling provides an alternative approach to molecular discovery by reformulating molecular design as an inverse design problem. Here, we review the recent advances in the state-of-the-art of generative molecular design and discusses the considerations for integrating these models into real molecular discovery campaigns. We first review the model design choices required to develop and train a generative model including common 1D, 2D, and 3D representations of molecules and typical generative modeling neural network architectures. We then describe different problem statements for molecular discovery applications and explore the benchmarks used to evaluate models based on those problem statements. Finally, we discuss the important factors that play a role in integrating generative models into experimental workflows. Our aim is that this review will equip the reader with the information and context necessary to utilize generative modeling within their domain.

DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

  • [Submitted on 4 Oct 2022 (v1), last revised 11 Feb 2023 (this version, v2)]
  • Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola
  • Predicting the binding structure of a small molecule ligand to a protein – a task known as molecular docking – is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD<2A) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, while previous methods are not able to dock on computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher precision (21.7%). Finally, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.
  • Comments: International Conference on Learning Representations (ICLR 2023)

Background Readings:

Highly accurate protein structure prediction with AlphaFold

  • Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort1,2,3,4, the structures of around 100,000 unique proteins have been determined5, but this represents a small fraction of the billions of known protein sequences6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’8—has been an important open research problem for more than 50 years9. Despite recent progress10,11,12,13,14, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)15, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

Evolutionary-scale prediction of atomic-level protein structure with a language model

  • Speedy structures from single sequences: Machine learning methods for protein structure prediction have taken advantage of the evolutionary information present in multiple sequence alignments to derive accurate structural information, but predicting structure accurately from a single sequence is much more difficult. Lin et al. trained transformer protein language models with up to 15 billion parameters on experimental and high-quality predicted structures and found that information about atomic-level structure emerged in the model as it was scaled up. They created ESMFold, a sequence-to-structure predictor that is nearly as accurate as alignment-based methods and considerably faster. The increased speed permitted the generation of a database, the ESM Metagenomic Atlas, containing more than 600 million metagenomic proteins. —MAF
  • Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.

Accurate prediction of protein structures and interactions using a three-track neural network

  • Science 19 Aug 2021
  • Deep learning takes on protein folding: In 1972, Anfinsen won a Nobel prize for demonstrating a connection between a protein’s amino acid sequence and its three-dimensional structure. Since 1994, scientists have competed in the biannual Critical Assessment of Structure Prediction (CASP) protein-folding challenge. Deep learning methods took center stage at CASP14, with DeepMind’s Alphafold2 achieving remarkable accuracy. Baek et al. explored network architectures based on the DeepMind framework. They used a three-track network to process sequence, distance, and coordinate information simultaneously and achieved accuracies approaching those of DeepMind. The method, RoseTTA fold, can solve challenging x-ray crystallography and cryo–electron microscopy modeling problems and generate accurate models of protein-protein complexes. —VV

  • DeepMind presented notably accurate predictions at the recent 14th Critical Assessment of Structure Prediction (CASP14) conference. We explored network architectures that incorporate related ideas and obtained the best performance with a three-track network in which information at the one-dimensional (1D) sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging x-ray crystallography and cryo–electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.

Transformer protein language models are unsupervised structure learners

  • https://doi.org/10.1101/2020.12.15.422761
  • Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find the highest capacity models that have been trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.1

PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding

  • [Submitted on 5 Jun 2022 (v1), last revised 19 Sep 2022 (this version, v2)]
  • Minghao Xu, Zuobai Zhang, Jiarui Lu, Zhaocheng Zhu, Yangtian Zhang, Chang Ma, Runcheng Liu, Jian Tang
  • We are now witnessing significant progress of deep learning methods in a variety of tasks (or datasets) of proteins. However, there is a lack of a standard benchmark to evaluate the performance of different methods, which hinders the progress of deep learning in this field. In this paper, we propose such a benchmark called PEER, a comprehensive and multi-task benchmark for Protein sEquence undERstanding. PEER provides a set of diverse protein understanding tasks including protein function prediction, protein localization prediction, protein structure prediction, protein-protein interaction prediction, and protein-ligand interaction prediction. We evaluate different types of sequence-based methods for each task including traditional feature engineering approaches, different sequence encoding methods as well as large-scale pre-trained protein language models. In addition, we also investigate the performance of these methods under the multi-task learning setting. Experimental results show that large-scale pre-trained protein language models achieve the best performance for most individual tasks, and jointly training multiple tasks further boosts the performance. The datasets and source codes of this benchmark are all available at this https URL Comments: Accepted by NeurIPS 2022 Dataset and Benchmark Track. a

Please click each post's URL shown below to check out its full contents.

10.Agent - Perception

  • Team: Input module for LLM agents
Perception

Summary of Post :

In this session, our readings cover:

Required Readings:

More Readings:


Please click each post's URL shown below to check out its full contents.

11.Agent Brain - Reasoning

  • Team: world model
Reasoning

Summary of Post :

In this session, our readings cover:

Required Readings: REASONING & COGNITION

Core Component: Advanced Reasoning Capabilities of the Agent Brain

Exploring how agents reason through complex problems, including code generation, mathematical reasoning, and domain-specific reasoning.

Key Concepts: Chain-of-thought reasoning, code generation, mathematical reasoning, self-examination, test-time compute scaling

Topic Slide Deck Previous Semester
Advanced LLM - Code Reasoning W4.1-Gen AI-code 25course
Advanced LLM - Math Reasoning W4.2-LLM-Math-Reasoning 25course
Inference Test Time Scaling Law Week14.1-T5-Test-Time-Scaling 25course
Self-exam LLM and Reasoning W12-team-2-self-exam-LLM 24course

2025 HIGH-IMPACT PAPERS on this topic

  • a. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025)
    • Authors: DeepSeek-AI (198 authors)
    • Venue: Nature (September 2025) + arXiv
    • arXiv: https://arxiv.org/abs/2501.12948
    • Nature: https://www.nature.com/articles/s41586-025-09422-z
    • HuggingFace: https://huggingface.co/papers/2501.12948
    • GitHub: https://github.com/deepseek-ai/DeepSeek-R1
    • Pure RL approach - Shows reasoning emerges without supervised demonstrations
    • Remarkable results: AIME 2024 accuracy jumped from 15.6% → 71.0% (pass@1) → 86.7% (majority voting), matching OpenAI o1
    • Emergent behaviors: Self-reflection, verification, strategy adaptation, “aha moments”
    • Open source: Released models from 1.5B to 671B parameters
    • Industry impact: Triggered the “reasoning model” race across all major labs
    • Key Innovation: Demonstrates that advanced reasoning patterns emerge naturally through GRPO (Group Relative Policy Optimization) without human-labeled trajectories. The paper shows thinking time scales with performance - agents learn to “think longer” for harder problems.
  • b. Reasoning Language Models: A Blueprint (January 2025)
    • https://arxiv.org/abs/2501.11223
    • Reinforcement learning approaches for reasoning
    • Connects to DeepSeek-R1, Kimi k1.5, and other reasoning models
    • Comprehensive taxonomy of RLVR (Reinforcement Learning with Verifiable Rewards)
    • Discusses emergent reasoning patterns and distillation to smaller models
  • c. Kimi k1.5: Scaling Reinforcement Learning with LLMs (January 2025)
    • Link: https://arxiv.org/abs/2501.12599

    Contribution: Alternative approach to scaling reasoning via RL

    • Complements DeepSeek-R1 with different architectural choices
    • Emphasizes scaling strategies for RL training
    • Addresses computational efficiency in large-scale RL

More Readings:


Please click each post's URL shown below to check out its full contents.

12.Agent Brain - Reasoning

  • Team: world model
Reasoning

Summary of Post :

In this session, our readings cover:

Required Readings:

More Readings:

Large Language Models for Mathematical Reasoning: Progresses and Challenges

  • Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, Wenpeng Yin
  • [Submitted on 31 Jan 2024 (v1), last revised 16 Sep 2024 (this version, v4)]
  • Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

A Survey of Deep Learning for Mathematical Reasoning

  • Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, Kai-Wei Chang
  • [Submitted on 20 Dec 2022 (v1), last revised 22 Jun 2023 (this version, v2)]
  • Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineering, finance, and everyday life. The development of artificial intelligence (AI) systems capable of solving math problems and proving theorems has garnered significant interest in the fields of machine learning and natural language processing. For example, mathematics serves as a testbed for aspects of reasoning that are challenging for powerful deep learning models, driving new algorithmic and modeling advances. On the other hand, recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning. In this survey paper, we review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade. We also evaluate existing benchmarks and methods, and discuss future research directions in this domain.

Please click each post's URL shown below to check out its full contents.

13.Agent - Memory

  • Team: Memory system for LLM agents
Context

Summary of Post :

In this session, our readings cover:

Required Readings: MEMORY SYSTEMS

Exploring how agents maintain, retrieve, and use information across interactions.

Core Component: Agent Memory Architecture - Context, Knowledge, and Persistence Key Concepts: RAG systems, long-term vs short-term memory, context window management, knowledge augmentation, hallucination mitigation, model editing

Topic Slide Deck Previous Semester
Platform - Context Construction via RAG and Agent W5.2.Team6-RAGagent 25course
Platform - Long Context vs RAG + Hallucination W9.2-Team2-longContext 25course
Knowledge Augmented FMs W8-T1-KnowledgeAugmentedFMs.pdf 24course
LLM Hallucination W9-Team3-P4-hallucination 24course

2025 HIGH-IMPACT PAPERS on this topic

  • a. Memory in the Age of AI Agents: A Survey (2025)
    • GitHub Repository: https://github.com/Shichun-Liu/Agent-Memory-Paper-List
    • Comprehensive Coverage of Memory Systems:
      • MIRIX: Multi-Agent Memory System (July 2025)
      • Hierarchical Memory: Efficient long-term reasoning (July 2025)
      • G-Memory: Tracing memory for multi-agent systems (June 2025)
      • MemGuide: Intent-driven memory selection (May 2025)
      • EverMemOS: Self-organizing memory operating system (January 2026)
      • Key Distinction: Agent memory vs LLM memory vs RAG vs context engineering
    • Major Papers:
      • A-MEM: Agentic Memory for LLM Agents (Feb 2025)
      • WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning (Dec 2025)
      • CAM: Constructivist View of Agentic Memory (Oct 2025)

More Readings:


Please click each post's URL shown below to check out its full contents.

14.Agent - Memory

  • Team: Memory system for LLM agents
Context

Summary of Post :

In this session, our readings cover:

Required Readings:

More Readings:


Please click each post's URL shown below to check out its full contents.

15.Model Serving for Agents

  • Team: Agents with efficient model serving
Efficiency

Summary of Post :

In this session, our readings cover:

Readings: DEPLOYMENT & SERVING

Core Component: Production Infrastructure - Deploying and Serving Agents at Scale

Understanding the infrastructure and systems for deploying agents in production. Key Concepts: Model serving systems, vLLM, KV cache optimization, inference efficiency, chunked prefill, monitoring and interpretability

Topic Slide Deck Previous Semester
Platform - Model Serving W8.2-Model Serving-team6-t5 25course
More Model Serving - SGlang + Chunked Prefill W12.2-Model-Serving 25course
Model Serving - Efficiency Inference W14.2.ModelServing 25course
Model Interpretability for FM W13.2-GenAI-Interpretability 25course
LLM Interpretability, Trust and Knowledge Conflicts W10-T6-LLMInterpretibility 24course

Multiple system ML readings

  • [Scheduling] Chunked Prefill (OSDI’24): This is perhaps the most widely adopted scheduling policy in today’s LLM serving systems, which proposes a simple, straightforward idea but works very well. Since it is optimized from Continuous Batching (OSDI’22).
  • [Disaggregated Serving] Splitwise (ISCA’24) / DistServe (OSDI’24): These two papers share a similar idea, separating prefill/decode across different nodes based on stage-specific characteristics. These are also intuitive ideas and are being merged into vLLM.
  • [KV Cache, Tooling] SGLang (NIPS’24): It is a widely used serving framework, an alternative to vLLM. Or, it is more like a programming language tailored to LLM application developers, greatly simplifying the code they need to write. At the core of it is RadixAttention designed for efficient KV cache reuse.
  • [Disaggregated Serving] Helix (ASPLOS’25): This proposes an optimized LLM sharding strategy in a heterogenous cluster to achieve optimal resource allocation.- Disaggregated Serving] ServerlessLLM (OSDI’24): This proposes an efficient live migration of LLM inference on the cloud without losing efficiency.
  • [Scheduling] SJF (NIPS’24): This proposes a statistics-based online algorithm to approximate shortest-job-first scheduling in online LLM inference.
  • [Offloading] FlexGen (ICML’23): This proposes the first offloading strategy specifically for inference systems.

Auditing Prompt Caching in Language Model APIs

  • [Submitted on 11 Feb 2025]
  • https://arxiv.org/abs/2502.07776
  • Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
  • Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users’ prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users’ prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI’s embedding model is a decoder-only Transformer, which was previously not publicly known.

More Readings :

Orca: A Distributed Serving System for Transformer-Based Generative Models

  • Continuous Batching: https://www.usenix.org/system/files/osdi22-yu.pdf
  • Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University
  • Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have recently attracted huge interest, emphasizing the need for system support for serving models in this family. Since these models generate a next token in an autoregressive manner, one has to run the model multiple times to process an inference request where each iteration of the model generates a single output token for the request. However, existing systems for inference serving do not perform well on this type of workload that has a multi-iteration characteristic, due to their inflexible scheduling mechanism that cannot change the current batch of requests being processed; requests that have finished earlier than other requests in a batch cannot return to the client, while newly arrived requests have to wait until the current batch completely finishes. In this paper, we propose iteration-level scheduling, a new scheduling mechanism that schedules execution at the granularity of iteration (instead of request) where the scheduler invokes the execution engine to run only a single iteration of the model on the batch. In addition, to apply batching and iteration-level scheduling to a Transformer model at the same time, we suggest selective batching, which applies batching only to a selected set of operations. Based on these two techniques, we have implemented a distributed serving system called ORCA, with additional designs for scalability to models with hundreds of billions of parameters. Our evaluation on a GPT-3 175B model shows that ORCA can significantly outperform NVIDIA FasterTransformer in terms of both latency and throughput: 36:9× throughput improvement at the same level of latency.

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

  • FlexGen: https://arxiv.org/pdf/2303.06865 +[Submitted on 13 Mar 2023 (v1), last revised 12 Jun 2023 (this version, v2)]
  • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
  • The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at this https URL

Neo: https://arxiv.org/pdf/2411.01142

  • [Submitted on 2 Nov 2024]
  • NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
  • Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu
  • Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLM models (i.e., 7B, 8B, 70B). NEO achieves up to 7.5×, 26%, and 14% higher throughput compared to GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to 79.3% throughput gain on A10G GPU.

Shortest Job First: https://arxiv.org/pdf/2408.15792

  • [Submitted on 28 Aug 2024]
  • Efficient LLM Scheduling by Learning to Rank
  • Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
  • In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption – we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at this https URL

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

  • [Submitted on 1 Aug 2024 (v1), last revised 14 Oct 2024 (this version, v2)]
  • Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
  • While the scaling laws of large language models (LLMs) training have been extensively studied, optimal inference configurations of LLMs remain underexplored. We study inference scaling laws and compute-optimal inference, focusing on the trade-offs between model sizes and generating additional tokens with different inference strategies. As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies such as greedy search, majority voting, best-of-n, weighted voting, and two different tree search algorithms, using different model sizes and compute budgets. Our findings indicate smaller models (e.g., Llemma-7B) can outperform larger models given the same computation budgets, and that smaller models paired with advanced inference algorithms yield Pareto-optimal cost-performance trade-offs. For instance, the Llemma-7B model, equipped with our novel tree search algorithm, consistently outperforms Llemma-34B with standard majority voting on the MATH benchmark across all FLOPs budgets. We hope these findings contribute to a broader understanding of inference scaling laws for LLMs.

Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing

  • KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
  • With the rapid development of LLMs, it is natural to ask how to harness their capabilities efficiently. In this paper, we explore whether it is feasible to direct each input query to a single most suitable LLM. To this end, we propose LLM routing for challenging reasoning tasks. Our extensive experiments suggest that such routing shows promise but is not feasible in all scenarios, so more robust approaches should be investigated to fill this gap.

Please click each post's URL shown below to check out its full contents.

16.Model Serving for Agents

  • Team: Agents with efficient model serving
Efficiency

Summary of Post :

In this session, our readings cover:

Required Readings:

More reading:


Please click each post's URL shown below to check out its full contents.

17.Agent Evaluation

  • Team: Benchmarks for evaluating LLM agents
Benchmarks

Summary of Post :

In this session, our readings cover:

Required Readings: Agent Benchmarking and Benmarks

  • OSWorld Leaderboard: https://os-world.github.io/ (Industry standard for computer-use evaluation)
  • WebArena Project: https://webarena.dev/ (Foundational for web agent development)
  • AgentBench GitHub: https://github.com/THUDM/AgentBench

  • a. Evaluation and Benchmarking of LLM Agents: A Survey (July 2025)
    • Link: https://arxiv.org/html/2507.21504v1
    • Comprehensive taxonomy: Evaluation objectives (behavior, capabilities, reliability, safety) × evaluation process (interaction modes, datasets, metrics, tooling, environments)
    • Enterprise focus: Role-based access control, reliability guarantees, long-term interaction, compliance
    • Novel metrics: Consistency (pass@k vs all-k), robustness under input variations
  • b. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (April 2024, Major Updates 2025)
    • arXiv: https://arxiv.org/abs/2404.07972
    • Project: https://os-world.github.io/
    • HuggingFace: https://huggingface.co/spaces/xlanglab/OSWorld
    • First real computer environment benchmark (Ubuntu, Windows, macOS)
    • 369 tasks across real web/desktop apps, file I/O, cross-app workflows
    • Execution-based evaluation with custom scripts per task
    • State-of-the-art results (2025): OpenAI Operator 38%, best open-source ~24%
    • Reveals massive gap between current capabilities and human performance
    • Industry Impact: Became the standard for evaluating computer-use agents (Claude Computer Use, OpenAI Operator, etc.)
  • c. WebArena: A Realistic Web Environment for Building Autonomous Agents (July 2023, Extensive 2025 Extensions)
    • arXiv: https://arxiv.org/abs/2307.13854
    • Project: https://webarena.dev/
    • Extensions: WebChoreArena, ST-WebAgentBench
    • Record performance: IBM CUGA achieved 61.7% (vs 14% in 2023)
    • 812 templated tasks across e-commerce, forums, code repositories, CMS
    • Extensions:
      • WebChoreArena: 532 tedium-focused tasks (top models: 37.8%)
      • ST-WebAgentBench: Safety/trust templates, policy compliance metrics
    • Key insights: Success driven by Planner-Executor-Memory architecture + specialized training data
  • d. AgentBench: Evaluating LLMs as Agents (August 2023, Updated 2025)
    • Venue: ICLR 2024
    • arXiv: https://arxiv.org/abs/2308.03688
    • GitHub: https://github.com/THUDM/AgentBench

    Comprehensive Coverage:

    • 8 environments: Code, game playing, web shopping, digital card games, lateral thinking, household tasks, web browsing, OS interaction
    • Multi-dimensional evaluation: Breadth across domains reveals agent weak spots
    • Function-calling version (2025): Integrated with AgentRL framework
    • VisualAgentBench: Extension for multimodal agents (5 environments, 17 LMMs tested)

More Readings:

New GenAI simulation and evaluation tools in Azure AI Studio

  • https://techcommunity.microsoft.com/blog/aiplatformblog/new-genai-simulation-and-evaluation-tools-in-azure-ai-studio/4253020

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

  • Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu
  • [Submitted on 7 Dec 2024 (v1), last revised 10 Dec 2024 (this version, v2)]
  • The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as ‘‘LLMs-as-judges’’. This framework has attracted growing attention from both academia and industry due to their excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim aims to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at this https URL.

Beyond Benchmarks: On The False Promise of AI Regulation

  • [Submitted on 26 Jan 2025] Gabriel Stanovsky, Renana Keydar, Gadi Perl, Eliya Habba The rapid advancement of artificial intelligence (AI) systems in critical domains like healthcare, justice, and social services has sparked numerous regulatory initiatives aimed at ensuring their safe deployment. Current regulatory frameworks, exemplified by recent US and EU efforts, primarily focus on procedural guidelines while presuming that scientific benchmarking can effectively validate AI safety, similar to how crash tests verify vehicle safety or clinical trials validate drug efficacy. However, this approach fundamentally misunderstands the unique technical challenges posed by modern AI systems. Through systematic analysis of successful technology regulation case studies, we demonstrate that effective scientific regulation requires a causal theory linking observable test outcomes to future performance - for instance, how a vehicle’s crash resistance at one speed predicts its safety at lower speeds. We show that deep learning models, which learn complex statistical patterns from training data without explicit causal mechanisms, preclude such guarantees. This limitation renders traditional regulatory approaches inadequate for ensuring AI safety. Moving forward, we call for regulators to reckon with this limitation, and propose a preliminary two-tiered regulatory framework that acknowledges these constraints: mandating human oversight for high-risk applications while developing appropriate risk communication strategies for lower-risk uses. Our findings highlight the urgent need to reconsider fundamental assumptions in AI regulation and suggest a concrete path forward for policymakers and researchers.

Please click each post's URL shown below to check out its full contents.

18.Agent Safety

  • Team: safety for agent LLM
Jailbreaking Safety

Summary of Post :

Required Readings: RISK, SAFETY, Evaluation & GUARDRAILS

Core Component: Agent Safety Systems - Ensuring Reliable, Ethical, and Secure Operation

Addressing safety, alignment, and ethical considerations in agent deployment.

Topic Slide Deck Previous Semester
Platform - Model Jailbreaking / Safeguarding W7.1-team3-jailbreak 25course
Platform - VLM Jailbreaking / Probing W7.2-team4-MMJailbreak-garak 25course
Agent Safety W10.2-team4-agent-safety 25course
LLM Evaluating Framework W3-LLMEvaluation-Team5 24course
GenAI Guardrails W3-Guardrail-Team3 24course
Survey: Human Alignment W4-LLM-Human-Alignment 24course
Survey: AI Risk Framework W5-AI-RiskFramework 24course
FM Copyright Infringement W5-FM-copyright-infrigement 24course
FM Privacy Leakage Issues W6-FM-privacy-leakage 24course
FM Fairness / Bias Issues W6-LLM-Bias-Fairness-Team5 24course
FM Toxicity / Harmful Outputs W7-LLM-harm 24course
LLM Multimodal Harm Responses W7-multimodal-LLMharm 24course
More FM Risk / Extra - Agent Guardrailing W8-Team3-P3-moreRisk.pdf 25course

More Readings:

The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies

  • [Submitted on 28 Jul 2024]
  • Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, Philip S. Yu Inspired by the rapid development of Large Language Models (LLMs), LLM agents have evolved to perform complex tasks. LLM agents are now extensively applied across various domains, handling vast amounts of data to interact with humans and execute tasks. The widespread applications of LLM agents demonstrate their significant commercial value; however, they also expose security and privacy vulnerabilities. At the current stage, comprehensive research on the security and privacy of LLM agents is highly needed. This survey aims to provide a comprehensive overview of the newly emerged privacy and security issues faced by LLM agents. We begin by introducing the fundamental knowledge of LLM agents, followed by a categorization and analysis of the threats. We then discuss the impacts of these threats on humans, environment, and other agents. Subsequently, we review existing defensive strategies, and finally explore future trends. Additionally, the survey incorporates diverse case studies to facilitate a more accessible understanding. By highlighting these critical security and privacy issues, the survey seeks to stimulate future research towards enhancing the security and privacy of LLM agents, thereby increasing their reliability and trustworthiness in future applications.

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

  • Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales
  • [Submitted on 3 Feb 2024 (v1), last revised 17 Jun 2024 (this version, v2)]
  • Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning, and that VLLM fine-tuning can cause forgetting of safety alignment previously learned by the underpinning LLM. To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories. Our experiments demonstrate that integrating this dataset into standard vision-language fine-tuning or utilizing it for post-hoc fine-tuning effectively safety aligns VLLMs. This alignment is achieved with minimal impact on, or even enhancement of, the models’ helpfulness. The versatility of our safety fine-tuning dataset makes it a valuable resource for safety-testing existing VLLMs, training new models or safeguarding pre-trained VLLMs. Empirical results demonstrate that fine-tuned VLLMs effectively reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks, which approach zero in many cases. The code and dataset are available at this https URL.

Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey

  • [Submitted on 12 Jun 2024 (v1), last revised 18 Jun 2024 (this version, v2)]
  • Shang Wang, Tianqing Zhu, Bo Liu, Ming Ding, Xu Guo, Dayong Ye, Wanlei Zhou, Philip S. Yu
  • With the rapid development of artificial intelligence, large language models (LLMs) have made remarkable advancements in natural language processing. These models are trained on vast datasets to exhibit powerful language understanding and generation capabilities across various applications, including machine translation, chatbots, and agents. However, LLMs have revealed a variety of privacy and security issues throughout their life cycle, drawing significant academic and industrial attention. Moreover, the risks faced by LLMs differ significantly from those encountered by traditional language models. Given that current surveys lack a clear taxonomy of unique threat models across diverse scenarios, we emphasize the unique privacy and security threats associated with five specific scenarios: pre-training, fine-tuning, retrieval-augmented generation systems, deployment, and LLM-based agents. Addressing the characteristics of each risk, this survey outlines potential threats and countermeasures. Research on attack and defense situations can offer feasible research directions, enabling more areas to benefit from LLMs.

Large Language Model Safety: A Holistic Survey

  • Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, Ling Shi, Bojian Jiang, Deyi Xiong
  • [Submitted on 23 Dec 2024]
  • The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academy researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. A curated list of related papers has been publicly available at this https URL.

MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

  • https://arxiv.org/pdf/2410.17520
  • [Submitted on 23 Oct 2024 (v1), last revised 10 Dec 2024 (this version, v2)]
  • Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
  • Autonomous agents powered by large language models (LLMs) show promising potential in assistive tasks across various domains, including mobile device control. As these agents interact directly with personal information and device settings, ensuring their safe and reliable behavior is crucial to prevent undesirable outcomes. However, no benchmark exists for standardized evaluation of the safety of mobile device-control agents. In this work, we introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents within a realistic mobile environment based on Android emulators. We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications, challenging agents with managing risks encompassing misuse and negative side effects. These tasks include tests to evaluate the safety of agents in daily scenarios as well as their robustness against indirect prompt injection attacks. Our experiments demonstrate that baseline agents, based on state-of-the-art LLMs, often fail to effectively prevent harm while performing the tasks. To mitigate these safety concerns, we propose a prompting method that encourages agents to prioritize safety considerations. While this method shows promise in promoting safer behaviors, there is still considerable room for improvement to fully earn user trust. This highlights the urgent need for continued research to develop more robust safety mechanisms in mobile environments. We open-source our benchmark at: this https URL.

Privacy-Preserving Large Language Models: Mechanisms, Applications, and Future Directions

  • Guoshenghui Zhao, Eric Song
  • [Submitted on 9 Dec 2024]
  • The rapid advancement of large language models (LLMs) has revolutionized natural language processing, enabling applications in diverse domains such as healthcare, finance and education. However, the growing reliance on extensive data for training and inference has raised significant privacy concerns, ranging from data leakage to adversarial attacks. This survey comprehensively explores the landscape of privacy-preserving mechanisms tailored for LLMs, including differential privacy, federated learning, cryptographic protocols, and trusted execution environments. We examine their efficacy in addressing key privacy challenges, such as membership inference and model inversion attacks, while balancing trade-offs between privacy and model utility. Furthermore, we analyze privacy-preserving applications of LLMs in privacy-sensitive domains, highlighting successful implementations and inherent limitations. Finally, this survey identifies emerging research directions, emphasizing the need for novel frameworks that integrate privacy by design into the lifecycle of LLMs. By synthesizing state-of-the-art approaches and future trends, this paper provides a foundation for developing robust, privacy-preserving large language models that safeguard sensitive information without compromising performance.

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

  • Zhao Xu, Fan Liu, Hao Liu
  • [Submitted on 13 Jun 2024 (v1), last revised 6 Nov 2024 (this version, v3)]
  • Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we introduced JailTrickBench to evaluate the impact of various attack settings on LLM performance and provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at this https URL.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

  • Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
  • [Submitted on 5 Jul 2024 (v1), last revised 30 Aug 2024 (this version, v2)]
  • Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of “jailbreaking”, which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.

Safeguarding Large Language Models: A Survey

  • [Submitted on 3 Jun 2024]
  • Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang
  • In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as “safeguards” or “guardrails”, has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review on the current status of this critical mechanism. It discusses its major challenges and how it can be enhanced into a comprehensive mechanism dealing with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. This is followed by the techniques to evaluate, analyze, and enhance some (un)desirable properties that a guardrail might want to enforce, such as hallucinations, fairness, privacy, and so on. Based on them, we review techniques to circumvent these controls (i.e., attacks), to defend the attacks, and to reinforce the guardrails. While the techniques mentioned above represent the current status and the active research trends, we also discuss several challenges that cannot be easily dealt with by the methods and present our vision on how to implement a comprehensive guardrail through the full consideration of multi-disciplinary approach, neural-symbolic method, and systems development lifecycle.

Jailbreaking LLM-Controlled Robots

  • [Submitted on 17 Oct 2024 (v1), last revised 9 Nov 2024 (this version, v2)]
  • Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, George J. Pappas
  • The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a stand-alone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce RoboPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing, textual attacks on LLM chatbots, RoboPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that RoboPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: this https URL

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

  • Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
  • Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model’s worst-possible-case behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations than input-space attacks alone. We release models at this https URL

Please click each post's URL shown below to check out its full contents.

19.Agent - Planning / Test test scaling

  • Team: Agents planning
Planning

Summary of Post :

In this session, our readings cover:

Required Readings: PLANNING & ORCHESTRATION

Core Component: Agent Planning Module - Goal Decomposition and Strategy Formation

How agents break down complex tasks, form plans, and orchestrate multi-step workflows, leveraging world models when available. Key Concepts: Task decomposition, planning algorithms (with/without world models), agent workflows, domain-specific planning strategies, plan-then-act vs. continuous replanning

Topic Slide Deck Previous Semester
Agent - Planning / World Model W10.1-Team 3-Planning 25course
Test time scaling Week14.1-T5-Test-Time-Scaling 25course
Platform - Prompting Engineering Tools / Compression W5.1.Team5-Prompt 25course
Prompt Engineering W11-team-2-prompt-engineering-2 24course
LLM Alignment - PPO W11.2-team6-PPO 25course
LLM Post-training W14.3.DPO 25course
Scaling Law and Efficiency W11-ScalinglawEfficientLLM 24course
LLM Fine Tuning W14-LLM-FineTuning 24course

2025 HIGH-IMPACT PAPERS on this topic

  • a. The Landscape of Agentic Reinforcement Learning for LLMs (September 2025)
    • Referenced in: https://github.com/zjunlp/LLMAgentPapers
    • Taxonomy of agentic RL approaches
    • Training methods: GRPO, PPO variations, RLVR
    • Policy optimization: Group-in-Group, Stepwise Progress Attribution (SPA-RL)
    • Challenges: Reward hacking, sample efficiency, exploration-exploitation
    • Applications: Reasoning, planning, multi-agent coordination
    • Key Papers Covered:
      • GRPO (Group Relative Policy Optimization)
      • History Resampling Policy Optimization (SRPO)
      • PVPO (Pre-Estimated Value-Based Policy Optimization)
  • a. EnCompass: Separating Search from Agent Workflows (December 2025)
    • arXiv: https://arxiv.org/abs/2512.03571
    • Press: https://techxplore.com/news/2025-12-ai-agents-results-large-language.html Key Innovation: Separates search strategy from workflow code
    • Performance: 15-40% accuracy boost on code repository translation
    • Search strategies: Backtracking, parallel exploration, beam search (best: two-level beam search)

    Use Cases: Code translation, digital grid transformation rules

  • b. Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling (December 2025)
    • Link: https://arxiv.org/abs/2512.14474

    Two-Phase Paradigm:

    1. Modeling Phase: LLM constructs explicit model (entities, state variables, actions, constraints)
    2. Solution Phase: Generate plan based on explicit model
      • Reduces constraint violations across medical scheduling, route planning, resource allocation, logic puzzles
      • Outperforms Chain-of-Thought and ReAct
      • Critical finding: Many planning failures stem from representational deficiencies, not reasoning limitations

    Domains Tested: Medical scheduling, route planning, resource allocation, logic puzzles, procedural synthesis

More Readings:

Agent Planning with World Knowledge Model

  • Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
  • [Submitted on 23 May 2024 (v1), last revised 3 Jan 2026 (this version, v4)]
  • NeurIPS 2024
  • Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the ``real’’ physical world. Imitating humans’ mental world knowledge model which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper, we introduce parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. Besides, we analyze to illustrate that our WKM can effectively alleviate the blind trial-and-error and hallucinatory action issues, providing strong support for the agent’s understanding of the world. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development. The code is available at this https URL. Comments: NeurIPS 2024

Please click each post's URL shown below to check out its full contents.

20.Agent - Planning

  • Team: Agents planning
Planning

Summary of Post :

In this session, our readings cover:

Required Readings:

More Readings:


Please click each post's URL shown below to check out its full contents.

21.Agent - World model

  • Team: Understanding enviroments for Agents
Multimodal World model

Summary of Post :

In this session, our readings cover:

Required Readings: WORLD MODELS & ENVIRONMENT UNDERSTANDING

Core Component: Internal Representations - How Agents Model Their Environment

World models enable agents to build internal representations of their environment, predict outcomes, and simulate consequences before taking action. This bridges perception and planning.

Key Concepts: Environment modeling, state representation, predictive models, simulation-based planning, model-based reasoning

World Model Role in Agent Architecture:

  • Input: Receives data from Perception (Phase 3) and Memory (Phase 4)
  • Function: Builds internal representation of environment dynamics and causal relationships
  • Output: Informs Planning (Phase 7) by enabling agents to predict action consequences
  • Use Cases: Robotics, game playing, strategic decision-making, healthcare interventions
Topic Slide Deck Previous Semester
Agent - Planning / World Model W10.1-Team 3-Planning 25course

2025 HIGH-IMPACT PAPERS on this topic

  • b. DreamerV3: Mastering Diverse Control Tasks through World Models
    • Nature (April 2025) / arXiv GitHub
    • A general reinforcement-learning algorithm that outperforms specialized expert algorithms across diverse tasks by learning a model of the environment and improving its behaviour by imagining future scenarios.
    • Dreamer succeeds across domains ranging from robot locomotion and manipulation tasks over Atari games, procedurally generated ProcGen levels, and DMLab tasks to the complex and infinite world of Minecraft.
    • First algorithm to collect diamonds in Minecraft from scratch without human data or curricula
    • Uses Recurrent State-Space Model (RSSM) for latent imagination and planning
  • c. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
    • arXiv GitHub Meta AI
    • The first world model trained on video that achieves state-of-the-art visual understanding and prediction, enabling zero-shot robot control in new environments.
    • Post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset enables zero-shot deployment on Franka arms without collecting any data from those environments.
    • V-JEPA 2-AC achieves reach = 100%, manipulation = 60–80% compared to Cosmos’s reach = 80%, manipulation = 0–20%, while being 15× faster (16 seconds/action vs 4 minutes).
    • Predicts in representation space rather than pixel space—key innovation for efficient planning
  • c. NVIDIA Cosmos: World Foundation Model Platform for Physical AI
    • NVIDIA Cosmos Technical Report
    • Open world foundation models (WFMs), guardrails, and data processing libraries to accelerate the development of physical AI for autonomous vehicles (AVs), robots, and video analytics AI agents.
    • WFMs are purpose-built for physical AI research and development, and can generate physics-based videos from a combination of inputs, like text, image and video, as well as robot sensor or motion data.
    • Cosmos Reason—a new open, customizable, 7-billion-parameter reasoning VLM for physical AI and robotics—lets robots and vision AI agents reason like humans using prior knowledge, physics understanding and common sense.
    • Early adopters include 1X, Agility Robotics, Figure AI, Skild AI, Boston Dynamics
  • d. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
    • DeepMind Blog
    • RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
    • Thanks to its VLM backbone, RT-2 can plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
    • Uses PaLM-E and PaLI-X backbones; demonstrates chain-of-thought reasoning for multi-stage semantic reasoning

More Readings:

Video Understanding with Large Language Models: A Survey

  • Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
  • [Submitted on 29 Dec 2023 (v1), last revised 24 Jul 2024 (this version, v4)]
  • With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at this https URL.

Please click each post's URL shown below to check out its full contents.

22.Agent - World model

  • Team: Understanding enviroments for Agents
Multimodal World model

Summary of Post :


Please click each post's URL shown below to check out its full contents.

23.Agent - Multiagent collaboration

  • Team: Multi-Agents
Multiagent

Summary of Post :

In this session, our readings cover:

Required Readings: MULTI-AGENT SYSTEMS

Core Component: Multi-Agent Collaboration - Coordination, Communication, and Collective Intelligence

Understanding how multiple agents work together to solve complex problems. Key Concepts: Agent communication protocols, collaborative problem-solving, role-based coordination, multi-agent architectures

Topic Slide Deck Previous Semester
Agent - Multiagent Collaboration W11.1.Team5-agent 25course
MultiAgent LLMs W13-MultiAgentLLMs 24course

2025 HIGH-IMPACT PAPERS on this topic

  • a. MAR: Multi-Agent Reflexion Improves Reasoning (December 2025)
    • Link: https://arxiv.org/abs/2512.20845
    • Key Idea: Multi-persona debators prevent degeneration of thought
    • Results: 47% EM on HotPot QA, 82.7% on HumanEval
  • b. Towards a Science of Scaling Agent Systems (December 2025)
    • Link: https://arxiv.org/abs/2512.08296

    Quantitative Scaling Laws:

    • 180 configurations tested: 5 architectures (single, independent, centralized, decentralized, hybrid) × 3 LLM families × 4 benchmarks
    • Key findings:
      • Capability saturation: Coordination has diminishing returns above ~45% single-agent baseline
      • Error amplification: Independent agents amplify errors 17.2×, centralized reduces to 4.4×
      • Task dependency: Centralized excels on parallelizable tasks (+80.8%), decentralized on web navigation (+9.2%)
      • Sequential tasks: All multi-agent variants degrade performance by 39-70%
    • Predictive framework: 87% accuracy on held-out configurations
    • Validated on GPT-5.2 (MAE=0.071)
  • c. Multi-Agent Collaboration Mechanisms: A Survey of LLMs (January 2025)
    • Link: https://arxiv.org/abs/2501.06322

    Framework Dimensions:

    • Actors: Agents involved in collaboration
    • Types: Cooperation, competition, coopetition
    • Structures: Peer-to-peer, centralized, distributed
    • Strategies: Role-based, model-based
    • Coordination protocols: Communication patterns
    • Applications: 5G/6G networks, Industry 5.0, question answering, social/cultural settings

Please click each post's URL shown below to check out its full contents.

24.Agent - Multiagent collaboration

  • Team: Multi-Agents
Multiagent

Summary of Post :


Please click each post's URL shown below to check out its full contents.

25.Agents Optimization

  • Team: Agents Optimization
Customization

Summary of Post :

In this session, our readings cover:

Required Readings: MODEL TRAINING & OPTIMIZATION

Core Component: Improving the Agent Brain - Training, Fine-tuning, and Optimization

Techniques for improving model capabilities and efficiency.

Key Concepts: Evaluation frameworks, guardrails, alignment (RLHF, PPO, DPO), risk assessment, jailbreaking defense, fairness, bias mitigation, toxicity prevention, agent safety protocols Key Concepts: Data preparation, instruction tuning, LoRA/DoRA, parameter-efficient fine-tuning, scaling laws, efficiency optimization

Topic Slide Deck Previous Semester
Platform - Model Customization (Instruction Tuning/LoRA) W8.1-LoRA-Team5 25course
LLM Alignment - PPO W11.2-team6-PPO 25course
LLM Post-training W14.3.DPO 25course
Open Source LLM - Mistral Data Preparation W4-OpenSourceLLM 24course
Scaling Law and Efficiency W11-ScalinglawEfficientLLM 24course
LLM Fine Tuning W14-LLM-FineTuning 24course
Model Editing and Disgorgement W10-T5-ModelEditing 24course

2025 HIGH-IMPACT PAPERS on this topic

  • b. The Landscape of Agentic Reinforcement Learning for LLMs (September 2025)
    • Referenced in: https://github.com/zjunlp/LLMAgentPapers
    • Taxonomy of agentic RL approaches
    • Training methods: GRPO, PPO variations, RLVR
    • Policy optimization: Group-in-Group, Stepwise Progress Attribution (SPA-RL)
    • Challenges: Reward hacking, sample efficiency, exploration-exploitation
    • Applications: Reasoning, planning, multi-agent coordination
    • Key Papers Covered:
      • GRPO (Group Relative Policy Optimization)
      • History Resampling Policy Optimization (SRPO)
      • PVPO (Pre-Estimated Value-Based Policy Optimization)
  • Two papers on RL for discreate diffusion models:
  • A Reparameterized Discrete Diffusion Model for Text Generation / - In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from <7% to ≈90%, even outperforming ARMs with 7× as many parameters and that were explicitly trained via teacher forcing to learn the right order of decoding.
  • Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions / - This work studies discrete diffusion probabilistic models with applications to natural language generation. We derive an alternative yet equivalent formulation of the sampling from discrete diffusion processes and leverage this insight to develop a family of reparameterized discrete diffusion models. The derived generic framework is highly flexible, offers a fresh perspective of the generation process in discrete diffusion models, and features more effective training and decoding techniques. We conduct extensive experiments to evaluate the text generation capability of our model, demonstrating significant improvements over existing diffusion models. Comments: COLM 2024; Code available at this https URL

More Readings:


Please click each post's URL shown below to check out its full contents.

26.Agents Optimization

  • Team: Agents Optimization
Customization

Summary of Post :


Please click each post's URL shown below to check out its full contents.

27.buffer

  • Team: buffer
Safety Agent

Summary of Post :

In this session, our readings cover:

Required Readings:


Please click each post's URL shown below to check out its full contents.

BackTop