2025 Spring UVa CS Generative AI Seminar Lectures Organized by Given Order

No.	Title
1	Introduction
2	LLM basics - Preference Alignment
3	More - LLM Alignments
4	Intro - LLM Agents
5	More Intro on Language agents
6	More Intro on Language agents
7	Agents Science Applications
8	Agents HCLS Applications
9	FM/Agent - in Life Science / Pharma
10	Agent - Perception
11	Agent Brain - Reasoning
12	Diffusion models
13	Agent - Memory
14	Agent - Memory
15	Model Serving for Agents
16	Model Serving for Agents
17	Agent Evaluation
18	Agent Safety
19	Agent - Planning / Test time scaling
20	Agent - Planning
21	Agent - World model
22	Agent - World model
23	Agent - Multiagent collaboration
24	Agent - Multiagent collaboration
25	Agents Optimization
26	Agents Optimization
27	buffer

---- ----

1.Introduction

Lecture: 2026-SP-W1.1-IntroLLM.pdf
Version: current
Team: LLM basics

BasicLLM

Summary of Post :

Background Readings:

Basics of ML and DL:

Basics of deep NLP

URL
Typical NLP tasks / Challenges / Pipeline
f() on natural language
- Before Deep NLP (Pre 2012) • (BOW / LSI / Topic Modeling LDA )
- Word2Vec (2013-2016) • (GloVe/ FastText)
- Recurrent NN (2014-2016) • LSTM
- Seq2Seq
- Attention
- Self-Attention (2016 – now )
- Transformer (attention only Seq2Seq)
- BERT / RoBERTa/ XLNet/ GPT / …
A good code walk through on transformer at URL

Please click each post's URL shown below to check out its full contents.

2.LLM basics - Preference Alignment

Lecture: 2026-SP-W1.2-basic-llm-alignment.pdf
Version: current
Team: Basic Preference Optimization

BasicLLM

Summary of Post :

In this session, our readings cover:

Reading

Topic	Slide Deck	Previous Semester
Introduction to Deep NLP Basics	W1.1-deepNNtext	25course
LLM Basics - Emergent Ability and GenAI Platform	W1.2-IntroLLMv3	25course
More LLM Basics - A Survey	W2.1-moreLLM	25course
LLM Basics Foundation	S0-Intro	24course
Survey: LLMs and Multimodal FMs	S1-LLM	24course
Recent LLM Basics	W13-RecentLLMbasics	24course
Advanced Transformer Architectures	W14_LLM_advanced_arch	24course

A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

[Submitted on 23 Jul 2024]
Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James)Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire)Cheng
With advancements in self-supervised learning, the availability of trillions tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectation. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.
Extra Readings:

Please click each post's URL shown below to check out its full contents.

3.More - LLM Alignments

Lecture: 2026-SP-W2.1-advanced-llm-alignment.pdf
Version: current
Team: More PO

Agent Components

Summary of Post :

In this session, our readings cover:

Required Readings:

Extra Readings:

Please click each post's URL shown below to check out its full contents.

4.Intro - LLM Agents

Lecture: 2026-SP-W2.2-basic-agent.pdf
Version: current
Team: Agent basics

Agent Components

Summary of Post :

In this session, our readings cover:

Reading on: FOUNDATIONS - The Agent “Basics” Components

Review the core components of LLM agent architectures: Brain (Reasoning Engine), Perception (Input Processing), Memory Systems, Action & Tools, Planning & Orchestration, Multi-Agent Collaboration, and Safety & Evaluation.

┌─────────────────────────────────────────────────────────────┐
│                     AGENT ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  BRAIN (Reasoning Engine) ────────────────────┐            │
│   ↓                                            │            │
│  PERCEPTION (Input Processing) ←───────────────┤            │
│   ↓                                            │            │
│  MEMORY (Context & Knowledge) ←────────────────┤            │
│   ↓                    ↓                       │            │
│  WORLD MODEL (Environment Understanding) ←─────┤            │
│   ↓                                            │            │
│  PLANNING (Task Decomposition) ←───────────────┤            │
│   ↓                                            │            │
│  ACTION (Tool Use & Execution) ←───────────────┤            │
│   ↓                                            │            │
│  MULTI-AGENT (Collaboration) ←─────────────────┤            │
│   ↓                                            │            │
│  SAFETY & EVALUATION ──────────────────────────┘            │
│   ↓                                                          │
│  DEPLOYMENT & SERVING                                        │
│   ↓                                                          │
│  APPLICATIONS                                               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Core Component: LLM as the Central Reasoning Engine

Understanding the foundation model that serves as the “brain” of agentic systems - the core reasoning, language understanding, and decision-making capabilities.

Key Concepts: Deep neural networks, transformer architecture, emergent abilities, multimodal capabilities, recent architectural advances

2025 HIGH-IMPACT PAPERS on this topic

a. Large Language Model Agent: A Survey on Methodology, Applications and Challenges (March 2025)
- Link: https://arxiv.org/abs/2503.21460
- GitHub: https://github.com/luo-junyu/Awesome-Agent-Papers
- Framework Coverage: Brain-Perception-Action model, memory systems, planning mechanisms, multi-agent coordination, evolutionary pathways, evaluation methodologies
b. A Survey on Large Language Model based Autonomous Agents** (Updated March 2025)
- arXiv: https://arxiv.org/abs/2308.11432
- Unified framework: Brain (profiling, memory, planning, action)
- Extensive application coverage: single-agent, multi-agent, human-agent cooperation
- Agent societies analysis: behavior, personality, social phenomena

5.More Intro on Language agents

Team: differnt agent frameworks

MCP

Summary of Post :

In this session, our readings cover:

Required Readings: ACTION & TOOL USE

Understanding how agents tooling frameworks and how they execute actions through external tools, APIs, and interfaces.

Core Component: Agent-Computer Interface (ACI) - How Agents Interact with Tools and Systems Key Concepts: Prompt engineering, tool calling, function APIs, agent tooling frameworks, efficient tool use

Topic	Slide Deck	Previous Semester
Platform - Prompting Engineering Tools / Compression	W5.1.Team5-Prompt	25course
Platform - Agent Tooling	W6.1-team2-master-ai-agent-book-review	25course
Platform - More Agent Related	W6.2-team2-agent24-full	25course
Prompt Engineering	W11-team-2-prompt-engineering-2	24course
Bonus Session: KV Cache, Tooling and WMDP	W15-KVcahe-WMDP-Tools	24course

2025 HIGH-IMPACT PAPERS on this topic

c. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? (September 2025)
- arXiv: https://arxiv.org/abs/2509.16941
- Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public
- 1,865 problems from 41 actively maintained repositories
- Enterprise-level complexity: Tasks requiring hours to days for professional engineers
- Multi-file modifications: Substantial code changes across repositories
- Three datasets: Public (11 repos), held-out (12 repos), commercial (18 proprietary repos)
- Contamination-resistant: GPL-licensed and commercial codebases

6.More Intro on Language agents

Team: differnt agent frameworks

MCP

Summary of Post :

In this session, our readings cover:

Required Readings:

a. moltbolt / openclaw @ https://openclaw.ai
- https://github.com/openclaw/openclaw

7.Agents Science Applications

Application

Summary of Post :

In this session, our readings cover:

Required Readings: AGENT APPLICATIONS

Core Component: Translating Agent Architectures into Real-World Systems

Focus on how agent capabilities are adapted to specific domains and product workflows, including user experience, operational constraints, and measurable impact.

Key Concepts: Domain adaptation, workflow integration, human-in-the-loop design, reliability in production, evaluation in context, compliance and governance, and case studies (software, education, healthcare, finance, science, and robotics)

Topic	Slide Deck	Previous Semester
Survey: LLMs and Multimodal FMs	S1-LLM	24course
Survey - FMs in Robotics	W3.2-GenAI-Robotics	25course
Multimodal FMs - Video/Audio	W12.1.25-multimodalGenAI	25course
Domain Centered FMs	W9-T2-domain-LLM	24course
science Agents	W12-Team2-LLMAgents	24course

a. Deep Research: A Survey of Autonomous Research Agents (August 2025)
- Link: https://arxiv.org/html/2508.12752v1
- Research Agent Architecture:
  - Planning strategies: World model simulation, modular design search, human-like reasoning synthesis, self-refinement
  - World models: LLMs as implicit world models, graph-based structured knowledge
  - Meta-learning: MPO (Meta-Plan Optimization) - adaptive tuning across environments
  - Architecture search: AgentSquare for automatic pipeline assembly
- DeepResearchBench: Evaluates report fidelity, citation accuracy, comprehensive coverage
- Key Challenge: Plan brittleness, lack of robustness to ambiguous queries, evaluation coarseness
b. A Survey of LLM-based Agents in Medicine: How far are we from Baymax?
- https://arxiv.org/abs/2502.11211
- Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Jiaming Ji, Wenting Chen, Xiang Li, Yixuan Yuan
- Large Language Models (LLMs) are transforming healthcare through the development of LLM-based agents that can understand, reason about, and assist with medical tasks. This survey provides a comprehensive review of LLM-based agents in medicine, examining their architectures, applications, and challenges. We analyze the key components of medical agent systems, including system profiles, clinical planning mechanisms, medical reasoning frameworks, and external capacity enhancement. The survey covers major application scenarios such as clinical decision support, medical documentation, training simulations, and healthcare service optimization. We discuss evaluation frameworks and metrics used to assess these agents’ performance in healthcare settings. While LLM-based agents show promise in enhancing healthcare delivery, several challenges remain, including hallucination management, multimodal integration, implementation barriers, and ethical considerations. The survey concludes by highlighting future research directions, including advances in medical reasoning inspired by recent developments in LLM architectures, integration with physical systems, and improvements in training simulations. This work provides researchers and practitioners with a structured overview of the current state and future prospects of LLM-based agents in medicine.
c. Evaluating large language models and agents in healthcare: key challenges in clinical applications
- https://doi.org/10.1016/j.imed.2025.03.002
- Large language models (LLMs) have emerged as transformative tools with significant potential across healthcare and medicine. In clinical settings, they hold promises for tasks ranging from clinical decision support to patient education. Advances in LLM agents further broaden their utility by enabling multimodal processing and multitask handling in complex clinical workflows. However, evaluating the performance of LLMs in medical contexts presents unique challenges due to the high-risk nature of healthcare and the complexity of medical data. This paper provides a comprehensive overview of current evaluation practices for LLMs and LLM agents in medicine. We contributed 3 main aspects: First, we summarized data sources used in evaluations, including existing medical resources and manually designed clinical questions, offering a basis for LLM evaluation in medical settings. Second, we analyzed key medical task scenarios: closed-ended tasks, open-ended tasks, image processing tasks, and real-world multitask scenarios involving LLM agents, thereby offering guidance for further research across different medical applications. Third, we compared evaluation methods and dimensions, covering both automated metrics and human expert assessments, while addressing traditional accuracy measures alongside agent-specific dimensions, such as tool usage and reasoning capabilities. Finally, we identified key challenges and opportunities in this evolving field, emphasizing the need for continued research and interdisciplinary collaboration between healthcare professionals and computer scientists to ensure safe, ethical, and effective deployment of LLMs in clinical practice.

Please click each post's URL shown below to check out its full contents.

8.Agents HCLS Applications

HCLS Application

Summary of Post :

Background Readings:

Structure-based drug design with geometric deep learning

https://doi.org/10.1016/j.sbi.2023.102548
Structure-based drug design uses three-dimensional geometric information of macromolecules, such as proteins or nucleic acids, to identify suitable ligands. Geometric deep learning, an emerging concept of neural-network-based machine learning, has been applied to macromolecular structures. This review provides an overview of the recent applications of geometric deep learning in bioorganic and medicinal chemistry, highlighting its potential for structure-based drug discovery and design. Emphasis is placed on molecular property prediction, ligand binding site and pose prediction, and structure-based de novo molecular design. The current challenges and opportunities are highlighted, and a forecast of the future of geometric deep learning for drug discovery is presented. Questions answered in this article
Structure-based drug design is based on methods that leverage three-dimensional (3D) structures of macromolecular targets, such as proteins and nucleic acids, for decision-making in medicinal chemistry [1,2]. Structure-based modeling is well established throughout the drug discovery process, aiming to rationalize non-covalent interactions between ligands and their target macromolecule(s) [3]. The questions addressed with structure-based approaches include molecular property prediction, ligand binding site recognition, binding pose estimation, as well as de novo design [4, 5, 6, 7]. For such tasks, detailed knowledge of the 3D structure of the investigated macromolecular surfaces and ligand–receptor interfaces is essential. Recently, an emerging concept of neural-network-based “artificial intelligence”, geometric deep learning, has been introduced to solve numerous problems in the molecular sciences, including structure-based drug discovery and design [8].

Generative models for molecular discovery: Recent advances and challenges

Camille Bilodeau, Wengong Jin, Tommi Jaakkola, Regina Barzilay, Klavs F. Jensen
05 March 2022 https://doi.org/10.1002/wcms.1608Citations
Development of new products often relies on the discovery of novel molecules. While conventional molecular design involves using human expertise to propose, synthesize, and test new molecules, this process can be cost and time intensive, limiting the number of molecules that can be reasonably tested. Generative modeling provides an alternative approach to molecular discovery by reformulating molecular design as an inverse design problem. Here, we review the recent advances in the state-of-the-art of generative molecular design and discusses the considerations for integrating these models into real molecular discovery campaigns. We first review the model design choices required to develop and train a generative model including common 1D, 2D, and 3D representations of molecules and typical generative modeling neural network architectures. We then describe different problem statements for molecular discovery applications and explore the benchmarks used to evaluate models based on those problem statements. Finally, we discuss the important factors that play a role in integrating generative models into experimental workflows. Our aim is that this review will equip the reader with the information and context necessary to utilize generative modeling within their domain.

9.FM/Agent - in Life Science / Pharma

Team: Life science Agent

Perception Healthcare

Summary of Post :

In this session, our readings cover:

Required Readings:

Understanding how agents perceive and process actions in a special domain like healthcare

Key Concepts: Domain-specific perception, multimodal input processing, specialized domain understanding (bio, healthcare, robotics)

Topic	Slide Deck	Previous Semester
Survey - BioScience LLMs	W2.2-bioLM	25course
Survey - FMs in Healthcare	W3.1-GenAI-healthcare	25course
Agent - In Healthcare	W9.1-HealthAI-agenticHealth	25course

2025 HIGH-IMPACT PAPERS on this topic

Advancing regulatory variant effect prediction with AlphaGenome

Nature 2026 / from Google DeepMind
Model shared https://github.com/google-deepmind/alphagenome_research
Abstract: Deep learning models that predict functional genomic measurements from DNA sequences are powerful tools for deciphering the genetic regulatory code. Existing methods involve a trade-off between input sequence length and prediction resolution, thereby limiting their modality scope and performance1,2,3,4,5. We present AlphaGenome, a unified DNA sequence model, which takes as input 1 Mb of DNA sequence and predicts thousands of functional genomic tracks up to single-base-pair resolution across diverse modalities. The modalities include gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest available external models in 25 of 26 evaluations of variant effect prediction. The ability of AlphaGenome to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically relevant variants near the TAL1 oncogene6. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.

Genome modeling and design across all domains of life with Evo 2

Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, View ORCID ProfileAditi T. Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Etowah Adams, Stephen A. Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, View ORCID ProfileMohammad R.K. Mofrad, Madelena Y. Ng, Jaspreet Pannu, Christopher Ré, Jonathan C. Schmok, John St. John, Jeremy Sullivan, Kevin Zhu, Greg Zynda, Daniel Balsam, Patrick Collison, Anthony B. Costa, Tina Hernandez-Boussard, Eric Ho, Ming-Yu Liu, Thomas McGrath, Kimberly Powell, Dave P. Burke, View ORCID ProfileHani Goodarzi, View ORCID ProfilePatrick D. Hsu, View ORCID ProfileBrian L. Hie
doi: https://doi.org/10.1101/2025.02.18.638918
All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling

Jesse Zhang, Airol A Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, Aidan Winters, Umair Khan, Matthew G. Jones, Vuong Tran, Joseph Pangallo, Efthymia Papalexi, Ajay Sapre, Hoai Nguyen, Oliver Sanderson, Maria Nigos, Olivia Kaplan, Sarah Schroeder, Bryan Hariadi, Simone Marrujo, Crina Curca Alec Salvino, Guillermo Gallareta Olivares, Ryan Koehler, Gary Geiss, Alexander Rosenberg, Charles Roco, Daniele Merico, Nima Alidoust, View ORCID ProfileHani Goodarzi, View ORCID ProfileJohnny Yu
doi: https://doi.org/10.1101/2025.02.20.639398
Building predictive models of the cell requires systematically mapping how perturbations reshape each cell’s state, function, and behavior. Here, we present Tahoe-100M, a giga-scale single-cell atlas of 100 million transcriptomic profiles measuring how each of 1,100 small-molecule perturbations impact cells across 50 cancer cell lines. Our high-throughput Mosaic platform, composed of a highly diverse and optimally balanced “cell village”, reduces batch effects and enables parallel profiling of thousands of conditions at single-cell resolution at an unprecedented scale. As the largest single-cell dataset to date, Tahoe-100M enables artificial-intelligence (AI)-driven models to learn context-dependent functions, capturing fundamental principles of gene regulation and network dynamics. Although we leverage cancer models and pharmacological compounds to create this resource, Tahoe-100M is fundamentally designed as a broadly applicable perturbation atlas and supports deeper insights into cell biology across multiple tissues and contexts. By publicly releasing this atlas, we aim to accelerate the creation and development of robust AI frameworks for systems biology, ultimately improving our ability to predict and manipulate cellular behaviors across a wide range of applications.

Background Readings:

DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

[Submitted on 4 Oct 2022 (v1), last revised 11 Feb 2023 (this version, v2)]
Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola
Predicting the binding structure of a small molecule ligand to a protein – a task known as molecular docking – is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD<2A) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, while previous methods are not able to dock on computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher precision (21.7%). Finally, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.
Comments: International Conference on Learning Representations (ICLR 2023)

Highly accurate protein structure prediction with AlphaFold

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort1,2,3,4, the structures of around 100,000 unique proteins have been determined5, but this represents a small fraction of the billions of known protein sequences6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’8—has been an important open research problem for more than 50 years9. Despite recent progress10,11,12,13,14, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)15, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

Evolutionary-scale prediction of atomic-level protein structure with a language model

Speedy structures from single sequences: Machine learning methods for protein structure prediction have taken advantage of the evolutionary information present in multiple sequence alignments to derive accurate structural information, but predicting structure accurately from a single sequence is much more difficult. Lin et al. trained transformer protein language models with up to 15 billion parameters on experimental and high-quality predicted structures and found that information about atomic-level structure emerged in the model as it was scaled up. They created ESMFold, a sequence-to-structure predictor that is nearly as accurate as alignment-based methods and considerably faster. The increased speed permitted the generation of a database, the ESM Metagenomic Atlas, containing more than 600 million metagenomic proteins. —MAF
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.

Accurate prediction of protein structures and interactions using a three-track neural network

Science 19 Aug 2021
Deep learning takes on protein folding: In 1972, Anfinsen won a Nobel prize for demonstrating a connection between a protein’s amino acid sequence and its three-dimensional structure. Since 1994, scientists have competed in the biannual Critical Assessment of Structure Prediction (CASP) protein-folding challenge. Deep learning methods took center stage at CASP14, with DeepMind’s Alphafold2 achieving remarkable accuracy. Baek et al. explored network architectures based on the DeepMind framework. They used a three-track network to process sequence, distance, and coordinate information simultaneously and achieved accuracies approaching those of DeepMind. The method, RoseTTA fold, can solve challenging x-ray crystallography and cryo–electron microscopy modeling problems and generate accurate models of protein-protein complexes. —VV
DeepMind presented notably accurate predictions at the recent 14th Critical Assessment of Structure Prediction (CASP14) conference. We explored network architectures that incorporate related ideas and obtained the best performance with a three-track network in which information at the one-dimensional (1D) sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging x-ray crystallography and cryo–electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.

Transformer protein language models are unsupervised structure learners

https://doi.org/10.1101/2020.12.15.422761
Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find the highest capacity models that have been trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.1

PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding

[Submitted on 5 Jun 2022 (v1), last revised 19 Sep 2022 (this version, v2)]
Minghao Xu, Zuobai Zhang, Jiarui Lu, Zhaocheng Zhu, Yangtian Zhang, Chang Ma, Runcheng Liu, Jian Tang
We are now witnessing significant progress of deep learning methods in a variety of tasks (or datasets) of proteins. However, there is a lack of a standard benchmark to evaluate the performance of different methods, which hinders the progress of deep learning in this field. In this paper, we propose such a benchmark called PEER, a comprehensive and multi-task benchmark for Protein sEquence undERstanding. PEER provides a set of diverse protein understanding tasks including protein function prediction, protein localization prediction, protein structure prediction, protein-protein interaction prediction, and protein-ligand interaction prediction. We evaluate different types of sequence-based methods for each task including traditional feature engineering approaches, different sequence encoding methods as well as large-scale pre-trained protein language models. In addition, we also investigate the performance of these methods under the multi-task learning setting. Experimental results show that large-scale pre-trained protein language models achieve the best performance for most individual tasks, and jointly training multiple tasks further boosts the performance. The datasets and source codes of this benchmark are all available at this https URL Comments: Accepted by NeurIPS 2022 Dataset and Benchmark Track. a

Please click each post's URL shown below to check out its full contents.

10.Agent - Perception

Team: Input module for LLM agents

Perception

Summary of Post :

In this session, our readings cover:

Required Readings:

11.Agent Brain - Reasoning

Team: world model

Reasoning

Summary of Post :

In this session, our readings cover:

Required Readings: REASONING & COGNITION

Core Component: Advanced Reasoning Capabilities of the Agent Brain

Exploring how agents reason through complex problems, including code generation, mathematical reasoning, and domain-specific reasoning.

Key Concepts: Chain-of-thought reasoning, code generation, mathematical reasoning, self-examination, test-time compute scaling

Topic	Slide Deck	Previous Semester
Advanced LLM - Code Reasoning	W4.1-Gen AI-code	25course
Advanced LLM - Math Reasoning	W4.2-LLM-Math-Reasoning	25course
Inference Test Time Scaling Law	Week14.1-T5-Test-Time-Scaling	25course
Self-exam LLM and Reasoning	W12-team-2-self-exam-LLM	24course

2025 HIGH-IMPACT PAPERS on this topic

a. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025)
- Authors: DeepSeek-AI (198 authors)
- Venue: Nature (September 2025) + arXiv
- arXiv: https://arxiv.org/abs/2501.12948
- Nature: https://www.nature.com/articles/s41586-025-09422-z
- HuggingFace: https://huggingface.co/papers/2501.12948
- GitHub: https://github.com/deepseek-ai/DeepSeek-R1
- Pure RL approach - Shows reasoning emerges without supervised demonstrations
- Remarkable results: AIME 2024 accuracy jumped from 15.6% → 71.0% (pass@1) → 86.7% (majority voting), matching OpenAI o1
- Emergent behaviors: Self-reflection, verification, strategy adaptation, “aha moments”
- Open source: Released models from 1.5B to 671B parameters
- Industry impact: Triggered the “reasoning model” race across all major labs
- Key Innovation: Demonstrates that advanced reasoning patterns emerge naturally through GRPO (Group Relative Policy Optimization) without human-labeled trajectories. The paper shows thinking time scales with performance - agents learn to “think longer” for harder problems.
b. Reasoning Language Models: A Blueprint (January 2025)
- https://arxiv.org/abs/2501.11223
- Reinforcement learning approaches for reasoning
- Connects to DeepSeek-R1, Kimi k1.5, and other reasoning models
- Comprehensive taxonomy of RLVR (Reinforcement Learning with Verifiable Rewards)
- Discusses emergent reasoning patterns and distillation to smaller models
c. Kimi k1.5: Scaling Reinforcement Learning with LLMs (January 2025)
- Link: https://arxiv.org/abs/2501.12599
Contribution: Alternative approach to scaling reasoning via RL
- Complements DeepSeek-R1 with different architectural choices
- Emphasizes scaling strategies for RL training
- Addresses computational efficiency in large-scale RL

12.Diffusion models

Team: other foundation model

Reasoning

Summary of Post :

In this session, our readings cover:

Required Readings:

A Reparameterized Discrete Diffusion Model for Text Generation
- Lin Zheng, Jianbo Yuan, Lei Yu, Lingpeng Kong
- This work studies discrete diffusion probabilistic models with applications to natural language generation. We derive an alternative yet equivalent formulation of the sampling from discrete diffusion processes and leverage this insight to develop a family of reparameterized discrete diffusion models. The derived generic framework is highly flexible, offers a fresh perspective of the generation process in discrete diffusion models, and features more effective training and decoding techniques. We conduct extensive experiments to evaluate the text generation capability of our model, demonstrating significant improvements over existing diffusion models.
- COLM 2024;
- Code available at this URL
Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
- ICML oral
- Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, Sitan Chen
- In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from <7% to ≈90%, even outperforming ARMs with 7× as many parameters and that were explicitly trained via teacher forcing to learn the right order of decoding.

13.Agent - Memory

Team: Memory system for LLM agents

Context

Summary of Post :

In this session, our readings cover:

Required Readings: MEMORY SYSTEMS

Exploring how agents maintain, retrieve, and use information across interactions.

Core Component: Agent Memory Architecture - Context, Knowledge, and Persistence Key Concepts: RAG systems, long-term vs short-term memory, context window management, knowledge augmentation, hallucination mitigation, model editing

Topic	Slide Deck	Previous Semester
Platform - Context Construction via RAG and Agent	W5.2.Team6-RAGagent	25course
Platform - Long Context vs RAG + Hallucination	W9.2-Team2-longContext	25course
Knowledge Augmented FMs	W8-T1-KnowledgeAugmentedFMs.pdf	24course
LLM Hallucination	W9-Team3-P4-hallucination	24course

2025 HIGH-IMPACT PAPERS on this topic

a. Memory in the Age of AI Agents: A Survey (2025)
- GitHub Repository: https://github.com/Shichun-Liu/Agent-Memory-Paper-List
- Comprehensive Coverage of Memory Systems:
  - MIRIX: Multi-Agent Memory System (July 2025)
  - Hierarchical Memory: Efficient long-term reasoning (July 2025)
  - G-Memory: Tracing memory for multi-agent systems (June 2025)
  - MemGuide: Intent-driven memory selection (May 2025)
  - EverMemOS: Self-organizing memory operating system (January 2026)
  - Key Distinction: Agent memory vs LLM memory vs RAG vs context engineering
- Major Papers:
  - A-MEM: Agentic Memory for LLM Agents (Feb 2025)
  - WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning (Dec 2025)
  - CAM: Constructivist View of Agentic Memory (Oct 2025)

14.Agent - Memory

Team: Memory system for LLM agents

Context

Summary of Post :

In this session, our readings cover:

Required Readings:

15.Model Serving for Agents

Team: Agents with efficient model serving

Efficiency

Summary of Post :

In this session, our readings cover:

Readings: DEPLOYMENT & SERVING

Core Component: Production Infrastructure - Deploying and Serving Agents at Scale

Understanding the infrastructure and systems for deploying agents in production. Key Concepts: Model serving systems, vLLM, KV cache optimization, inference efficiency, chunked prefill, monitoring and interpretability

Topic	Slide Deck	Previous Semester
Platform - Model Serving	W8.2-Model Serving-team6-t5	25course
More Model Serving - SGlang + Chunked Prefill	W12.2-Model-Serving	25course
Model Serving - Efficiency Inference	W14.2.ModelServing	25course
Model Interpretability for FM	W13.2-GenAI-Interpretability	25course
LLM Interpretability, Trust and Knowledge Conflicts	W10-T6-LLMInterpretibility	24course

Multiple system ML readings

[Scheduling] Chunked Prefill (OSDI’24): This is perhaps the most widely adopted scheduling policy in today’s LLM serving systems, which proposes a simple, straightforward idea but works very well. Since it is optimized from Continuous Batching (OSDI’22).
[Disaggregated Serving] Splitwise (ISCA’24) / DistServe (OSDI’24): These two papers share a similar idea, separating prefill/decode across different nodes based on stage-specific characteristics. These are also intuitive ideas and are being merged into vLLM.
[KV Cache, Tooling] SGLang (NIPS’24): It is a widely used serving framework, an alternative to vLLM. Or, it is more like a programming language tailored to LLM application developers, greatly simplifying the code they need to write. At the core of it is RadixAttention designed for efficient KV cache reuse.
[Disaggregated Serving] Helix (ASPLOS’25): This proposes an optimized LLM sharding strategy in a heterogenous cluster to achieve optimal resource allocation.- Disaggregated Serving] ServerlessLLM (OSDI’24): This proposes an efficient live migration of LLM inference on the cloud without losing efficiency.
[Scheduling] SJF (NIPS’24): This proposes a statistics-based online algorithm to approximate shortest-job-first scheduling in online LLM inference.
[Offloading] FlexGen (ICML’23): This proposes the first offloading strategy specifically for inference systems.

Auditing Prompt Caching in Language Model APIs

[Submitted on 11 Feb 2025]
https://arxiv.org/abs/2502.07776
Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users’ prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users’ prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI’s embedding model is a decoder-only Transformer, which was previously not publicly known.

16.Model Serving for Agents

Team: Agents with efficient model serving

Efficiency

Summary of Post :

In this session, our readings cover:

Required Readings:

17.Agent Evaluation

Team: Benchmarks for evaluating LLM agents

Benchmarks

Summary of Post :

In this session, our readings cover:

Required Readings: Agent Benchmarking and Benmarks

OSWorld Leaderboard: https://os-world.github.io/ (Industry standard for computer-use evaluation)
WebArena Project: https://webarena.dev/ (Foundational for web agent development)
AgentBench GitHub: https://github.com/THUDM/AgentBench
a. Evaluation and Benchmarking of LLM Agents: A Survey (July 2025)
- Link: https://arxiv.org/html/2507.21504v1
- Comprehensive taxonomy: Evaluation objectives (behavior, capabilities, reliability, safety) × evaluation process (interaction modes, datasets, metrics, tooling, environments)
- Enterprise focus: Role-based access control, reliability guarantees, long-term interaction, compliance
- Novel metrics: Consistency (pass@k vs all-k), robustness under input variations
b. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (April 2024, Major Updates 2025)
- arXiv: https://arxiv.org/abs/2404.07972
- Project: https://os-world.github.io/
- HuggingFace: https://huggingface.co/spaces/xlanglab/OSWorld
- First real computer environment benchmark (Ubuntu, Windows, macOS)
- 369 tasks across real web/desktop apps, file I/O, cross-app workflows
- Execution-based evaluation with custom scripts per task
- State-of-the-art results (2025): OpenAI Operator 38%, best open-source ~24%
- Reveals massive gap between current capabilities and human performance
- Industry Impact: Became the standard for evaluating computer-use agents (Claude Computer Use, OpenAI Operator, etc.)
c. WebArena: A Realistic Web Environment for Building Autonomous Agents (July 2023, Extensive 2025 Extensions)
- arXiv: https://arxiv.org/abs/2307.13854
- Project: https://webarena.dev/
- Extensions: WebChoreArena, ST-WebAgentBench
- Record performance: IBM CUGA achieved 61.7% (vs 14% in 2023)
- 812 templated tasks across e-commerce, forums, code repositories, CMS
- Extensions:
  - WebChoreArena: 532 tedium-focused tasks (top models: 37.8%)
  - ST-WebAgentBench: Safety/trust templates, policy compliance metrics
- Key insights: Success driven by Planner-Executor-Memory architecture + specialized training data
d. AgentBench: Evaluating LLMs as Agents (August 2023, Updated 2025)
- Venue: ICLR 2024
- arXiv: https://arxiv.org/abs/2308.03688
- GitHub: https://github.com/THUDM/AgentBench
Comprehensive Coverage:
- 8 environments: Code, game playing, web shopping, digital card games, lateral thinking, household tasks, web browsing, OS interaction
- Multi-dimensional evaluation: Breadth across domains reveals agent weak spots
- Function-calling version (2025): Integrated with AgentRL framework
- VisualAgentBench: Extension for multimodal agents (5 environments, 17 LMMs tested)

18.Agent Safety

Team: safety for agent LLM

Jailbreaking Safety

Summary of Post :

Required Readings: RISK, SAFETY, Evaluation & GUARDRAILS

Core Component: Agent Safety Systems - Ensuring Reliable, Ethical, and Secure Operation

Addressing safety, alignment, and ethical considerations in agent deployment.

Topic	Slide Deck	Previous Semester
Platform - Model Jailbreaking / Safeguarding	W7.1-team3-jailbreak	25course
Platform - VLM Jailbreaking / Probing	W7.2-team4-MMJailbreak-garak	25course
Agent Safety	W10.2-team4-agent-safety	25course
LLM Evaluating Framework	W3-LLMEvaluation-Team5	24course
GenAI Guardrails	W3-Guardrail-Team3	24course
Survey: Human Alignment	W4-LLM-Human-Alignment	24course
Survey: AI Risk Framework	W5-AI-RiskFramework	24course
FM Copyright Infringement	W5-FM-copyright-infrigement	24course
FM Privacy Leakage Issues	W6-FM-privacy-leakage	24course
FM Fairness / Bias Issues	W6-LLM-Bias-Fairness-Team5	24course
FM Toxicity / Harmful Outputs	W7-LLM-harm	24course
LLM Multimodal Harm Responses	W7-multimodal-LLMharm	24course
More FM Risk / Extra - Agent Guardrailing	W8-Team3-P3-moreRisk.pdf	25course

19.Agent - Planning / Test time scaling

Team: Agents planning

Planning

Summary of Post :

In this session, our readings cover:

Required Readings: PLANNING & ORCHESTRATION

Core Component: Agent Planning Module - Goal Decomposition and Strategy Formation

How agents break down complex tasks, form plans, and orchestrate multi-step workflows, leveraging world models when available. Key Concepts: Task decomposition, planning algorithms (with/without world models), agent workflows, domain-specific planning strategies, plan-then-act vs. continuous replanning

Topic	Slide Deck	Previous Semester
Agent - Planning / World Model	W10.1-Team 3-Planning	25course
Test time scaling	Week14.1-T5-Test-Time-Scaling	25course
Platform - Prompting Engineering Tools / Compression	W5.1.Team5-Prompt	25course
Prompt Engineering	W11-team-2-prompt-engineering-2	24course
LLM Alignment - PPO	W11.2-team6-PPO	25course
LLM Post-training	W14.3.DPO	25course
Scaling Law and Efficiency	W11-ScalinglawEfficientLLM	24course
LLM Fine Tuning	W14-LLM-FineTuning	24course

2025 HIGH-IMPACT PAPERS on this topic

a. The Landscape of Agentic Reinforcement Learning for LLMs (September 2025)
- Referenced in: https://github.com/zjunlp/LLMAgentPapers
- Taxonomy of agentic RL approaches
- Training methods: GRPO, PPO variations, RLVR
- Policy optimization: Group-in-Group, Stepwise Progress Attribution (SPA-RL)
- Challenges: Reward hacking, sample efficiency, exploration-exploitation
- Applications: Reasoning, planning, multi-agent coordination
- Key Papers Covered:
  - GRPO (Group Relative Policy Optimization)
  - History Resampling Policy Optimization (SRPO)
  - PVPO (Pre-Estimated Value-Based Policy Optimization)
a. EnCompass: Separating Search from Agent Workflows (December 2025)
- arXiv: https://arxiv.org/abs/2512.03571
- Press: https://techxplore.com/news/2025-12-ai-agents-results-large-language.html Key Innovation: Separates search strategy from workflow code
- Performance: 15-40% accuracy boost on code repository translation
- Search strategies: Backtracking, parallel exploration, beam search (best: two-level beam search)
Use Cases: Code translation, digital grid transformation rules
b. Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling (December 2025)
- Link: https://arxiv.org/abs/2512.14474
Two-Phase Paradigm:
1. Modeling Phase: LLM constructs explicit model (entities, state variables, actions, constraints)
2. Solution Phase: Generate plan based on explicit model
  - Reduces constraint violations across medical scheduling, route planning, resource allocation, logic puzzles
  - Outperforms Chain-of-Thought and ReAct
  - Critical finding: Many planning failures stem from representational deficiencies, not reasoning limitations
Domains Tested: Medical scheduling, route planning, resource allocation, logic puzzles, procedural synthesis

20.Agent - Planning

Team: Agents planning

Planning

Summary of Post :

In this session, our readings cover:

Required Readings:

End-to-End Test-Time Training for Long Context

Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture – a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model’s initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available. Comments: Code: this https URL

21.Agent - World model

Team: Understanding environments for Agents

Multimodal World model

Summary of Post :

In this session, our readings cover:

Required Readings: WORLD MODELS & ENVIRONMENT UNDERSTANDING

Core Component: Internal Representations - How Agents Model Their Environment

World models enable agents to build internal representations of their environment, predict outcomes, and simulate consequences before taking action. This bridges perception and planning.

Key Concepts: Environment modeling, state representation, predictive models, simulation-based planning, model-based reasoning

World Model Role in Agent Architecture:

Input: Receives data from Perception (Phase 3) and Memory (Phase 4)
Function: Builds internal representation of environment dynamics and causal relationships
Output: Informs Planning (Phase 7) by enabling agents to predict action consequences
Use Cases: Robotics, game playing, strategic decision-making, healthcare interventions

Topic	Slide Deck	Previous Semester
Agent - Planning / World Model	W10.1-Team 3-Planning	25course

2025 HIGH-IMPACT PAPERS on this topic

a. AgentGym-RL: Training Agents for Long-Horizon Decision Making (September 2025)
- https://github.com/WooooDyy/LLM-Agent-Paper-List
- RL version of AgentGym for learning from interactive environments
- Interactive frontend for trajectory visualization, multi-turn RL
b. DreamerV3: Mastering Diverse Control Tasks through World Models
- Nature (April 2025) / arXiv GitHub
- A general reinforcement-learning algorithm that outperforms specialized expert algorithms across diverse tasks by learning a model of the environment and improving its behaviour by imagining future scenarios.
- Dreamer succeeds across domains ranging from robot locomotion and manipulation tasks over Atari games, procedurally generated ProcGen levels, and DMLab tasks to the complex and infinite world of Minecraft.
- First algorithm to collect diamonds in Minecraft from scratch without human data or curricula
- Uses Recurrent State-Space Model (RSSM) for latent imagination and planning
c. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- arXiv GitHub Meta AI
- The first world model trained on video that achieves state-of-the-art visual understanding and prediction, enabling zero-shot robot control in new environments.
- Post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset enables zero-shot deployment on Franka arms without collecting any data from those environments.
- V-JEPA 2-AC achieves reach = 100%, manipulation = 60–80% compared to Cosmos’s reach = 80%, manipulation = 0–20%, while being 15× faster (16 seconds/action vs 4 minutes).
- Predicts in representation space rather than pixel space—key innovation for efficient planning
c. NVIDIA Cosmos: World Foundation Model Platform for Physical AI
- NVIDIA Cosmos Technical Report
- Open world foundation models (WFMs), guardrails, and data processing libraries to accelerate the development of physical AI for autonomous vehicles (AVs), robots, and video analytics AI agents.
- WFMs are purpose-built for physical AI research and development, and can generate physics-based videos from a combination of inputs, like text, image and video, as well as robot sensor or motion data.
- Cosmos Reason—a new open, customizable, 7-billion-parameter reasoning VLM for physical AI and robotics—lets robots and vision AI agents reason like humans using prior knowledge, physics understanding and common sense.
- Early adopters include 1X, Agility Robotics, Figure AI, Skild AI, Boston Dynamics
d. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- DeepMind Blog
- RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
- Thanks to its VLM backbone, RT-2 can plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
- Uses PaLM-E and PaLI-X backbones; demonstrates chain-of-thought reasoning for multi-stage semantic reasoning

22.Agent - World model

Team: Understanding environments for Agents

Multimodal World model

Summary of Post :

Please click each post's URL shown below to check out its full contents.

23.Agent - Multiagent collaboration

Team: Multi-Agents

Multiagent

Summary of Post :

In this session, our readings cover:

Required Readings: MULTI-AGENT SYSTEMS

Core Component: Multi-Agent Collaboration - Coordination, Communication, and Collective Intelligence

Understanding how multiple agents work together to solve complex problems. Key Concepts: Agent communication protocols, collaborative problem-solving, role-based coordination, multi-agent architectures

Topic	Slide Deck	Previous Semester
Agent - Multiagent Collaboration	W11.1.Team5-agent	25course
MultiAgent LLMs	W13-MultiAgentLLMs	24course

2025 HIGH-IMPACT PAPERS on this topic

a. MAR: Multi-Agent Reflexion Improves Reasoning (December 2025)
- Link: https://arxiv.org/abs/2512.20845
- Key Idea: Multi-persona debators prevent degeneration of thought
- Results: 47% EM on HotPot QA, 82.7% on HumanEval
b. Towards a Science of Scaling Agent Systems (December 2025)
- Link: https://arxiv.org/abs/2512.08296
Quantitative Scaling Laws:
- 180 configurations tested: 5 architectures (single, independent, centralized, decentralized, hybrid) × 3 LLM families × 4 benchmarks
- Key findings:
  - Capability saturation: Coordination has diminishing returns above ~45% single-agent baseline
  - Error amplification: Independent agents amplify errors 17.2×, centralized reduces to 4.4×
  - Task dependency: Centralized excels on parallelizable tasks (+80.8%), decentralized on web navigation (+9.2%)
  - Sequential tasks: All multi-agent variants degrade performance by 39-70%
- Predictive framework: 87% accuracy on held-out configurations
- Validated on GPT-5.2 (MAE=0.071)
c. Multi-Agent Collaboration Mechanisms: A Survey of LLMs (January 2025)
- Link: https://arxiv.org/abs/2501.06322
Framework Dimensions:
- Actors: Agents involved in collaboration
- Types: Cooperation, competition, coopetition
- Structures: Peer-to-peer, centralized, distributed
- Strategies: Role-based, model-based
- Coordination protocols: Communication patterns
- Applications: 5G/6G networks, Industry 5.0, question answering, social/cultural settings
e. From Single-Agent to Multi-Agent: Legal Agents Review (November 2025)
- Venue: AI Agent Journal 2025
- Link: https://www.oaepublish.com/articles/aiagent.2025.06
- Core tasks: Legal information retrieval, QA, judgment prediction, text generation
- Evaluation benchmarks: LAiW (Chinese practical), UCL-Bench (user-centric), JuDGE (judgment documents)
- Single-agent challenges: Trustworthiness, explainability, factuality
- Multi-agent systems: Collaborative reasoning, specialized roles (researcher, analyst, writer)
- Future directions: Cross-jurisdictional interoperability via legal knowledge graphs, ethical governance
g. LongVideoAgent: Multi-Agent Reasoning with Long Videos (December 2025)
- Link: https://arxiv.org/abs/2512.20618 Architecture:
- Master agent: Coordinates with step limit, trained via RL
- Grounding agent: Localizes question-relevant segments
- Vision agent: Extracts targeted textual observations from video
- Training: Reinforcement learning to encourage concise, correct, efficient cooperation
- Benchmark: LongTVQA and LongTVQA+ (episode-level datasets from TVQA/TVQA+)
- Results: Significantly outperforms non-agent baselines on hour-long video reasoning

Please click each post's URL shown below to check out its full contents.

24.Agent - Multiagent collaboration

Team: Multi-Agents

Multiagent

Summary of Post :

Please click each post's URL shown below to check out its full contents.

25.Agents Optimization

Team: Agents Optimization

Customization

Summary of Post :

In this session, our readings cover:

Required Readings: MODEL TRAINING & OPTIMIZATION

Core Component: Improving the Agent Brain - Training, Fine-tuning, and Optimization

Techniques for improving model capabilities and efficiency.

Key Concepts: Evaluation frameworks, guardrails, alignment (RLHF, PPO, DPO), risk assessment, jailbreaking defense, fairness, bias mitigation, toxicity prevention, agent safety protocols Key Concepts: Data preparation, instruction tuning, LoRA/DoRA, parameter-efficient fine-tuning, scaling laws, efficiency optimization

Topic	Slide Deck	Previous Semester
Platform - Model Customization (Instruction Tuning/LoRA)	W8.1-LoRA-Team5	25course
LLM Alignment - PPO	W11.2-team6-PPO	25course
LLM Post-training	W14.3.DPO	25course
Open Source LLM - Mistral Data Preparation	W4-OpenSourceLLM	24course
Scaling Law and Efficiency	W11-ScalinglawEfficientLLM	24course
LLM Fine Tuning	W14-LLM-FineTuning	24course
Model Editing and Disgorgement	W10-T5-ModelEditing	24course

2025 HIGH-IMPACT PAPERS on this topic

b. The Landscape of Agentic Reinforcement Learning for LLMs (September 2025)
- Referenced in: https://github.com/zjunlp/LLMAgentPapers
- Taxonomy of agentic RL approaches
- Training methods: GRPO, PPO variations, RLVR
- Policy optimization: Group-in-Group, Stepwise Progress Attribution (SPA-RL)
- Challenges: Reward hacking, sample efficiency, exploration-exploitation
- Applications: Reasoning, planning, multi-agent coordination
- Key Papers Covered:
  - GRPO (Group Relative Policy Optimization)
  - History Resampling Policy Optimization (SRPO)
  - PVPO (Pre-Estimated Value-Based Policy Optimization)
Two papers on RL for discreate diffusion models:
A Reparameterized Discrete Diffusion Model for Text Generation / - In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from <7% to ≈90%, even outperforming ARMs with 7× as many parameters and that were explicitly trained via teacher forcing to learn the right order of decoding.
Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions / - This work studies discrete diffusion probabilistic models with applications to natural language generation. We derive an alternative yet equivalent formulation of the sampling from discrete diffusion processes and leverage this insight to develop a family of reparameterized discrete diffusion models. The derived generic framework is highly flexible, offers a fresh perspective of the generation process in discrete diffusion models, and features more effective training and decoding techniques. We conduct extensive experiments to evaluate the text generation capability of our model, demonstrating significant improvements over existing diffusion models. Comments: COLM 2024; Code available at this https URL

26.Agents Optimization

Team: Prompt Tuning + Agent Customization + Model Optimization + Synthetic Data?

Customization

Summary of Post :

Please click each post's URL shown below to check out its full contents.

27.buffer

Team: buffer

Safety Agent

Summary of Post :

In this session, our readings cover:

Required Readings:

Please click each post's URL shown below to check out its full contents.

BackTop

2025 Spring UVa CS Generative AI Seminar Lectures Organized by Given Order

Summary of Post :

Background Readings:

Basics of ML and DL:

Basics of deep NLP

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Reading

Here are the related slide deck from the previous two course offerings:

A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

Extra Readings:

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings:

Extra Readings:

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Reading on: FOUNDATIONS - The Agent “Basics” Components

2025 HIGH-IMPACT PAPERS on this topic

More Readings:

A Survey on Large Language Model based Autonomous Agents

Deploying Foundation Model Powered Agent Services: A Survey

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings: ACTION & TOOL USE

2025 HIGH-IMPACT PAPERS on this topic

More Readings:

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings:

More Readings:

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings: AGENT APPLICATIONS

Here are the related slide deck from the previous two course offerings:

2025 HIGH-IMPACT PAPERS on a related topic:

More readings:

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Background Readings:

Structure-based drug design with geometric deep learning

Generative models for molecular discovery: Recent advances and challenges

More Readings:

The rise of agentic AI teammates in medicine

Lab-in-the-loop therapeutic antibody design with deep learning

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings:

2025 HIGH-IMPACT PAPERS on this topic

Advancing regulatory variant effect prediction with AlphaGenome

Genome modeling and design across all domains of life with Evo 2

Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling

Background Readings:

DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

Highly accurate protein structure prediction with AlphaFold

Evolutionary-scale prediction of atomic-level protein structure with a language model

Accurate prediction of protein structures and interactions using a three-track neural network

Transformer protein language models are unsupervised structure learners

PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings:

More Readings:

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings: REASONING & COGNITION

2025 HIGH-IMPACT PAPERS on this topic

More Readings:

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings:

More Readings:

Large Language Models for Mathematical Reasoning: Progresses and Challenges

A Survey of Deep Learning for Mathematical Reasoning

Please click each post's URL shown below to check out its full contents.

Summary of Post :

Required Readings: MEMORY SYSTEMS

2025 HIGH-IMPACT PAPERS on this topic

More Readings:

Please click each post's URL shown below to check out its full contents.