Agents in Life Science
- SlideDeck: 2026-SP-W3.3-Team06-BioinfomaticsAgents.pdf
- Version: current
- Notes: Bioinf Agents
In this session, our readings cover:
Required Readings:
The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation
- Kyle Swanson1, Wesley Wu2, Nash L. Bulaong2, John E. Pak2,4, James Zou1,2,3,4
- Correspondence: jamesz@stanford.edu, john.pak@czbiohub.org
- Science frequently benefits from teams of interdisciplinary researchers. However, most scientists don’t have access to experts from multiple fields. Fortunately, large language models (LLMs) have recently shown an impressive ability to aid researchers across diverse domains by answering scientific questions. Here, we expand the capabilities of LLMs for science by introducing the Virtual Lab, an AI-human research collaboration to perform sophisticated, interdisciplinary science research. The Virtual Lab consists of an LLM principal investigator agent guiding a team of LLM agents with different scientific backgrounds (e.g., a chemist agent, a computer scientist agent, a critic agent), with a human researcher providing high-level feedback. We design the Virtual Lab to conduct scientific research through a series of team meetings, where all the agents discuss a scientific agenda, and individual meetings, where an agent accomplishes a specific task. We demonstrate the power of the Virtual Lab by applying it to design nanobody binders to recent variants of SARS-CoV-2, which is a challenging, open-ended research problem that requires reasoning across diverse fields from biology to computer science. The Virtual Lab creates a novel computational nanobody design pipeline that incorporates ESM, AlphaFold-Multimer, and Rosetta and designs 92 new nanobodies. Experimental validation of those designs reveals a range of functional nanobodies with promising binding profiles across SARS-CoV-2 variants. In particular, two new nanobodies exhibit improved binding to the recent JN.1 or KP.3 variants of SARS-CoV-2 while maintaining strong binding to the ancestral viral spike protein, suggesting exciting candidates for further investigation. This demonstrates the ability of the Virtual Lab to rapidly make impactful, real-world scientific discovery.
Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
- ICLR 25
- Kai Zhuang, Jiawei Zhang, Yumou Liu, Hanqun Cao, Chunbin Gu, Mengdi Liu, Zhangyang Gao, Zitong Jerry Wang, Xuanhe Zhou, Pheng-Ann Heng, Lijun Wu, Conghui He, Cheng Tan
- Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of functional motif information, or as a separate modality, introducing formidable alignment challenges, current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, the inclusion of the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at this https URL.
CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis
- Yihang Xiao, Jinyi Liu, YAN ZHENG, Shaoqing Jiao, Jianye HAO, Xiaohan Xie, Limingzhi, Ruitao Wang, Fei Ni, Yuxiao Li, Zhen Wang, Xuequn Shang, Zhijie Bao, Changxiao Yang, Jiajie Peng
- ICLR 2026 Poster
- Keywords: Large Language Models, LLM Agent, Single-cell RNA sequencing, Spatial transcriptomics
- Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) data analysis are pivotal for advancing biological research, enabling precise characterization of cellular heterogeneity. However, existing analysis approaches require extensive manual programming and complex tool integration, posing significant challenges for researchers. To address this, we introduce CellAgent, an autonomous, LLM-driven approach that performs end-to-end scRNA-seq and spatial transcriptomics data analysis through natural language interactions. CellAgent employs a multi-agent hierarchical decision-making framework, simulating a “deep-thinking” workflow to ensure that analytical steps are logically coherent and aligned with the overarching research goal. To further enhance its capabilities, we develop sc-Omni, a high-performance, expert-curated toolkit that consolidates essential tools for scRNA-seq and spatial transcriptomics analysis. Additionally, we introduce a self-reflective optimization mechanism, enabling automated, iterative refinement of results through specialized evaluation methods, effectively replacing traditional manual assessments. Benchmarking against human experts demonstrates that CellAgent achieves significant improvement in efficiency across multiple downstream applications while maintaining excellent performance comparable to existing approaches and preserving natural language interactions. By translating high-level scientific questions into optimized computational workflows, CellAgent represents a step toward a new, more accessible paradigm in bioinformatics, allowing researchers to perform complex data analyses autonomously. In lowering technical barriers, CellAgent serves to advance the democratization of the scientific discovery process in genomics.
More Readings:
Key presentations and publications of the TDC Commons
- Zero-shot Prediction of Therapeutic Use with Geometric Deep Learning and Clinician Centered Design, MedRxiv, 2023 [Paper] [TxGNN Explorer]
- Artificial Intelligence Foundation for Therapeutic Science, Nature Chemical Biology, 2022 [Paper] Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development, NeurIPS, 2021 [Paper] [Poster]
- Benchmarking Molecular Machine Learning in Therapeutics Data Commons, ELLIS ML4Molecules, 2021 [Paper] [Slides] Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development, Baylearn, 2021 [Slides] [Poster]
- Therapeutics Data Commons, NSF-Harvard Symposium on Drugs for Future Pandemics, 2020 [#futuretx20] [Slides] [Video] TDC User Group Meetup, 2022 [Video] [Agenda]
- Machine Learning to Translate the Cancer Genome and Epigenome Session, AACR Annual Meeting, 2022
- Few-Shot Learning for Network Biology, KDD Workshop on Data Mining in Bioinformatics, 2021 Actionable machine learning for drug discovery and development, Broad Institute, Models, Inference & Algorithms Seminar, 2021
- Graph Neural Networks for Biomedical Data, Machine Learning in Computational Biology, 2020
- Graph Neural Networks for Identifying COVID-19 Drug Repurposing Opportunities, MIT AI Cures, 2020
Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling
- Jesse Zhang, Airol A Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, Aidan Winters, Umair Khan, Matthew G. Jones, Vuong Tran, Joseph Pangallo, Efthymia Papalexi, Ajay Sapre, Hoai Nguyen, Oliver Sanderson, Maria Nigos, Olivia Kaplan, Sarah Schroeder, Bryan Hariadi, Simone Marrujo, Crina Curca Alec Salvino, Guillermo Gallareta Olivares, Ryan Koehler, Gary Geiss, Alexander Rosenberg, Charles Roco, Daniele Merico, Nima Alidoust, View ORCID ProfileHani Goodarzi, View ORCID ProfileJohnny Yu
- doi: https://doi.org/10.1101/2025.02.20.639398
- Building predictive models of the cell requires systematically mapping how perturbations reshape each cell’s state, function, and behavior. Here, we present Tahoe-100M, a giga-scale single-cell atlas of 100 million transcriptomic profiles measuring how each of 1,100 small-molecule perturbations impact cells across 50 cancer cell lines. Our high-throughput Mosaic platform, composed of a highly diverse and optimally balanced “cell village”, reduces batch effects and enables parallel profiling of thousands of conditions at single-cell resolution at an unprecedented scale. As the largest single-cell dataset to date, Tahoe-100M enables artificial-intelligence (AI)-driven models to learn context-dependent functions, capturing fundamental principles of gene regulation and network dynamics. Although we leverage cancer models and pharmacological compounds to create this resource, Tahoe-100M is fundamentally designed as a broadly applicable perturbation atlas and supports deeper insights into cell biology across multiple tissues and contexts. By publicly releasing this atlas, we aim to accelerate the creation and development of robust AI frameworks for systems biology, ultimately improving our ability to predict and manipulate cellular behaviors across a wide range of applications.
