Agent - In Healthcare

SlideDeck: W9.1-HealthAI-agenticHealth
Version: current
Lead team: TA
Notes: Health AI / Agents

Applications

In this session, our readings cover:

Required Readings:

The rise of agentic AI teammates in medicine

Perspectives Digital medicine Volume 405, Issue 10477 vp457 February 08, 2025
James Zou jamesz@stanford.edu ∙ Eric J Topol
Medicine is in the dawn of a fundamental shift from using artificial intelligence (AI) as tools to deploying AI as agents. When used as a tool, AI is passive and reactive. Even powerful medical AI foundation models today remain tools that depend on human users to provide input and context, interpret their output, and take follow-up steps. To fully unlock AI’s potential in medicine, clinicians need to make the key conceptual shift from using AI as sophisticated calculators to embracing AI as health-care teammates.

Lab-in-the-loop therapeutic antibody design with deep learning

https://doi.org/10.1101/2025.02.19.639050
Therapeutic antibody design is a complex multi-property optimization problem that traditionally relies on expensive search through sequence space. Here, we introduce “Lab-in-the-loop,” a paradigm shift for antibody design that orchestrates generative machine learning models, multi-task property predictors, active learning ranking and selection, and in vitro experimentation in a semiautonomous, iterative optimization loop. By automating the design of antibody variants, property prediction, ranking and selection of designs to assay in the lab, and ingestion of in vitro data, we enable a holistic, end-to-end approach to antibody optimization. We apply lab-in-the-loop to four clinically relevant antigen targets: EGFR, IL-6, HER2, and OSM. Over 1,800 unique antibody variants are designed and tested, derived from lead molecule candidates obtained via animal immunization and state-of-the-art immune repertoire mining techniques. Four lead candidate and four design crystal structures are solved to reveal mechanistic insights into the effects of mutations. We perform four rounds of iterative optimization and report 3–100× better binding variants for every target and ten candidate lead molecules, with the best binders in a therapeutically relevant 100 pM range.
All authors are or were employees of Genentech Inc. (a member of the Roche Group) or Roche, and may hold Roche stock or related interests.

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI’s o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1’s performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.

A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions

Lei Liu, Xiaoyan Yang, Junchi Lei, Yue Shen, Jian Wang, Peng Wei, Zhixuan Chu, Zhan Qin, Kui Ren
With the advent of Large Language Models (LLMs), medical artificial intelligence (AI) has experienced substantial technological progress and paradigm shifts, highlighting the potential of LLMs to streamline healthcare delivery and improve patient outcomes. Considering this rapid technical progress, in this survey, we trace the recent advances of Medical Large Language Models (Med-LLMs), including the background, key findings, and mainstream techniques, especially for the evolution from general-purpose models to medical-specialized applications. Firstly, we delve into the foundational technology of Med-LLMs, indicating how general models can be progressively adapted and refined for the complicated medical tasks. Secondly, the wide-ranging applications of Med-LLMs are investigated across various healthcare domains, as well as an up-to-date review of existing Med-LLMs. The transformative impact of these models on daily medical practice is evident through their ability to assist clinicians, educators, and patients. Recognizing the importance of responsible innovation, we discuss the challenges associated with ensuring fairness, accountability, privacy, and robustness. Ethical considerations, rigorous evaluation methodologies, and the establishment of regulatory frameworks are crucial for building trustworthiness in the real-world system. We emphasize the need for ongoing scrutiny and development to maintain high standards of safety and reliability. Finally, we anticipate possible future trajectories for Med-LLMs, identifying key avenues for prudent expansion. By consolidating these insights, our review aims to provide professionals and researchers with a thorough understanding of the strengths and limitations of Med-LLMs, fostering a balanced and ethical approach to their integration into the healthcare ecosystem.

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria
The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to freetext queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provide a comprehensive investigation from perspectives of both computer science and Healthcare specialty. Besides the discussion about Healthcare concerns, we supports the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks in the Github. Summarily, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to data-centered methodologies. Also, we determine that the biggest obstacle of using LLMs in Healthcare are fairness, accountability, transparency and ethics.

2025 Spring UVA CS - GenAI-Overview

Agent - In Healthcare

Required Readings:

The rise of agentic AI teammates in medicine

Lab-in-the-loop therapeutic antibody design with deep learning

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

More Readings:

MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes

A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery