Agent Evaluation
- SlideDeck: 2026-SP-W7.3-AgentEvaluationMa.pdf
- Version: current
- Notes: Benchmarks for evaluating LLM agents
In this session, our readings cover:
Required Readings: Agent Benchmarking and Benchmarks
Here are the related slide decks from the previous two course offerings:
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Survey: LLMs and Multimodal FMs | S1-LLM | 24course |
| Survey - FMs in Robotics | W3.2-GenAI-Robotics | 25course |
| Multimodal FMs - Video/Audio | W12.1.25-multimodalGenAI | 25course |
| Domain Centered FMs | W9-T2-domain-LLM | 24course |
Required Readings
- Survey on Evaluation of LLM-based Agents
- [Submitted on 20 Mar 2025]
- Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
- The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.
A few typical benchmark resources:
- OSWorld Leaderboard: https://os-world.github.io/ (Industry standard for computer-use evaluation)
- WebArena Project: https://webarena.dev/ (Foundational for web agent development)
- AgentBench GitHub: https://github.com/THUDM/AgentBench
- a. Evaluation and Benchmarking of LLM Agents: A Survey (July 2025)
- Link: https://arxiv.org/html/2507.21504v1
- Comprehensive taxonomy: Evaluation objectives (behavior, capabilities, reliability, safety) × evaluation process (interaction modes, datasets, metrics, tooling, environments)
- Enterprise focus: Role-based access control, reliability guarantees, long-term interaction, compliance
- Novel metrics: Consistency (pass@k vs all-k), robustness under input variations (see the sketch below)
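To make the consistency metrics above concrete, here is a minimal sketch (not code from the survey itself) of how pass@k and an "all-k" consistency score can be estimated from repeated runs of the same task; `n` is the number of independent runs and `c` the number of successes.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: probability that at least one of k sampled attempts
    succeeds, given c successes observed in n total runs."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def all_k(n: int, c: int, k: int) -> float:
    """"all-k" consistency: probability that *every* one of k sampled
    attempts succeeds -- a much stricter reliability measure."""
    if c < k:
        return 0.0
    return math.comb(c, k) / math.comb(n, k)

# Example: an agent solves a task in 7 of 10 independent runs.
n, c = 10, 7
for k in (1, 3, 5):
    print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}, all-k={all_k(n, c, k):.3f}")
```

The gap between the two numbers grows with k, which is exactly why the survey treats consistency as distinct from raw success rate.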
- b. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (April 2024, Major Updates 2025)
- arXiv: https://arxiv.org/abs/2404.07972
- Project: https://os-world.github.io/
- HuggingFace: https://huggingface.co/spaces/xlanglab/OSWorld
- First real computer environment benchmark (Ubuntu, Windows, macOS)
- 369 tasks across real web/desktop apps, file I/O, cross-app workflows
- Execution-based evaluation with custom verification scripts per task (sketched below)
- State-of-the-art results (2025): OpenAI Operator 38%, best open-source ~24%
- Reveals massive gap between current capabilities and human performance
- Industry Impact: Became the standard for evaluating computer-use agents (Claude Computer Use, OpenAI Operator, etc.)
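The "execution-based evaluation" point deserves an illustration. The sketch below shows the general pattern, assuming a simplified task interface; it is not the actual OSWorld harness or API, and the file-renaming task and stand-in agent are hypothetical.

```python
# Hedged sketch of execution-based task evaluation in the style of OSWorld:
# each task bundles a setup step, an instruction for the agent, and a custom
# verification function that inspects real environment state afterwards.
import os
import tempfile
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    instruction: str
    setup: Callable[[], None]      # prepare files / app state
    verify: Callable[[], bool]     # custom checker run after the agent acts

def evaluate(tasks: list[Task], run_agent: Callable[[str], None]) -> float:
    """Run the agent on each task and score by executing the task's checker."""
    successes = 0
    for task in tasks:
        task.setup()
        run_agent(task.instruction)   # agent interacts with the environment
        if task.verify():             # execution-based check, not string matching
            successes += 1
    return successes / len(tasks)

# Hypothetical task: rename report.txt to report_final.txt in a temp directory.
workdir = tempfile.mkdtemp()
task = Task(
    task_id="rename-file",
    instruction="Rename report.txt to report_final.txt",
    setup=lambda: open(os.path.join(workdir, "report.txt"), "w").close(),
    verify=lambda: os.path.exists(os.path.join(workdir, "report_final.txt")),
)
# Stand-in "agent" that just performs the rename directly.
score = evaluate([task], run_agent=lambda _: os.rename(
    os.path.join(workdir, "report.txt"),
    os.path.join(workdir, "report_final.txt")))
print(f"success rate: {score:.0%}")
```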
- c. WebArena: A Realistic Web Environment for Building Autonomous Agents (July 2023, Extensive 2025 Extensions)
- arXiv: https://arxiv.org/abs/2307.13854
- Project: https://webarena.dev/
- Record performance: IBM CUGA achieved 61.7% (vs 14% in 2023)
- 812 templated tasks across e-commerce, forums, code repositories, CMS
- Extensions:
- WebChoreArena: 532 tedium-focused tasks (top models: 37.8%)
- ST-WebAgentBench: Safety/trust templates, policy compliance metrics
- Key insight: success driven by a Planner-Executor-Memory architecture plus specialized training data (a minimal loop of this kind is sketched below)
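The Planner-Executor-Memory idea fits in a few lines. The following is a speculative sketch of such a loop, not IBM CUGA's or any published system's implementation; `llm` and `act` are placeholder callables standing in for a language model call and a browser action interface.

```python
# Minimal sketch of a Planner-Executor-Memory agent loop (illustrative only).
from typing import Callable

def run_episode(goal: str,
                llm: Callable[[str], str],
                act: Callable[[str], str],
                max_steps: int = 15) -> bool:
    memory: list[str] = []   # running trace of steps and observations
    for _ in range(max_steps):
        # Planner: decide the next high-level step from the goal and memory.
        plan = llm(f"Goal: {goal}\nHistory: {memory}\nNext step (or DONE):")
        if plan.strip() == "DONE":
            return True
        # Executor: translate the step into a concrete browser action.
        action = llm(f"Turn this step into a browser command: {plan}")
        observation = act(action)   # e.g. result of click/type/navigate
        # Memory: record what was attempted and what happened.
        memory.append(f"step={plan} -> obs={observation}")
    return False
```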
- d. AgentBench: Evaluating LLMs as Agents (August 2023, Updated 2025)
- Venue: ICLR 2024
- arXiv: https://arxiv.org/abs/2308.03688
- GitHub: https://github.com/THUDM/AgentBench
- Comprehensive coverage:
- 8 environments: Code, game playing, web shopping, digital card games, lateral thinking, household tasks, web browsing, OS interaction
- Multi-dimensional evaluation: Breadth across domains reveals agent weak spots (see the scoring sketch after this list)
- Function-calling version (2025): Integrated with AgentRL framework
- VisualAgentBench: Extension for multimodal agents (5 environments, 17 LMMs tested)
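A rough sketch of AgentBench-style breadth-first scoring, assuming a hypothetical `agent(env, task) -> bool` interface rather than AgentBench's actual harness: run the same agent over several environment suites and report per-environment success rates so weak spots stand out.

```python
# Illustrative multi-environment scoring in the spirit of AgentBench.
from statistics import mean
from typing import Callable, Dict, List

def score_agent(agent: Callable[[str, str], bool],
                suites: Dict[str, List[str]]) -> Dict[str, float]:
    """Return per-environment success rates plus an overall average."""
    results: Dict[str, float] = {}
    for env_name, tasks in suites.items():
        outcomes = [agent(env_name, task) for task in tasks]
        results[env_name] = mean(outcomes)
    results["overall"] = mean(results[k] for k in results if k != "overall")
    return results

# Hypothetical suites mirroring AgentBench's breadth (OS, web shopping, ...).
suites = {
    "os": ["list files", "find largest log"],
    "web_shopping": ["buy the cheapest USB-C cable"],
    "household": ["put the mug in the sink"],
}
dummy_agent = lambda env, task: env == "os"   # stand-in agent: only "solves" OS tasks
print(score_agent(dummy_agent, suites))
```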
More Readings:
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
- Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu
- [Submitted on 7 Dec 2024 (v1), last revised 10 Dec 2024 (this version, v2)]
- The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to its excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at this https URL.
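As a companion to this survey, here is a minimal, hedged sketch of the LLMs-as-judges pattern it describes: a judge model grades a candidate response against a rubric and the harness parses out a numeric score. The `judge` callable and the rubric wording are illustrative assumptions, not a setup prescribed by the paper.

```python
# Minimal sketch of the LLMs-as-judges pattern (illustrative, not a real API).
import re
from typing import Callable

RUBRIC = (
    "Rate the RESPONSE to the QUESTION on a 1-5 scale for correctness and "
    "helpfulness. Reply with 'Score: <n>' and one sentence of justification."
)

def judge_response(judge: Callable[[str], str], question: str, response: str) -> int:
    """Ask the judge model for a verdict and parse the 1-5 score."""
    verdict = judge(f"{RUBRIC}\n\nQUESTION: {question}\nRESPONSE: {response}")
    match = re.search(r"Score:\s*([1-5])", verdict)
    return int(match.group(1)) if match else 0   # 0 signals an unparseable verdict

# Usage with a stand-in judge that always returns a fixed verdict:
fake_judge = lambda prompt: "Score: 4 - mostly correct, slightly verbose."
print(judge_response(fake_judge, "What is 2+2?", "It is 4."))
```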
