Agent Evaluation
- SlideDeck: 2026-SP-W7.3-AgentEvaluationMa.pdf
- Version: current
- Notes: Benchmarks for evaluating LLM agents
In this session, our readings cover:
Required Readings: Agent Benchmarking and Benchmarks
Here are the related slide decks from the previous two course offerings:
| Topic | Slide Deck | Previous Semester |
|---|---|---|
| Survey: LLMs and Multimodal FMs | S1-LLM | 24course |
| Survey - FMs in Robotics | W3.2-GenAI-Robotics | 25course |
| Multimodal FMs - Video/Audio | W12.1.25-multimodalGenAI | 25course |
| Domain Centered FMs | W9-T2-domain-LLM | 24course |
Required Readings
- Survey on Evaluation of LLM-based Agents
- [Submitted on 20 Mar 2025]
- Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
- The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.
A few typical benchmark resources:
- OSWorld Leaderboard: https://os-world.github.io/ (Industry standard for computer-use evaluation)
- WebArena Project: https://webarena.dev/ (Foundational for web agent development)
- AgentBench GitHub: https://github.com/THUDM/AgentBench
- a. Evaluation and Benchmarking of LLM Agents: A Survey (July 2025)
- Link: https://arxiv.org/html/2507.21504v1
- Comprehensive taxonomy: Evaluation objectives (behavior, capabilities, reliability, safety) × evaluation process (interaction modes, datasets, metrics, tooling, environments)
- Enterprise focus: Role-based access control, reliability guarantees, long-term interaction, compliance
- Novel metrics: Consistency (pass@k vs all-k), robustness under input variations (see the sketch below)
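To make the consistency metrics above concrete, here is a minimal sketch (not code from the survey itself) of how pass@k and an "all-k" consistency score can be estimated from repeated runs of the same task; `n` is the number of independent runs and `c` the number of successes.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: probability that at least one of k sampled attempts
    succeeds, given c successes observed in n total runs."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def all_k(n: int, c: int, k: int) -> float:
    """"all-k" consistency: probability that *every* one of k sampled
    attempts succeeds -- a much stricter reliability measure."""
    if c < k:
        return 0.0
    return math.comb(c, k) / math.comb(n, k)

# Example: an agent solves a task in 7 of 10 independent runs.
n, c = 10, 7
for k in (1, 3, 5):
    print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}, all-k={all_k(n, c, k):.3f}")
```

The gap between the two numbers grows with k, which is exactly why the survey treats consistency as distinct from raw success rate.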
- b. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (April 2024, Major Updates 2025)
- arXiv: https://arxiv.org/abs/2404.07972
- Project: https://os-world.github.io/
- HuggingFace: https://huggingface.co/spaces/xlanglab/OSWorld
- First real computer environment benchmark (Ubuntu, Windows, macOS)
- 369 tasks across real web/desktop apps, file I/O, cross-app workflows
- Execution-based evaluation with custom verification scripts per task (sketched below)
- State-of-the-art results (2025): OpenAI Operator 38%, best open-source ~24%
- Reveals massive gap between current capabilities and human performance
- Industry Impact: Became the standard for evaluating computer-use agents (Claude Computer Use, OpenAI Operator, etc.)
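The "execution-based evaluation" point deserves an illustration. The sketch below shows the general pattern, assuming a simplified task interface; it is not the actual OSWorld harness or API, and the file-renaming task and stand-in agent are hypothetical.

```python
# Hedged sketch of execution-based task evaluation in the style of OSWorld:
# each task bundles a setup step, an instruction for the agent, and a custom
# verification function that inspects real environment state afterwards.
import os
import tempfile
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    instruction: str
    setup: Callable[[], None]      # prepare files / app state
    verify: Callable[[], bool]     # custom checker run after the agent acts

def evaluate(tasks: list[Task], run_agent: Callable[[str], None]) -> float:
    """Run the agent on each task and score by executing the task's checker."""
    successes = 0
    for task in tasks:
        task.setup()
        run_agent(task.instruction)   # agent interacts with the environment
        if task.verify():             # execution-based check, not string matching
            successes += 1
    return successes / len(tasks)

# Hypothetical task: rename report.txt to report_final.txt in a temp directory.
workdir = tempfile.mkdtemp()
task = Task(
    task_id="rename-file",
    instruction="Rename report.txt to report_final.txt",
    setup=lambda: open(os.path.join(workdir, "report.txt"), "w").close(),
    verify=lambda: os.path.exists(os.path.join(workdir, "report_final.txt")),
)
# Stand-in "agent" that just performs the rename directly.
score = evaluate([task], run_agent=lambda _: os.rename(
    os.path.join(workdir, "report.txt"),
    os.path.join(workdir, "report_final.txt")))
print(f"success rate: {score:.0%}")
```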
- c. WebArena: A Realistic Web Environment for Building Autonomous Agents (July 2023, Extensive 2025 Extensions)
- arXiv: https://arxiv.org/abs/2307.13854
- Project: https://webarena.dev/
- Record performance: IBM CUGA achieved 61.7% (vs 14% in 2023)
- 812 templated tasks across e-commerce, forums, code repositories, CMS
- Extensions:
- WebChoreArena: 532 tedium-focused tasks (top models: 37.8%)
- ST-WebAgentBench: Safety/trust templates, policy compliance metrics
- Key insight: success driven by a Planner-Executor-Memory architecture plus specialized training data (a minimal loop of this kind is sketched below)
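The Planner-Executor-Memory idea fits in a few lines. The following is a speculative sketch of such a loop, not IBM CUGA's or any published system's implementation; `llm` and `act` are placeholder callables standing in for a language model call and a browser action interface.

```python
# Minimal sketch of a Planner-Executor-Memory agent loop (illustrative only).
from typing import Callable

def run_episode(goal: str,
                llm: Callable[[str], str],
                act: Callable[[str], str],
                max_steps: int = 15) -> bool:
    memory: list[str] = []   # running trace of steps and observations
    for _ in range(max_steps):
        # Planner: decide the next high-level step from the goal and memory.
        plan = llm(f"Goal: {goal}\nHistory: {memory}\nNext step (or DONE):")
        if plan.strip() == "DONE":
            return True
        # Executor: translate the step into a concrete browser action.
        action = llm(f"Turn this step into a browser command: {plan}")
        observation = act(action)   # e.g. result of click/type/navigate
        # Memory: record what was attempted and what happened.
        memory.append(f"step={plan} -> obs={observation}")
    return False
```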
- d. AgentBench: Evaluating LLMs as Agents (August 2023, Updated 2025)
- Venue: ICLR 2024
- arXiv: https://arxiv.org/abs/2308.03688
- GitHub: https://github.com/THUDM/AgentBench
- Comprehensive coverage:
- 8 environments: Code, game playing, web shopping, digital card games, lateral thinking, household tasks, web browsing, OS interaction
- Multi-dimensional evaluation: Breadth across domains reveals agent weak spots (see the scoring sketch after this list)
- Function-calling version (2025): Integrated with AgentRL framework
- VisualAgentBench: Extension for multimodal agents (5 environments, 17 LMMs tested)
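A rough sketch of AgentBench-style breadth-first scoring, assuming a hypothetical `agent(env, task) -> bool` interface rather than AgentBench's actual harness: run the same agent over several environment suites and report per-environment success rates so weak spots stand out.

```python
# Illustrative multi-environment scoring in the spirit of AgentBench.
from statistics import mean
from typing import Callable, Dict, List

def score_agent(agent: Callable[[str, str], bool],
                suites: Dict[str, List[str]]) -> Dict[str, float]:
    """Return per-environment success rates plus an overall average."""
    results: Dict[str, float] = {}
    for env_name, tasks in suites.items():
        outcomes = [agent(env_name, task) for task in tasks]
        results[env_name] = mean(outcomes)
    results["overall"] = mean(results[k] for k in results if k != "overall")
    return results

# Hypothetical suites mirroring AgentBench's breadth (OS, web shopping, ...).
suites = {
    "os": ["list files", "find largest log"],
    "web_shopping": ["buy the cheapest USB-C cable"],
    "household": ["put the mug in the sink"],
}
dummy_agent = lambda env, task: env == "os"   # stand-in agent: only "solves" OS tasks
print(score_agent(dummy_agent, suites))
```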
More Readings:
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
- Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu
- [Submitted on 7 Dec 2024 (v1), last revised 10 Dec 2024 (this version, v2)]
- The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to its excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at this https URL.
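As a companion to this survey, here is a minimal, hedged sketch of the LLMs-as-judges pattern it describes: a judge model grades a candidate response against a rubric and the harness parses out a numeric score. The `judge` callable and the rubric wording are illustrative assumptions, not a setup prescribed by the paper.

```python
# Minimal sketch of the LLMs-as-judges pattern (illustrative, not a real API).
import re
from typing import Callable

RUBRIC = (
    "Rate the RESPONSE to the QUESTION on a 1-5 scale for correctness and "
    "helpfulness. Reply with 'Score: <n>' and one sentence of justification."
)

def judge_response(judge: Callable[[str], str], question: str, response: str) -> int:
    """Ask the judge model for a verdict and parse the 1-5 score."""
    verdict = judge(f"{RUBRIC}\n\nQUESTION: {question}\nRESPONSE: {response}")
    match = re.search(r"Score:\s*([1-5])", verdict)
    return int(match.group(1)) if match else 0   # 0 signals an unparseable verdict

# Usage with a stand-in judge that always returns a fixed verdict:
fake_judge = lambda prompt: "Score: 4 - mostly correct, slightly verbose."
print(judge_response(fake_judge, "What is 2+2?", "It is 4."))
```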
