Agent Evaluation
- Notes: Benchmarks for evaluating LLM agents
In this session, our readings cover:
Required Readings: Agent Benchmarking and Benchmarks
- OSWorld Leaderboard: https://os-world.github.io/ (Industry standard for computer-use evaluation)
- WebArena Project: https://webarena.dev/ (Foundational for web agent development)
- AgentBench GitHub: https://github.com/THUDM/AgentBench
- a. Evaluation and Benchmarking of LLM Agents: A Survey (July 2025)
- Link: https://arxiv.org/html/2507.21504v1
- Comprehensive taxonomy: Evaluation objectives (behavior, capabilities, reliability, safety) × evaluation process (interaction modes, datasets, metrics, tooling, environments)
- Enterprise focus: Role-based access control, reliability guarantees, long-term interaction, compliance
- Novel metrics: Consistency (pass@k vs. all-k), robustness under input variations (see the sketch below)
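A minimal sketch of the two metrics named above, assuming the standard unbiased estimators (the survey's exact definitions may differ): pass@k rewards solving a task at least once in k attempts, while the "all-k" consistency score requires every attempt to succeed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts succeeds,
    given c successes observed over n total attempts (unbiased estimator)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """'All-k' consistency: probability that every one of k sampled
    attempts succeeds -- a much stricter reliability measure."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: an agent solves a task on 7 of 10 independent runs.
n, c = 10, 7
for k in (1, 3, 5):
    print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}  all-k={pass_all_k(n, c, k):.3f}")
```

The gap between the two numbers as k grows is the reliability gap the survey highlights: an agent can look strong under pass@k while remaining too inconsistent for enterprise deployment.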
- b. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (April 2024, Major Updates 2025)
- arXiv: https://arxiv.org/abs/2404.07972
- Project: https://os-world.github.io/
- HuggingFace: https://huggingface.co/spaces/xlanglab/OSWorld
- First real computer environment benchmark (Ubuntu, Windows, macOS)
- 369 tasks across real web/desktop apps, file I/O, cross-app workflows
- Execution-based evaluation with custom scripts per task (see the sketch after this entry)
- State-of-the-art results (2025): OpenAI Operator 38%, best open-source ~24%
- Reveals massive gap between current capabilities and human performance
- Industry Impact: Became the standard for evaluating computer-use agents (Claude Computer Use, OpenAI Operator, etc.)
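A hedged sketch of what execution-based evaluation looks like in practice: the checker inspects the resulting machine state (files, app settings) rather than string-matching the agent's output. The task, file name, and expected header below are hypothetical, not OSWorld's actual task schema.

```python
import csv
from pathlib import Path

def evaluate_export_task(workdir: Path) -> float:
    """Score 1.0 if the agent exported the spreadsheet to CSV with the
    expected header row, else 0.0 (binary task success). All names here
    are illustrative, not part of OSWorld's real config format."""
    out = workdir / "report.csv"  # expected artifact (assumed)
    if not out.exists():
        return 0.0
    with out.open(newline="") as f:
        header = next(csv.reader(f), [])
    return 1.0 if header == ["date", "region", "revenue"] else 0.0

if __name__ == "__main__":
    print(evaluate_export_task(Path("/tmp/osworld_task_demo")))  # hypothetical path
```

Because each of the 369 tasks ships its own checker of this kind, evaluation avoids brittle output matching, at the cost of writing and maintaining a custom script per task.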
- c. WebArena: A Realistic Web Environment for Building Autonomous Agents (July 2023, Extensive 2025 Extensions)
- arXiv: https://arxiv.org/abs/2307.13854
- Project: https://webarena.dev/
- 812 templated tasks across e-commerce, forums, code repositories, CMS
- Record performance: IBM CUGA achieved 61.7% (vs. 14% in 2023)
- Extensions:
- WebChoreArena: 532 tedium-focused tasks (top models: 37.8%)
- ST-WebAgentBench: Safety/trust templates, policy compliance metrics
- Key insight: success is driven by a Planner-Executor-Memory architecture plus specialized training data (sketched below)
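A minimal sketch of the Planner-Executor-Memory loop credited for those gains. The class and method names are illustrative placeholders, not IBM CUGA's implementation; in a real agent the planner would call an LLM and the executor would drive a browser.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Rolling trace of steps and observations the planner can condition on."""
    events: list[str] = field(default_factory=list)

    def add(self, event: str) -> None:
        self.events.append(event)

    def summary(self, last_n: int = 5) -> str:
        return "\n".join(self.events[-last_n:])

class Planner:
    def next_step(self, goal: str, memory: Memory) -> str:
        # In a real agent: prompt an LLM with the goal plus a memory summary.
        return f"plan-step given goal={goal!r} and context:\n{memory.summary()}"

class Executor:
    def run(self, step: str) -> str:
        # In a real agent: issue a browser action (click, type, navigate, ...).
        return f"observation after executing {step!r}"

def agent_loop(goal: str, max_steps: int = 3) -> Memory:
    memory, planner, executor = Memory(), Planner(), Executor()
    for _ in range(max_steps):
        step = planner.next_step(goal, memory)
        obs = executor.run(step)
        memory.add(f"STEP: {step}")
        memory.add(f"OBS: {obs}")
    return memory

print(agent_loop("find the cheapest blue backpack").summary())
```

The separation matters for long-horizon web tasks: the planner reasons over a compact memory summary instead of the full page history, and the executor stays a thin, replaceable action layer.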
- d. AgentBench: Evaluating LLMs as Agents (August 2023, Updated 2025)
- Venue: ICLR 2024
- arXiv: https://arxiv.org/abs/2308.03688
- GitHub: https://github.com/THUDM/AgentBench
Comprehensive Coverage:
- 8 environments: Code, game playing, web shopping, digital card games, lateral thinking, household tasks, web browsing, OS interaction
- Multi-dimensional evaluation: breadth across domains reveals agent weak spots (see the aggregation sketch below)
- Function-calling version (2025): Integrated with AgentRL framework
- VisualAgentBench: Extension for multimodal agents (5 environments, 17 LMMs tested)
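A hedged sketch of the kind of cross-environment aggregation AgentBench's breadth enables: run one agent over several heterogeneous suites and report per-domain success rates so weak spots stand out. The runner interface and task lists are assumptions, not AgentBench's actual API; the environment names mirror the list above.

```python
from statistics import mean
from typing import Callable

ENVIRONMENTS = ["code", "game_playing", "web_shopping", "card_game",
                "lateral_thinking", "household", "web_browsing", "os"]

def run_suite(agent: Callable[[str, str], bool],
              tasks: dict[str, list[str]]) -> dict[str, float]:
    """Per-environment success rate for an agent that returns True on success."""
    return {env: mean(agent(env, t) for t in ts) for env, ts in tasks.items()}

if __name__ == "__main__":
    # Toy agent that only "solves" web-shopping tasks, to show how the
    # per-domain breakdown surfaces weak spots.
    toy = lambda env, task: env == "web_shopping"
    tasks = {env: [f"{env}-task-{i}" for i in range(4)] for env in ENVIRONMENTS}
    for env, rate in run_suite(toy, tasks).items():
        print(f"{env:>16}: {rate:.0%}")
```

Reporting the per-domain breakdown rather than a single average is the point: a strong aggregate score can hide near-zero performance in whole categories such as OS interaction or lateral thinking.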
More Readings:
New GenAI simulation and evaluation tools in Azure AI Studio
- https://techcommunity.microsoft.com/blog/aiplatformblog/new-genai-simulation-and-evaluation-tools-in-azure-ai-studio/4253020
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
- Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu
- [Submitted on 7 Dec 2024 (v1), last revised 10 Dec 2024 (this version, v2)]
- The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to their excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at this https URL.
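A hedged sketch of a pointwise LLM-as-judge setup of the kind the survey categorizes: the judge model receives a rubric, the task, and a candidate response, and returns a 1-5 score. The prompt wording and the `call_judge` hook are assumptions for illustration, not a method prescribed by the paper.

```python
import re
from typing import Callable

RUBRIC = "Score 1-5 for correctness, completeness, and clarity. Answer as 'Score: N'."

def judge(task: str, response: str, call_judge: Callable[[str], str]) -> int:
    """Build a judging prompt, send it through any chat-completion client
    passed in as call_judge, and parse the numeric verdict."""
    prompt = (
        f"{RUBRIC}\n\n"
        f"Task:\n{task}\n\n"
        f"Candidate response:\n{response}\n"
    )
    reply = call_judge(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else 0  # 0 = unparseable verdict

# Usage with a stubbed judge, just to show the contract:
print(judge("Summarize the paper.", "It surveys LLM judges.", lambda p: "Score: 4"))
```

In practice the verdict parser and rubric design are where most of the meta-evaluation effort goes, which is exactly the reliability question the survey's Meta-evaluation section addresses.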
Beyond Benchmarks: On The False Promise of AI Regulation
- Gabriel Stanovsky, Renana Keydar, Gadi Perl, Eliya Habba
- [Submitted on 26 Jan 2025]
- The rapid advancement of artificial intelligence (AI) systems in critical domains like healthcare, justice, and social services has sparked numerous regulatory initiatives aimed at ensuring their safe deployment. Current regulatory frameworks, exemplified by recent US and EU efforts, primarily focus on procedural guidelines while presuming that scientific benchmarking can effectively validate AI safety, similar to how crash tests verify vehicle safety or clinical trials validate drug efficacy. However, this approach fundamentally misunderstands the unique technical challenges posed by modern AI systems. Through systematic analysis of successful technology regulation case studies, we demonstrate that effective scientific regulation requires a causal theory linking observable test outcomes to future performance - for instance, how a vehicle's crash resistance at one speed predicts its safety at lower speeds. We show that deep learning models, which learn complex statistical patterns from training data without explicit causal mechanisms, preclude such guarantees. This limitation renders traditional regulatory approaches inadequate for ensuring AI safety. Moving forward, we call for regulators to reckon with this limitation, and propose a preliminary two-tiered regulatory framework that acknowledges these constraints: mandating human oversight for high-risk applications while developing appropriate risk communication strategies for lower-risk uses. Our findings highlight the urgent need to reconsider fundamental assumptions in AI regulation and suggest a concrete path forward for policymakers and researchers.
