Agent Evaluation
- Notes: Benchmarks for evaluating LLM agents
In this session, our readings cover:
Required Readings: Agent Benchmarking and Benchmarks
- OSWorld Leaderboard: https://os-world.github.io/ (Industry standard for computer-use evaluation)
- WebArena Project: https://webarena.dev/ (Foundational for web agent development)
- AgentBench GitHub: https://github.com/THUDM/AgentBench
- a. Evaluation and Benchmarking of LLM Agents: A Survey (July 2025)
- Link: https://arxiv.org/html/2507.21504v1
- Comprehensive taxonomy: Evaluation objectives (behavior, capabilities, reliability, safety) × evaluation process (interaction modes, datasets, metrics, tooling, environments)
- Enterprise focus: Role-based access control, reliability guarantees, long-term interaction, compliance
- Novel metrics: Consistency (pass@k vs. all-k), robustness under input variations (see the sketch below)
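A minimal sketch of the two metrics named above, assuming the standard unbiased estimators (the survey's exact definitions may differ): pass@k rewards solving a task at least once in k attempts, while the "all-k" consistency score requires every attempt to succeed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts succeeds,
    given c successes observed over n total attempts (unbiased estimator)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """'All-k' consistency: probability that every one of k sampled
    attempts succeeds -- a much stricter reliability measure."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: an agent solves a task on 7 of 10 independent runs.
n, c = 10, 7
for k in (1, 3, 5):
    print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}  all-k={pass_all_k(n, c, k):.3f}")
```

The gap between the two numbers as k grows is the reliability gap the survey highlights: an agent can look strong under pass@k while remaining too inconsistent for enterprise deployment.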
- b. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (April 2024, Major Updates 2025)
- arXiv: https://arxiv.org/abs/2404.07972
- Project: https://os-world.github.io/
- HuggingFace: https://huggingface.co/spaces/xlanglab/OSWorld
- First real computer environment benchmark (Ubuntu, Windows, macOS)
- 369 tasks across real web/desktop apps, file I/O, cross-app workflows
- Execution-based evaluation with custom scripts per task (see the sketch after this entry)
- State-of-the-art results (2025): OpenAI Operator 38%, best open-source ~24%
- Reveals massive gap between current capabilities and human performance
- Industry Impact: Became the standard for evaluating computer-use agents (Claude Computer Use, OpenAI Operator, etc.)
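A hedged sketch of what execution-based evaluation looks like in practice: the checker inspects the resulting machine state (files, app settings) rather than string-matching the agent's output. The task, file name, and expected header below are hypothetical, not OSWorld's actual task schema.

```python
import csv
from pathlib import Path

def evaluate_export_task(workdir: Path) -> float:
    """Score 1.0 if the agent exported the spreadsheet to CSV with the
    expected header row, else 0.0 (binary task success). All names here
    are illustrative, not part of OSWorld's real config format."""
    out = workdir / "report.csv"  # expected artifact (assumed)
    if not out.exists():
        return 0.0
    with out.open(newline="") as f:
        header = next(csv.reader(f), [])
    return 1.0 if header == ["date", "region", "revenue"] else 0.0

if __name__ == "__main__":
    print(evaluate_export_task(Path("/tmp/osworld_task_demo")))  # hypothetical path
```

Because each of the 369 tasks ships its own checker of this kind, evaluation avoids brittle output matching, at the cost of writing and maintaining a custom script per task.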
- c. WebArena: A Realistic Web Environment for Building Autonomous Agents (July 2023, Extensive 2025 Extensions)
- arXiv: https://arxiv.org/abs/2307.13854
- Project: https://webarena.dev/
- 812 templated tasks across e-commerce, forums, code repositories, CMS
- Record performance: IBM CUGA achieved 61.7% (vs. 14% in 2023)
- Extensions:
- WebChoreArena: 532 tedium-focused tasks (top models: 37.8%)
- ST-WebAgentBench: Safety/trust templates, policy compliance metrics
- Key insight: success is driven by a Planner-Executor-Memory architecture plus specialized training data (sketched below)
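A minimal sketch of the Planner-Executor-Memory loop credited for those gains. The class and method names are illustrative placeholders, not IBM CUGA's implementation; in a real agent the planner would call an LLM and the executor would drive a browser.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Rolling trace of steps and observations the planner can condition on."""
    events: list[str] = field(default_factory=list)

    def add(self, event: str) -> None:
        self.events.append(event)

    def summary(self, last_n: int = 5) -> str:
        return "\n".join(self.events[-last_n:])

class Planner:
    def next_step(self, goal: str, memory: Memory) -> str:
        # In a real agent: prompt an LLM with the goal plus a memory summary.
        return f"plan-step given goal={goal!r} and context:\n{memory.summary()}"

class Executor:
    def run(self, step: str) -> str:
        # In a real agent: issue a browser action (click, type, navigate, ...).
        return f"observation after executing {step!r}"

def agent_loop(goal: str, max_steps: int = 3) -> Memory:
    memory, planner, executor = Memory(), Planner(), Executor()
    for _ in range(max_steps):
        step = planner.next_step(goal, memory)
        obs = executor.run(step)
        memory.add(f"STEP: {step}")
        memory.add(f"OBS: {obs}")
    return memory

print(agent_loop("find the cheapest blue backpack").summary())
```

The separation matters for long-horizon web tasks: the planner reasons over a compact memory summary instead of the full page history, and the executor stays a thin, replaceable action layer.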
- d. AgentBench: Evaluating LLMs as Agents (August 2023, Updated 2025)
- Venue: ICLR 2024
- arXiv: https://arxiv.org/abs/2308.03688
- GitHub: https://github.com/THUDM/AgentBench
Comprehensive Coverage:
- 8 environments: Code, game playing, web shopping, digital card games, lateral thinking, household tasks, web browsing, OS interaction
- Multi-dimensional evaluation: breadth across domains reveals agent weak spots (see the aggregation sketch below)
- Function-calling version (2025): Integrated with AgentRL framework
- VisualAgentBench: Extension for multimodal agents (5 environments, 17 LMMs tested)
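A hedged sketch of the kind of cross-environment aggregation AgentBench's breadth enables: run one agent over several heterogeneous suites and report per-domain success rates so weak spots stand out. The runner interface and task lists are assumptions, not AgentBench's actual API; the environment names mirror the list above.

```python
from statistics import mean
from typing import Callable

ENVIRONMENTS = ["code", "game_playing", "web_shopping", "card_game",
                "lateral_thinking", "household", "web_browsing", "os"]

def run_suite(agent: Callable[[str, str], bool],
              tasks: dict[str, list[str]]) -> dict[str, float]:
    """Per-environment success rate for an agent that returns True on success."""
    return {env: mean(agent(env, t) for t in ts) for env, ts in tasks.items()}

if __name__ == "__main__":
    # Toy agent that only "solves" web-shopping tasks, to show how the
    # per-domain breakdown surfaces weak spots.
    toy = lambda env, task: env == "web_shopping"
    tasks = {env: [f"{env}-task-{i}" for i in range(4)] for env in ENVIRONMENTS}
    for env, rate in run_suite(toy, tasks).items():
        print(f"{env:>16}: {rate:.0%}")
```

Reporting the per-domain breakdown rather than a single average is the point: a strong aggregate score can hide near-zero performance in whole categories such as OS interaction or lateral thinking.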
More Readings:
New GenAI simulation and evaluation tools in Azure AI Studio
- https://techcommunity.microsoft.com/blog/aiplatformblog/new-genai-simulation-and-evaluation-tools-in-azure-ai-studio/4253020
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
- Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu
- [Submitted on 7 Dec 2024 (v1), last revised 10 Dec 2024 (this version, v2)]
- The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to their excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at this https URL.
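A hedged sketch of a pointwise LLM-as-judge setup of the kind the survey categorizes: the judge model receives a rubric, the task, and a candidate response, and returns a 1-5 score. The prompt wording and the `call_judge` hook are assumptions for illustration, not a method prescribed by the paper.

```python
import re
from typing import Callable

RUBRIC = "Score 1-5 for correctness, completeness, and clarity. Answer as 'Score: N'."

def judge(task: str, response: str, call_judge: Callable[[str], str]) -> int:
    """Build a judging prompt, send it through any chat-completion client
    passed in as call_judge, and parse the numeric verdict."""
    prompt = (
        f"{RUBRIC}\n\n"
        f"Task:\n{task}\n\n"
        f"Candidate response:\n{response}\n"
    )
    reply = call_judge(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else 0  # 0 = unparseable verdict

# Usage with a stubbed judge, just to show the contract:
print(judge("Summarize the paper.", "It surveys LLM judges.", lambda p: "Score: 4"))
```

In practice the verdict parser and rubric design are where most of the meta-evaluation effort goes, which is exactly the reliability question the survey's Meta-evaluation section addresses.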
Beyond Benchmarks: On The False Promise of AI Regulation
- Gabriel Stanovsky, Renana Keydar, Gadi Perl, Eliya Habba
- [Submitted on 26 Jan 2025]
- The rapid advancement of artificial intelligence (AI) systems in critical domains like healthcare, justice, and social services has sparked numerous regulatory initiatives aimed at ensuring their safe deployment. Current regulatory frameworks, exemplified by recent US and EU efforts, primarily focus on procedural guidelines while presuming that scientific benchmarking can effectively validate AI safety, similar to how crash tests verify vehicle safety or clinical trials validate drug efficacy. However, this approach fundamentally misunderstands the unique technical challenges posed by modern AI systems. Through systematic analysis of successful technology regulation case studies, we demonstrate that effective scientific regulation requires a causal theory linking observable test outcomes to future performance - for instance, how a vehicle's crash resistance at one speed predicts its safety at lower speeds. We show that deep learning models, which learn complex statistical patterns from training data without explicit causal mechanisms, preclude such guarantees. This limitation renders traditional regulatory approaches inadequate for ensuring AI safety. Moving forward, we call for regulators to reckon with this limitation, and propose a preliminary two-tiered regulatory framework that acknowledges these constraints: mandating human oversight for high-risk applications while developing appropriate risk communication strategies for lower-risk uses. Our findings highlight the urgent need to reconsider fundamental assumptions in AI regulation and suggest a concrete path forward for policymakers and researchers.
