Agent Evaluation

Benchmarks

In this session, our readings cover:

Required Readings: Agent Benchmarking and Benchmarks

More Readings:

New GenAI simulation and evaluation tools in Azure AI Studio

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Beyond Benchmarks: On The False Promise of AI Regulation