Agent Evaluation

Benchmarks

In this session, our readings cover:

Required Readings: Agent Benchmarking and Benchmarks

Topic | Slide Deck | Previous Semester
Survey: LLMs and Multimodal FMs | S1-LLM | 24course
Survey - FMs in Robotics | W3.2-GenAI-Robotics | 25course
Multimodal FMs - Video/Audio | W12.1.25-multimodalGenAI | 25course
Domain Centered FMs | W9-T2-domain-LLM | 24course

Required Readings

A few typical

More Readings:

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods