Evaluating LLMs and LLM Systems

I designed and led this technical workshop series to help customer teams tackle one of today's core challenges in AI: how to effectively evaluate large language models (LLMs) when traditional machine learning metrics no longer apply. As LLMs become more accessible, the risks of deploying them without rigorous evaluation have only grown. I created this workshop to equip both engineers and technical leaders with a practical framework for measuring LLM performance, managing risk, and building user trust.
The workshop starts by unpacking why LLM evaluation is so different from, and so much harder than, evaluation in classic ML. Generative models rarely have a single "right" answer, so traditional accuracy metrics break down. We explore the new pitfalls this creates: evaluation methods are surprisingly brittle, often inconsistent, and highly sensitive to minor changes in prompts or implementation.
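To make this concrete, here is a minimal sketch (the answers and normalization rules are illustrative assumptions, not material from the workshop) of how the same three generative answers score very differently under strict exact match versus a slightly more forgiving normalized comparison:

```python
# Why classic accuracy is brittle for generative output: the metric you pick
# changes the headline number. All data below is made up for illustration.
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


# Hypothetical model answers to "Who wrote Pride and Prejudice?"
reference = "Jane Austen"
predictions = [
    "Jane Austen",
    "The novel was written by Jane Austen.",
    "It was Jane Austen who wrote it.",
]

strict = [p == reference for p in predictions]
lenient = [normalize(reference) in normalize(p) for p in predictions]

print(f"exact match:            {sum(strict) / len(strict):.2f}")   # 0.33
print(f"normalized containment: {sum(lenient) / len(lenient):.2f}")  # 1.00
```

All three answers are arguably correct, yet exact match rejects two of them, and a different normalization choice would shift the score again; this is the brittleness the workshop digs into.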
Through real-world case studies and hands-on examples, I walk attendees through the four main approaches in use today. First, we examine public benchmarks, discussing how to interpret leaderboard results and avoid common traps. Then, we explore functional correctness—highlighting where automated, test-based evaluation works best (such as code generation) and where its limits lie. Human evaluation is covered in depth as well, focusing on why it remains essential despite cost and subjectivity, and how to scale it sensibly. Finally, we delve into model-based evaluation, illustrating how "LLM-as-a-judge" techniques can enable rapid iteration, while also requiring a careful understanding of their unique biases and failure modes.
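As an illustration of the functional-correctness approach, the sketch below runs a hypothetical model-generated function against a handful of unit tests. The sample output and test cases are assumptions for the example, and a production harness should sandbox untrusted code rather than calling `exec` directly.

```python
# Functional-correctness evaluation for code generation: execute the model's
# code and check it against unit tests. Demo only; sandbox in production.
from typing import Callable


def passes_tests(code: str, func_name: str, tests: list[tuple[tuple, object]]) -> bool:
    """Execute the generated code and return True if every test case passes."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        func: Callable = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


# Hypothetical model output for the task "write is_palindrome(s)".
generated = """
def is_palindrome(s):
    s = ''.join(c.lower() for c in s if c.isalnum())
    return s == s[::-1]
"""

tests = [
    (("racecar",), True),
    (("A man, a plan, a canal: Panama",), True),
    (("hello",), False),
]
print(passes_tests(generated, "is_palindrome", tests))  # True
```

And as a sketch of the LLM-as-a-judge pattern, the snippet below scores a pair of answers twice with their positions swapped and only accepts a verdict that is consistent across both orderings, one simple guard against position bias. Here `call_llm` is a hypothetical placeholder for whichever client your stack uses, and the judge prompt wording is illustrative.

```python
# Pairwise LLM-as-a-judge with a position swap to detect position bias.
JUDGE_PROMPT = """You are grading two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one letter: A or B (whichever answer is better)."""


def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Ask the judge twice with the answers swapped; treat inconsistent
    verdicts as a tie rather than trusting either ordering."""
    first = call_llm(JUDGE_PROMPT.format(question=question, a=answer_1, b=answer_2)).strip()
    second = call_llm(JUDGE_PROMPT.format(question=question, a=answer_2, b=answer_1)).strip()
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # inconsistency usually signals position bias or a genuine toss-up
```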
A key section of the workshop (covered in the slides) dives into evaluating LLM systems on your own data—including Retrieval-Augmented Generation (RAG) pipelines—with concrete guidance on designing evaluation datasets, selecting meaningful metrics, and balancing automation with human-in-the-loop review.
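As one example of the kind of metric discussed there, the sketch below computes a retrieval-side hit rate over a small hand-labeled evaluation set. The dataset format, document ids, and toy retriever are illustrative assumptions, not something prescribed by the workshop.

```python
# Retrieval hit rate @ k for a RAG pipeline: the fraction of questions for
# which at least one labeled-relevant document appears in the top-k results.
from typing import Callable


def hit_rate_at_k(
    eval_set: list[dict],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Score a retriever against hand-labeled (question, relevant ids) pairs."""
    hits = 0
    for item in eval_set:
        retrieved = set(retrieve(item["question"], k))
        if retrieved & set(item["relevant_ids"]):
            hits += 1
    return hits / len(eval_set)


# Illustrative usage with a two-item evaluation set and a toy retriever.
eval_set = [
    {"question": "What is our refund window?", "relevant_ids": ["policy-12"]},
    {"question": "How do I reset my password?", "relevant_ids": ["kb-auth-3"]},
]
toy_index = {
    "What is our refund window?": ["policy-12", "policy-07"],
    "How do I reset my password?": ["kb-sso-1", "kb-billing-9"],
}


def retrieve(query: str, k: int) -> list[str]:
    return toy_index[query][:k]


print(hit_rate_at_k(eval_set, retrieve, k=5))  # 0.5
```

Metrics like this cover only the retrieval half of the pipeline; the generated answer itself still needs human or model-based review, which is where the human-in-the-loop guidance comes in.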
This series balances theory and pragmatism: it's about establishing robust, repeatable evaluation pipelines that fit the messy realities of modern LLM development. My aim was to demystify evaluation, set realistic expectations, and provide actionable steps for teams deploying LLM-powered applications, whether for search, QA, data extraction, or other use cases.