A Synthesis of LLM Evaluation
[ llm evals ai-agents testing python ] · 15 min read

I have been reading a ton about LLM evaluation practices over the past few weeks — from Anthropic’s engineering blog, Hamel Husain’s practitioner-focused guides, the Evals for AI Engineers book by Shreya Shankar and Hamel Husain, and several eval framework docs. I wanted to write down what I learned and synthesize my understanding of the topic.

This is in no way comprehensive, rather it’s my personal reference piece on the topic.

TL;DR:

Why Evals Matter

Evals are the testing mechanism for LLM-powered applications. Without evals, every change to a prompt, model, or retrieval pipeline could improve one thing and quietly break three others. Evaluation provides a concrete way to establish a baseline and measure system efficacy and reliability.

This is similar to having tests in traditional production software systems, naturally evolved to cater to the non-deterministic nature of LLMs.

Terminology Quick Reference

Different frameworks, docs, and articles use different names for the same concepts, which tripped me up during research. Here’s my attempt to reconcile them:

| Concept | Anthropic | pydantic-evals | OpenAI |
|---|---|---|---|
| Single test scenario | Task | Case | Test case |
| Collection of test cases | Dataset / Eval Suite | Dataset | Eval |
| One execution attempt | Trial | single run of evaluate | Run |
| Scoring logic | Grader | Evaluator | Grader |
| Execution record | Transcript / Trace | span_tree | Trace |
| Known correct answer | Reference Solution | expected_output | Expected |
| Run infrastructure | Eval Harness | dataset.evaluate_sync() | Evals Platform |

Key differences: Anthropic emphasizes running the same task multiple times (pass@k) and trial isolation. Pydantic-evals runs each case once by default and provides span_tree via OpenTelemetry for trajectory access. Isolation depends on your implementation in most frameworks.

Core Concepts

Let’s clarify some of the concepts and terminologies first for better comprehension:

What You’re Evaluating (The Agent)

The Agent is the system under test. We would ideally want to evaluate it on multiple axes:

One important nuance on trajectory evaluation: there is a natural instinct to check that agents followed very specific steps. Anthropic warns against this: an overly rigid check can discourage the novel, interesting approaches that might emerge. The better approach (and the consensus across everything I read) is to grade what the agent produced (postconditions), not the exact path it took. Reserve trajectory checks for safety invariants (forbidden tools) and critical ordering constraints (e.g., auth before write). For everything else, trajectory is informational.

Evaluation strategies can be grouped into two categories:

I think this grouping could help in preparing monitoring/observability metrics and dashboards. Organizing test cases into the above categories makes it easy to measure and observe overall system performance and degradations.

How You’re Evaluating (The Grading Pipeline)

There are three types of evaluation mechanisms:

  1. Deterministic graders (code-based) - These should be the foundation of any evaluation strategy and used as much as possible. They are essentially unit-test-like checks.

    For example, for an agent that produces an Excel file from source data, we can easily assert how many rows/columns the file should have, or whether its schema matches the expected schema. These checks are cheap, fast, reproducible, and CI-friendly. Key thing to remember here: if you can write a code check for it, do not use an LLM judge.

    A good question someone asked me when I was presenting this idea: “In a CI environment, you would still have to call an LLM to get the agent output. Then it’s not really a unit test, because stubbing the LLM call doesn’t actually test the agent; the response is what we are trying to test, and it is not deterministic enough to just mock.”

    That’s a fair question: we do need to call model inference to get the agent response. But we can reduce inference calls by reusing a single response to run multiple checks at once; we don’t need a fresh LLM call for every test. (A pass@k or pass^k mechanism would still require multiple calls. More on this later.)

  2. LLM-as-Judge (model-graded) - This is useful for evaluating subjective dimensions only: coherence, helpfulness, tone, reasoning quality. More on this below.

  3. Human graders - This is the last line of evaluation, used for calibration, edge cases, and validating the other two. Expensive but essential as the root of trust.
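The Excel-style deterministic check from item 1 can be sketched in pure Python. This is a minimal illustration under my own assumptions: `check_table` and `EXPECTED_COLUMNS` are hypothetical names, and I assume the agent's spreadsheet output has already been parsed into a list of row dicts. Returning a list of failure messages lets one agent response feed several checks at once, as discussed above.

```python
# Illustrative deterministic grader (hypothetical names, not from any framework):
# check a parsed spreadsheet's shape and schema against expectations.

EXPECTED_COLUMNS = {"id", "name", "amount"}  # assumed schema for the example

def check_table(rows: list[dict], expected_row_count: int) -> list[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    failures = []
    if len(rows) != expected_row_count:
        failures.append(f"expected {expected_row_count} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        if set(row) != EXPECTED_COLUMNS:
            failures.append(f"row {i} schema mismatch: {sorted(row)}")
    return failures
```

With a well-formed single-row table, `check_table(rows, expected_row_count=1)` returns `[]`; any shape or schema mismatch produces a human-readable failure message that can double as an annotation during error analysis.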

For scoring, deterministic graders act as binary gates (all must pass). The LLM judge contributes a weighted score:

def compute_final_score(deterministic_results: list[bool],
                        summary_judge_pass: bool) -> dict:
    if not all(deterministic_results):
        return {"pass": False, "reason": "deterministic_failure"}

    return {
        "pass": True,
        "summary_quality": "PASS" if summary_judge_pass else "FAIL",
        "overall_score": 1.0 if summary_judge_pass else 0.8,
    }

What Grounds Everything (Datasets)

Two distinct datasets serve two distinct purposes:

| Dataset | Structure | Purpose |
|---|---|---|
| Agent eval dataset | (input, expected_output) pairs | Measures agent quality |
| Judge calibration dataset | (agent_output, human_quality_label) pairs | Measures judge accuracy |

These are different datasets. The agent eval dataset answers “is my agent good?” The judge calibration dataset answers “is my judge trustworthy?”.

The expected_output is not limited to exact string matches. It can be structured data for multi-axis scoring, a natural-language reference for an LLM judge, or a verification function.
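The verification-function flavor of expected_output can be as simple as a callable stored alongside the case. A minimal sketch, with entirely hypothetical names (`EvalCase`, `verify`), not tied to any particular framework:

```python
# Hypothetical eval-case structure where expected_output is a verification
# function rather than a literal string to match.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    verify: Callable[[str], bool]  # returns True if the output is acceptable

case = EvalCase(
    input="Deduplicate the customer list",
    # Pass only if the output mentions duplicates AND gives a specific count.
    verify=lambda output: "duplicate" in output and any(c.isdigit() for c in output),
)
```

For instance, `case.verify("Removed 12 duplicate rows")` passes, while the vague `case.verify("Removed duplicates")` fails, encoding the "specific counts" requirement as code rather than an LLM judgment.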

Someone recently told me that when you are building agentic systems, your eval dataset and harness are your moat. It’s easy to hook up an LLM and a few tools, run them in a loop, and call it an agentic system. Anyone can build that. What differentiates you is the curation of the eval datasets and the harness that graduates an impressive prototype into a reliable production system.

Development Loop

In broad strokes, adding and maintaining evals in an LLM-powered application looks like this:

  1. Analyze - examine traces to find where the system fails
  2. Measure - build targeted evaluators that quantify those failures
  3. Improve - experiment with changes (prompts, models, retrieval strategies)
  4. Automate - turn confirmed fixes into regression tests that prevent backsliding

Each rotation makes the system more reliable, increases efficacy, and helps quantify both.

This is not the only development loop for evals, though. The Evals for AI Engineers book recommends a different, albeit similar, approach called Error Analysis.

Error Analysis: The Core Development Loop

This seems to be the most comprehensive approach from what I can gather. Be aware, though, that the process is quite involved.

This is a development methodology for AI agents. Everything else (datasets, judges, prompts) grows out of this loop.

The Workflow

1. Collect traces (~100, both successes and failures)
2. Annotate (human reads each trace, notes failures — no LLMs here)
3. Cluster (group annotations into failure modes)
4. Act (fix agent, refine judge, expand datasets)
5. Verify (run evals to confirm fixes)
→ Repeat

Collect: Each trace should capture the input, all tool calls, intermediate reasoning, and final output. Observability infrastructure is a prerequisite; without traces, meaningful error analysis is impossible.

Annotate: One (or more) human SMEs read each trace carefully, writing brief notes about anything surprising, incorrect, or wrong-feeling. No LLMs in this step. The goal is to find patterns, not just individual failures.

Cluster: Group similar annotations into coherent failure modes (e.g., “hallucinated tool calls”, “lost context after 5 turns”, “wrong tool for date queries”). LLMs can help with initial clustering, but humans must validate. Typically 2-3 rounds to stabilize the taxonomy.
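The clustering step can start as nothing fancier than a frequency tally over annotation labels, so the dominant failure modes surface first. A minimal sketch with illustrative labels drawn from the examples above:

```python
# Minimal sketch of the "cluster" step: tally annotated failure modes so the
# most common ones surface first. Labels are illustrative.
from collections import Counter

annotations = [
    "hallucinated tool call",
    "lost context after 5 turns",
    "hallucinated tool call",
    "wrong tool for date query",
    "hallucinated tool call",
]

failure_modes = Counter(annotations)
for mode, count in failure_modes.most_common():
    print(f"{count:2d}  {mode}")
```

In practice the labels come out of the human annotation pass (possibly LLM-assisted and then human-validated), and this tally is what tells you which failure mode to act on first.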

Act: For each failure mode, triage: is this an agent problem, a judge problem, or a dataset gap? In turn, you adjust prompts, tools and/or eval rubrics/criteria.

Every failure discovered during error analysis feeds all three components simultaneously, where each pass improves the agent, the judge, and the datasets together.

Roughly: Agent problem → adjust the agent prompt or tools. Judge problem → adjust judge rubrics/criteria. Dataset problem → add and/or update tests.

LLM-as-a-Judge

This is a broad topic and quite interesting, not to mention the most nuanced. It’s useful to test subjective qualities of an agent’s responses.

The consensus across practitioners is clear: use LLM judges only for qualities that resist reduction to code checks. They should NOT be used for:

Rubrics and Dimensions

A “rubric” is the instruction text given to the LLM judge defining how to assess a dimension. Dimension = what you grade. Rubric = the grading criteria sheet. For example:

Key design principle from the sources: Use pass/fail judgments over point ratings. Likert scales (1-5) produce inconsistent, unactionable results. Instead of one judge rating “quality: 3/5”, decompose into multiple focused binary judges, each with their own rubric (accuracy: PASS/FAIL, groundedness: PASS/FAIL, completeness: PASS/FAIL). Far more actionable. The judge model should be at least as capable as the model being evaluated and should return structured output (JSON with score, reasoning, citations).
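The decomposition into focused binary judges can be sketched as a set of (dimension, rubric) pairs, each producing its own PASS/FAIL. This is my own illustrative structure (the `JUDGES` dict, `run_judges`, and `call_llm` are hypothetical); the actual LLM call is stubbed out as a parameter:

```python
# Sketch: decompose one vague "quality: 3/5" rating into focused binary judges.
# Each rubric would be sent to the judge LLM; call_llm is a stand-in for that call.

JUDGES = {
    "accuracy": "Every claim matches the change log. Respond PASS or FAIL.",
    "groundedness": "No claims beyond the provided context. Respond PASS or FAIL.",
    "completeness": "All major operations are mentioned. Respond PASS or FAIL.",
}

def run_judges(output: str, call_llm) -> dict[str, bool]:
    """Run each binary judge; call_llm(rubric, output) returns 'PASS' or 'FAIL'."""
    return {dim: call_llm(rubric, output) == "PASS"
            for dim, rubric in JUDGES.items()}
```

The payoff is actionability: a failing `groundedness` judge points at a specific fix, where a "3/5" overall score points nowhere.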

For example, for a data cleaning agent, the summary quality judge looks like this:

summary_judge.py

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SummaryQualityJudge(Evaluator[dict, dict, None]):
    RUBRIC = """
    Evaluate the human-readable summary produced by a data cleaning agent.
    You have access to the structured change log and the agent's summary.

    Criteria:
    1. ACCURACY — Every claim must match the structured change log.
       Must include SPECIFIC COUNTS for quantitative operations.
       Vague statements like "removed duplicates" without counts are a FAIL.
    2. COMPLETENESS — All major operations must be mentioned.
    3. CLARITY — Understandable to a non-technical user.

    PASS if all three criteria are met.
    FAIL if any is not met — explain which and why.

    Respond in JSON: {"score": "PASS" or "FAIL", "reasoning": "..."}
    """

    def evaluate(self, ctx: EvaluatorContext[dict, dict, None]) -> bool:
        # Call LLM with self.RUBRIC, the agent's summary, and the change log
        # Parse structured JSON response and return score == "PASS"
        ...

Known Biases

Note that there are some known biases of LLM-as-a-Judge as studied in the MT-Bench paper: position bias (favoring the first of two presented answers), verbosity bias (favoring longer answers), self-enhancement bias (favoring the model’s own outputs), and limited capability in grading math and reasoning questions.

Building and Calibrating a Judge

The process of building and calibrating a Judge requires a judge calibration dataset: (agent_output, human_quality_label) pairs where a human expert has rated each output as pass or fail with reasoning.

From a dataset of ~50-100 examples, the split looks like: 3-8 carefully selected examples as few-shot anchors in the judge prompt (both pass and fail, with critiques), roughly half as a dev set for iterating on the rubric, and the rest held out as a test set for final validation. The rubric text does the heavy lifting. Few-shot examples are calibration anchors, not training data.
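The split above can be sketched as a small helper. This is an illustrative sketch under my own assumptions (`split_calibration` is a hypothetical name; proportions follow the rough guidance in the text):

```python
# Illustrative split of a judge calibration dataset into few-shot anchors,
# a dev set for rubric iteration, and a held-out test set for final validation.
import random

def split_calibration(examples: list, n_anchors: int = 5, seed: int = 0):
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    anchors = shuffled[:n_anchors]     # embedded in the judge prompt as few-shot
    rest = shuffled[n_anchors:]
    dev = rest[: len(rest) // 2]       # iterate on the rubric against these
    test = rest[len(rest) // 2 :]      # never touched until final validation
    return anchors, dev, test
```

In practice the anchors would be hand-picked (both pass and fail, with critiques) rather than randomly sampled; the random split here is only for the dev/test halves.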

Measure the judge with True Positive Rate (does the judge catch real failures?) and True Negative Rate (does it avoid flagging correct outputs?). Compare these against human labels on the held-out test set.

Humans (label small samples, high quality)
  → calibrate LLM Judge (validate TPR/TNR against human labels)
    → LLM Judge evaluates Agent (at scale)
      → Agent serves users
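The TPR/TNR comparison against human labels can be computed with a few lines. A minimal sketch (`judge_agreement` is a hypothetical name; `True` here means "flagged as a failure"):

```python
# Compare judge verdicts against human labels on the held-out test set.
# True = flagged as a failure, for both the judge and the human.
def judge_agreement(judge_flags: list[bool], human_flags: list[bool]) -> dict:
    tp = sum(j and h for j, h in zip(judge_flags, human_flags))
    tn = sum(not j and not h for j, h in zip(judge_flags, human_flags))
    pos = sum(human_flags)            # real failures, per humans
    neg = len(human_flags) - pos      # correct outputs, per humans
    return {
        "tpr": tp / pos if pos else None,  # does the judge catch real failures?
        "tnr": tn / neg if neg else None,  # does it avoid flagging good outputs?
    }
```

A judge with high TPR but low TNR is over-flagging (noisy); high TNR but low TPR means it is missing real failures. Both numbers matter before trusting the judge at scale.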

Why this works despite LLMs being imperfect: judging is fundamentally easier than generating. An LLM that hallucinates during open-ended generation can often reliably do scoped binary classification (“is this output correct given these criteria?”). The MT-Bench paper showed GPT-4 achieves >80% agreement with human preferences, comparable to human-human agreement.

Can you use the same model for both agent and judge? From what I gathered, usually yes. Self-preference bias exists but is often small enough not to matter. What matters more is empirical alignment with humans on your specific task.

The alignment loop: Run judge on dev set, compare to human labels. Where the judge disagrees, examine each disagreement: is the rubric ambiguous, or is this a genuinely hard edge case? Refine the rubric for ambiguities. Document edge cases rather than over-fitting. Validate on the test set. Re-calibrate periodically (every 1-2 months) with fresh production samples.

From Prototype to Production

The maturity lifecycle gives a rough roadmap. Two prerequisites should be in place before even thinking about formal evals:

Prerequisite: Observability. Every source I read emphasized this: add tracing and logging from the very first prototype. Capture inputs, tool calls, intermediate reasoning, and outputs. You cannot evaluate what you cannot observe, and meaningful error analysis is impossible without traces. This is the foundation everything else builds on.

Prerequisite: Deterministic checks as you build. I found it helpful to think of deterministic graders as unit tests for agents. Just as I would not write a function without a test, I would not add agent capabilities without a corresponding check. These checks accumulate into the eval suite naturally. The point is not “adding evals later”; it is building them alongside the agent.

Whatever I am already testing manually during prototyping should go directly into the eval dataset. If I am copy-pasting a CSV into the agent and eyeballing the output, that CSV and my mental “looks correct” criteria are the first eval case. Formalizing them early creates a regression safety net as the agent evolves.

With that foundation in place, the maturity lifecycle for the cleaning agent might look like:

Prototype: Start with ~25 cases, seeded from whatever you have been testing by hand. Read every transcript manually. Deterministic graders only. Iterate rapidly on prompts and tool descriptions. Fix the most common failure modes first. Promote stable cases to regression evals as they consistently pass.

MVP: Add regression evals for things that work. Introduce the LLM judge for summary quality. Calibrate with a small dataset. Grow the eval dataset to 80-100 cases via synthetic generation. Start cost tracking. Run a second round of error analysis to find remaining edge cases. Add adversarial inputs.

Production: Watch for eval saturation (retire easy evals or increase difficulty). Add adversarial tasks. Production monitoring with sampled evaluation. Re-calibrate judges every 1-2 months. Measure with pass@k (at least one of k attempts correct) for capability. If reliability matters (and in production it usually does), track pass^k (all k attempts must succeed) instead. pass@k tells you what the agent can do; pass^k (a practitioner framing, not a formally standardized metric) tells you what it will do consistently.
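The pass@k versus pass^k distinction is easy to make concrete in code. A sketch (hypothetical function names), where each task has k recorded attempts:

```python
# pass@k vs pass^k over repeated trials of the same tasks.
# trials: one list of booleans per task, one boolean per attempt.
def pass_at_k(trials: list[list[bool]]) -> float:
    """Capability: fraction of tasks where AT LEAST ONE attempt succeeded."""
    return sum(any(t) for t in trials) / len(trials)

def pass_hat_k(trials: list[list[bool]]) -> float:
    """Consistency: fraction of tasks where ALL attempts succeeded."""
    return sum(all(t) for t in trials) / len(trials)
```

For `trials = [[True, True], [True, False], [False, False]]`, pass@k is 2/3 while pass^k is only 1/3 — the gap between what the agent can do and what it does consistently.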

For CI/CD: run fast deterministic evals on every commit (25 cases, sub-minute feedback), and comprehensive evals including LLM judges on PRs or nightly builds. In production, randomly sample traces (5%), run LLM judges on sampled traces to detect quality drift, and run the full error analysis workflow on fresh data every few weeks.

Observability vs. Evaluation: These are complementary. Observability answers “what happened?” (traces, latency, errors). Evaluation answers “was it good?” (quality, correctness, safety). Observability provides the data; evaluation provides the judgment.

A note on pairwise evaluation. Everything above uses absolute evaluation (comparing agent output against a reference). For A/B testing model versions or prompt variations, pairwise (relative) evaluation is often more reliable: show two outputs side by side and ask “which is better?” Both humans and LLM judges find relative comparison easier than assigning absolute scores. Most practitioners I read about use absolute reference-based evaluation day-to-day and pairwise comparison only for A/B tests.

Frameworks and Tooling

For reference, here is a comparison of frameworks relevant to building this:

Eval Frameworks:

| Framework | Approach | Best for |
|---|---|---|
| pydantic-evals | Code-first Python, Dataset/Case/Evaluator, OpenTelemetry spans | Python-native agent eval with trajectory analysis |
| OpenAI Evals Platform | Hosted infrastructure, trace grading | OpenAI ecosystem, rapid iteration |
| Promptfoo | YAML-config-driven, CI/CD native, red-team testing | CI/CD integration, security testing |
| Braintrust | Experiment tracking, judge calibration | Experiment-driven development, A/B testing |
| DeepEval | Comprehensive eval framework, 50+ eval metrics, pytest-like unit testing | All-in-one framework with CI/CD, integrated observability and monitoring |

Observability and Monitoring:

| Framework | Approach | Best for |
|---|---|---|
| Arize Phoenix | Production monitoring, online evaluations | OSS, self-hosted, observability + evaluation |
| Pydantic Logfire | OpenTelemetry-native, pydantic-ai integration | Python/pydantic-ai ecosystem |
| Langfuse | Open-source LLM observability, tracing | Self-hosted, multi-framework support |
| LangSmith | Chain debugging, annotation queues | LangChain ecosystem |

Obviously, there are existing observability and monitoring tools like Datadog or the whole Grafana stack; I saw Datadog has a new LLM Observability feature set. If you are evaluating any such tool or platform, check whether it supports the new OpenTelemetry GenAI semantic conventions.

Personal Notes

References

Primary Sources

Academic Papers

Supplementary