I have been building a few agent prototypes and hit a wall that I think is common: beyond a certain point, reliability just falls apart. LLMs fail silently. Traditional software throws an exception when something goes wrong; an LLM produces a plausible-sounding but wrong answer, and nobody notices until real damage is done. With newer models, frameworks, and techniques shipping every week, it is hard to know whether a change is actually better or just different. I needed a systematic way to measure quality. That is what evals are.
I spent a few weeks going deep on evaluation practices for AI agents. I read everything I could find: Anthropic’s engineering blog, Hamel Husain’s practitioner-focused guides, the Evals for AI Engineers book by Shreya Shankar and Hamel Husain, the MT-Bench paper, and several eval framework docs. This post is my synthesis of what actually matters, what the common traps are, and how to think about evaluation systematically. To keep things concrete, I work through a design exercise applying these ideas to a hypothetical data cleaning agent. (Terminology varies across frameworks; see the quick reference at the end if you run into unfamiliar names.)
TL;DR:
- Observability first. You cannot evaluate what you cannot observe. Add tracing from day one.
- Deterministic checks are unit tests for agents. Write them as you build, not after.
- Three-layer eval stack: what you evaluate (the agent), how you grade (deterministic checks, then LLM-as-a-Judge, then humans), and what grounds it all (datasets).
- Error analysis is the development methodology. Not testing. The agent, judge, and datasets co-evolve through the same loop.
- Start with 20-50 cases focused on known failures. Use binary pass/fail over Likert scales. Never stop reading traces.
Why Evals Matter
Without evals, every change to a prompt, model, or retrieval pipeline could improve one thing and quietly break three others. I was flying blind, relying on vague intuitions instead of concrete measurements.
Evals create a flywheel of improvement:
- Analyze - examine data to find where the system fails
- Measure - build targeted evaluators that quantify those failures
- Improve - experiment with changes (prompts, models, retrieval strategies)
- Automate - turn confirmed fixes into regression tests that prevent backsliding
Each rotation makes your system more reliable. The eval loop is the development loop.
Design Exercise: Evaluating a Data Cleaning Agent
To make this synthesis concrete, I will work through how I would design an eval suite for a specific agent, applying the principles from the sources above. None of this has been run end-to-end. It is a design exercise, not a case study. The specific numbers are illustrative; the architecture is what matters.
The agent: accepts a messy CSV or Excel file plus a natural-language cleaning spec (e.g., “standardize date columns to ISO-8601, remove duplicates on order_id, fill missing country from city lookup”). It outputs a cleaned file plus a structured change log.
| Property | Value |
|---|---|
| Tools | read_file, write_file, pandas_transform, column_stats, preview_rows |
| Output | Cleaned CSV/Excel + JSON change log |
| Grading balance | ~80% deterministic code graders, ~20% LLM judge |
| Framework | pydantic-evals |
The Three-Layer Evaluation Stack
There are three distinct layers in any evaluation system. Keeping them separate is critical because conflating them is where most of the confusion in this space comes from. Use the same dataset to measure both “is my agent good?” and “is my judge trustworthy?” and it becomes impossible to tell which one is broken when scores drop.
Layer 1: What You’re Evaluating (The Agent)
The agent is the system under test. The sources I read consistently break agent behavior into several dimensions: output quality (is the final answer correct?), tool use (right tools, right parameters, no hallucinated calls?), trajectory (sensible path to the answer?), interaction (multi-turn coherence, error recovery), and cost (tokens, latency, dollars).
There are also cross-cutting behavioral dimensions worth tracking: planning, self-reflection, and memory across turns.
One important nuance on trajectory evaluation. There is a natural instinct to check that agents followed very specific steps. Anthropic warns against this: it is too rigid. The better approach (and the consensus across everything I read) is to grade what the agent produced (postconditions), not the exact path it took. Reserve trajectory checks for safety invariants (forbidden tools) and critical ordering constraints (e.g., auth before write). For everything else, trajectory is informational.
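To make that distinction concrete, here is a minimal sketch of a trajectory check that is limited to safety invariants and ordering constraints. It assumes the trace has already been reduced to an ordered list of tool names. The function name check_trajectory and the forbidden tool delete_file are hypothetical, made up for illustration; column_stats and pandas_transform are the cleaning agent's tools.

```python
from typing import Sequence


def check_trajectory(
    tool_calls: Sequence[str],
    forbidden: frozenset = frozenset({"delete_file"}),  # hypothetical forbidden tool
    must_precede: Sequence[tuple] = (("column_stats", "pandas_transform"),),
) -> list:
    """Return the violated invariants; an empty list means the trajectory passes."""
    calls = list(tool_calls)
    violations = []

    # Safety invariant: forbidden tools must never appear in the trace.
    for name in sorted(set(calls) & forbidden):
        violations.append(f"forbidden tool called: {name}")

    # Ordering constraint: `first` must appear before the first use of `then`.
    for first, then in must_precede:
        if then in calls and first not in calls[: calls.index(then)]:
            violations.append(f"{then} called before {first}")

    return violations
```

Everything not covered by an explicit invariant is left unconstrained, so the agent remains free to take any path that satisfies the postconditions.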
For the data cleaning agent, the dimension mapping looks like this:
| Dimension | What it means here | Grading approach |
|---|---|---|
| Output quality | Is the cleaned file correct? All specified transforms applied? | Deterministic (row counts, schema checks, cell-level diffs) |
| Tool use | Did it call pandas_transform with valid operations? No hallucinated column names? | Deterministic (tool call log validation) |
| Trajectory | Did it preview data before transforming? Avoid unnecessary passes? | Informational (step count, checked via code) |
| Cost | Tokens, latency, number of tool calls | Deterministic (budget thresholds) |
Evals also serve two different purposes: capability evals (“does this agent do X well?”, with scores that start low and climb) and regression evals (“does it still do X as well as before?”, which should stay at 100%). When a capability eval consistently hits near-perfect scores, promote it to a regression eval or increase difficulty. For the cleaning agent, handling exotic encodings would be a capability eval; basic dedup would be a regression eval.
Layer 2: How You’re Grading (The Grading Pipeline)
Three types of graders, in priority order. The consistent advice I found is to use the cheapest reliable option first:
- Deterministic graders (code-based) - use wherever possible. String matching, schema validation, tool call verification, pass/fail tests. Cheap, fast, reproducible, CI-friendly. If you can write a code check for it, do not use an LLM judge.
- LLM-as-Judge (model-graded) - for subjective dimensions only. Coherence, helpfulness, tone, reasoning quality. More on this below.
- Human graders - for calibration, edge cases, and validating the other two. Expensive but essential as the root of trust.
For the cleaning agent, about 80% of the grading is deterministic. Here are the core graders:
Note: The pydantic-evals API is simplified here for illustration. The exact span attribute keys (e.g., gen_ai.tool.name) depend on what your agent framework records via OpenTelemetry. Check the pydantic-evals docs for the current API.
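As a framework-agnostic illustration of what deterministic graders for this agent could look like, here is a minimal sketch using only the Python standard library. It is not the pydantic-evals API. The function names (load_csv, is_iso_date, grade_cleaned_csv) and the shape of the expected dictionary are assumptions of mine; the column names order_id and order_date come from the example cleaning spec.

```python
import csv
import io
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def load_csv(csv_text: str):
    """Parse CSV text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def is_iso_date(value: str) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def grade_cleaned_csv(csv_text: str, expected: dict) -> dict:
    """Run four binary checks and return one pass/fail flag per check."""
    columns, rows = load_csv(csv_text)
    low, high = expected["row_count_range"]
    order_ids = [row.get("order_id") for row in rows]
    return {
        "row_count": low <= len(rows) <= high,
        "schema": columns == expected["columns"],
        "no_duplicate_ids": len(set(order_ids)) == len(order_ids),
        "iso_dates": all(is_iso_date(row.get("order_date") or "") for row in rows),
    }
```

Each check returns a plain boolean, which keeps the results reproducible and easy to gate on in CI.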
Additional graders would cover spot checks on specific cell values, null counts, and change log verification. Eight deterministic graders total.
The remaining ~20% needs an LLM judge: specifically, evaluating whether the human-readable summary in the change log is accurate, complete, and clear. That is the one dimension that resists reduction to code. More on building and calibrating this judge in the LLM-as-a-Judge section.
For scoring, deterministic graders act as binary gates (all must pass). The LLM judge does not gate the result; it only lowers the overall score when the summary fails (1.0 versus 0.8 in this sketch):
def compute_final_score(deterministic_results: list[bool],
                        summary_judge_pass: bool) -> dict:
    # Deterministic graders are hard gates: any failure fails the whole case.
    if not all(deterministic_results):
        return {"pass": False, "reason": "deterministic_failure"}
    # The judge verdict does not block a pass; it only lowers the score.
    return {
        "pass": True,
        "summary_quality": "PASS" if summary_judge_pass else "FAIL",
        "overall_score": 1.0 if summary_judge_pass else 0.8,
    }
Layer 3: What Grounds Everything (Datasets)
Two distinct datasets serve two distinct purposes:
| Dataset | Structure | Purpose |
|---|---|---|
| Agent eval dataset | (input, expected_output) pairs | Measures agent quality |
| Judge calibration dataset | (agent_output, human_quality_label) pairs | Measures judge accuracy |
These are different datasets. The agent eval dataset answers “is my agent good?” The judge calibration dataset answers “is my judge trustworthy?” Conflating them was one of the most common sources of confusion I encountered in the literature.
The expected_output is not limited to exact string matches. It can be structured data for multi-axis scoring, a natural-language reference for an LLM judge, or a verification function.
For the cleaning agent, I would start with ~25 hand-curated (input, expected_output) pairs. Each input is a messy file plus a cleaning instruction. Each expected output describes the cleaned file (row counts, columns, spot-checked cells) plus the expected change log:
# Case 1: Basic dedup + date normalization
- name: "dedup_and_dates"
  inputs:
    file: "fixtures/orders_messy_001.csv"
    instruction: >
      Remove duplicate rows based on order_id (keep first occurrence).
      Standardize the order_date column to ISO-8601 format (YYYY-MM-DD).
  expected_output:
    row_count_range: [140, 150]
    columns: ["order_id", "customer", "order_date", "amount"]
    change_log:
      rows_removed: 8
      cells_modified: 23
    spot_checks:
      - row: 0
        order_date: "2024-01-15"
    summary_reference: >
      The agent should mention removing duplicate orders with a specific
      count and converting date cells from mixed formats (MM/DD/YYYY,
      DD-Mon-YY) to ISO-8601.
  tags: ["dedup", "date-normalization", "basic"]
# Case 2: Null filling with lookup
- name: "null_fill_country_from_city"
  inputs:
    file: "fixtures/customers_missing_country.csv"
    instruction: >
      Fill missing values in the 'country' column by looking up the city.
      If the city is not recognizable, set country to 'UNKNOWN'.
  expected_output:
    row_count: 500
    null_counts:
      country: 0
    change_log:
      cells_modified: 47
      cells_set_unknown: 3
    spot_checks:
      - row: 12
        city: "Mumbai"
        country: "India"
      - row: 388
        city: "Xyzzyville"
        country: "UNKNOWN"
  tags: ["null-fill", "lookup", "medium"]
# Case 3: Negative case - agent should WARN, not crash
- name: "missing_required_column"
  inputs:
    file: "fixtures/transactions_no_id.csv"
    instruction: >
      Deduplicate on transaction_id and normalize amounts to 2 decimal places.
  expected_output:
    should_warn: true
    warning_contains: "transaction_id"
    change_log:
      error: "Column 'transaction_id' not found in input file"
  tags: ["negative-case", "schema-validation", "warning"]
A few design principles I took away: Include both positive and negative cases. Include a reference solution that passes all graders to verify the pipeline itself (a 0% pass rate usually means the graders are broken, not the agent). Tag cases for filtering so you can run only “basic” cases on every commit and the full suite on PRs.
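Tag-based filtering is simple enough to sketch directly. The example below assumes each case has been loaded into a dictionary with name and tags keys, mirroring the cases above; select_cases is a hypothetical helper, not part of any framework.

```python
CASES = [
    {"name": "dedup_and_dates", "tags": ["dedup", "date-normalization", "basic"]},
    {"name": "null_fill_country_from_city", "tags": ["null-fill", "lookup", "medium"]},
    {"name": "missing_required_column", "tags": ["negative-case", "schema-validation", "warning"]},
]


def select_cases(cases: list, required_tag: str) -> list:
    """Keep only the cases that carry the given tag."""
    return [case for case in cases if required_tag in case.get("tags", [])]


# Per-commit run: only the cheap "basic" cases. PR run: the full list.
commit_suite = select_cases(CASES, "basic")
pr_suite = CASES
```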
LLM-as-a-Judge
This is the part of the evaluation stack I found most interesting and most nuanced, so I want to go deeper here.
The consensus across practitioners is clear: use LLM judges only for qualities that resist reduction to code checks. They should not be used for:
- Format/schema validation or length constraints (use code)
- Exact match requirements or regex-matchable patterns (use code)
- Code correctness (use unit tests)
- Factual verification without reference context (the judge will hallucinate evaluations)
- Specialized domains (medicine, law, finance) without grounding material
- Cases requiring strict reproducibility (the judge itself is non-deterministic)
Rubrics and Dimensions
A “rubric” is the instruction text given to the LLM judge defining how to assess a dimension. Dimension = what you grade. Rubric = the grading criteria sheet. For example:
- Dimension: “helpfulness”
- Rubric: “The response must: (1) directly address the user’s question, (2) provide at least one actionable next step, (3) not require follow-up questions for basic info. PASS if all three are met. FAIL if any is not met.”
Key design principle from the sources: Use pass/fail judgments over point ratings. Likert scales (1-5) produce inconsistent, unactionable results. Instead of one judge rating “quality: 3/5”, decompose into multiple focused binary judges, each with their own rubric (accuracy: PASS/FAIL, groundedness: PASS/FAIL, completeness: PASS/FAIL). Far more actionable. The judge model should be at least as capable as the model being evaluated and should return structured output (JSON with score, reasoning, citations).
For the cleaning agent, the summary quality judge looks like this:
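A minimal sketch of how such a judge could be wired up, assuming the model is reachable through a plain function that takes a prompt string and returns a JSON string. The rubric wording, the call_model parameter, and the function names are illustrative assumptions of mine, not a specific framework's API.

```python
import json
from typing import Callable

# Illustrative rubric: binary verdict, grounded in the structured change log.
RUBRIC = (
    "You are grading the summary of a data cleaning run.\n"
    "PASS only if all of the following hold:\n"
    "1. Every count in the summary matches the change log.\n"
    "2. Every operation in the change log is mentioned.\n"
    "3. The summary states specific counts, not vague descriptions.\n"
    "Do not reward length.\n"
    'Reply with JSON: {"verdict": "PASS" or "FAIL", "reasoning": "..."}'
)


def build_judge_prompt(summary: str, change_log: dict) -> str:
    """Combine the rubric, the change log (as reference), and the summary."""
    log_text = json.dumps(change_log, sort_keys=True)
    return f"{RUBRIC}\n\nChange log (JSON):\n{log_text}\n\nSummary:\n{summary}\n"


def parse_verdict(raw: str) -> dict:
    """Validate the judge's JSON reply and reduce it to a boolean plus reasoning."""
    data = json.loads(raw)
    verdict = str(data.get("verdict", "")).upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return {"passed": verdict == "PASS", "reasoning": data.get("reasoning", "")}


def judge_summary(summary: str, change_log: dict,
                  call_model: Callable[[str], str]) -> dict:
    """call_model is a hypothetical stand-in for whatever LLM client you use."""
    return parse_verdict(call_model(build_judge_prompt(summary, change_log)))
```

Passing the model call in as a parameter keeps the prompt construction and reply parsing testable without a network call.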
Known Biases
From the MT-Bench paper:
| Bias | Mitigation |
|---|---|
| Position bias (favors first/second response) | Call twice with positions swapped |
| Verbosity bias (favors longer responses) | Include length-neutrality in rubric |
| Self-enhancement (favors same model’s output; effect size is often small) | Use a different model family as judge |
| Limited reasoning (validates wrong math/logic) | Use chain-of-thought prompting and reference-guided grading |
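The position-swap mitigation can be sketched in a few lines. This assumes a pairwise judge exposed as a function judge(first, second) that returns the string "first" or "second"; that interface and the name pairwise_with_swap are assumptions for illustration. A win is declared only when the same answer is preferred in both orders, and an inconsistent result is treated as a tie.

```python
from typing import Callable


def pairwise_with_swap(answer_a: str, answer_b: str,
                       judge: Callable[[str, str], str]) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    a_wins_forward = judge(answer_a, answer_b) == "first"    # A shown first
    a_wins_backward = judge(answer_b, answer_a) == "second"  # A shown second

    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The preference flipped with the order, so position drove the verdict.
    return "tie"
```

A judge that always prefers whatever is shown first produces a tie on every pair, which makes position bias visible instead of letting it leak into win rates.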
Building and Calibrating a Judge
This is where it gets concrete. The process requires a judge calibration dataset: (agent_output, human_quality_label) pairs where a human expert has rated each output as pass or fail with reasoning.
From a dataset of ~50-100 examples, the split looks like: 3-8 carefully selected examples as few-shot anchors in the judge prompt (both pass and fail, with critiques), roughly half as a dev set for iterating on the rubric, and the rest held out as a test set for final validation. The rubric text does the heavy lifting. Few-shot examples are calibration anchors, not training data.
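The split described above might be implemented like this. It is a sketch under two assumptions: each labeled example is a dictionary with an id key, and the few-shot anchors have already been picked by hand and are passed in by id. The function name split_calibration is hypothetical.

```python
import random


def split_calibration(examples: list, few_shot_ids: set, seed: int = 0) -> dict:
    """Separate hand-picked anchors, then split the rest into dev and test halves."""
    few_shot = [ex for ex in examples if ex["id"] in few_shot_ids]
    rest = [ex for ex in examples if ex["id"] not in few_shot_ids]

    # Fixed seed so the dev/test boundary stays stable across rubric iterations.
    random.Random(seed).shuffle(rest)
    mid = len(rest) // 2
    return {"few_shot": few_shot, "dev": rest[:mid], "test": rest[mid:]}
```

Keeping the seed fixed matters: if examples drift between dev and test while the rubric is being tuned, the test set stops being held out.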
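Measure the judge with True Positive Rate (of the outputs humans labeled FAIL, what fraction does the judge also fail?) and True Negative Rate (of the outputs humans labeled PASS, what fraction does the judge also pass?). Here a "positive" means a real failure. Compute both against human labels on the held-out test set.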
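Both rates are simple to compute once the labels are aligned. The sketch below treats a real failure as the positive class, matching the description above, and assumes labels are the strings "PASS" and "FAIL". The function name judge_rates is hypothetical.

```python
def judge_rates(human_labels: list, judge_labels: list) -> dict:
    """Compute TPR and TNR of the judge against human labels (positive = FAIL)."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human_labels, judge_labels))
    true_pos = sum(h == "FAIL" and j == "FAIL" for h, j in pairs)
    false_neg = sum(h == "FAIL" and j == "PASS" for h, j in pairs)
    true_neg = sum(h == "PASS" and j == "PASS" for h, j in pairs)
    false_pos = sum(h == "PASS" and j == "FAIL" for h, j in pairs)

    real_failures = true_pos + false_neg
    real_passes = true_neg + false_pos
    return {
        # Share of real failures the judge caught.
        "tpr": true_pos / real_failures if real_failures else float("nan"),
        # Share of correct outputs the judge left alone.
        "tnr": true_neg / real_passes if real_passes else float("nan"),
    }
```

Reporting the two rates separately, rather than a single accuracy figure, keeps a judge that passes everything from looking good on a dataset where most outputs are correct.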
This is what breaks the circular dependency of “using LLMs to evaluate LLMs,” and it was one of the things I found most clarifying. There is no need for another LLM to evaluate the judge. Humans are the root of trust, but only at a manageable scale (dozens to low hundreds of examples). The judge then scales that human judgment to thousands of evaluations.
Humans (label small samples, high quality)
→ calibrate LLM Judge (validate TPR/TNR against human labels)
→ LLM Judge evaluates Agent (at scale)
→ Agent serves users
Why this works despite LLMs being imperfect: judging is fundamentally easier than generating. An LLM that hallucinates during open-ended generation can often reliably do scoped binary classification (“is this output correct given these criteria?”). The MT-Bench paper showed GPT-4 achieves >80% agreement with human preferences, comparable to human-human agreement.
Can you use the same model for both agent and judge? From what I gathered, usually yes. Self-preference bias exists but is often small enough not to matter. What matters more is empirical alignment with humans on your specific task.
To illustrate the shape of a calibration dataset, here is what entries for the cleaning agent’s summary judge might look like. The key: clear PASS and FAIL examples with human reasoning that explains the judgment:
# Calibration entry: PASS - summary matches change log with specific counts
- agent_output:
    summary: "Removed N duplicate orders and converted M date cells to ISO format."
    change_log: {rows_removed: N, cells_modified: M}
  human_label: PASS
  human_reasoning: "Summary accurately reflects change log. Clear and concise."
# Calibration entry: FAIL - too vague, no counts
- agent_output:
    summary: "Cleaned the data by removing duplicates and fixing dates."
    change_log: {rows_removed: N, cells_modified: M}
  human_label: FAIL
  human_reasoning: "Too vague: no specific counts. 'Fixing dates' doesn't specify format."
# Calibration entry: FAIL - factual mismatch with change log
- agent_output:
    summary: "Removed X duplicate orders and normalized all dates."
    change_log: {rows_removed: Y, cells_modified: M}
  human_label: FAIL
  human_reasoning: "Claims X duplicates but change log says Y. Factual inaccuracy."
The iteration process would look like this: run the judge on the dev set, measure TPR and TNR against human labels, refine the rubric where the judge disagrees with humans. From what I read, each round of refinement tends to improve one metric at the cost of the other. Adding a specificity criterion (vague counts = FAIL) might improve TPR but slightly reduce TNR. The goal is a rubric that is stable across both metrics, not one that maximizes either.
The alignment loop: Run judge on dev set, compare to human labels. Where the judge disagrees, examine each disagreement: is the rubric ambiguous, or is this a genuinely hard edge case? Refine the rubric for ambiguities. Document edge cases rather than over-fitting. Validate on the test set. Re-calibrate periodically (every 1-2 months) with fresh production samples.
Error Analysis: The Core Development Loop
This was the biggest reframing for me. Error analysis is not a dataset-building technique. It is the central development methodology for AI agents. Everything else (datasets, judges, prompts) grows out of this loop.
The Workflow
1. Collect traces (~100, both successes and failures)
2. Annotate (human reads each trace, notes failures — no LLMs here)
3. Cluster (group annotations into failure modes)
4. Act (fix agent, refine judge, expand datasets)
5. Verify (run evals to confirm fixes)
→ Repeat
Collect: Each trace should capture the input, all tool calls, intermediate reasoning, and final output. Observability infrastructure is a prerequisite; without traces, meaningful error analysis is impossible.
Annotate: A human SME reads each trace carefully, writing brief notes about anything surprising, incorrect, or wrong-feeling. No LLMs in this step. The goal is to find patterns, not just individual failures.
Cluster: Group similar annotations into coherent failure modes (e.g., “hallucinated tool calls”, “lost context after 5 turns”, “wrong tool for date queries”). LLMs can help with initial clustering, but humans must validate. Typically 2-3 rounds to stabilize the taxonomy.
Act: For each failure mode, triage: is this an agent problem, a judge problem, or a dataset gap?
For the cleaning agent, a hypothetical failure taxonomy after ~100 traces might look like this (illustrative figures):
| Cluster | Count | Root Cause | Triage |
|---|---|---|---|
| Column hallucination | 18 | Agent does not check column_stats before transforming | Agent problem |
| Incomplete multi-step | 12 | Agent loses track after 3+ operations | Agent problem |
| Silent type error | 9 | pandas_transform returns success without type validation | Agent + Judge |
| Overcleaning | 7 | Ambiguous instruction interpretation | Dataset gap |
| Summary drift | 5 | Summary generated mid-process, not updated after | Judge problem |
Actions per cluster:
- Column hallucination (agent): Add a hard constraint in the system prompt: “ALWAYS call column_stats before any pandas_transform.” Add a trajectory grader enforcing this ordering.
- Incomplete multi-step (agent): Add a planning step: “For multi-step instructions, first list all operations, then execute in order.”
- Silent type error (agent + judge): Add post-transform type validation and a ColumnTypeEvaluator grader.
- Overcleaning (dataset): Add 3 new negative cases where blank optional fields should NOT trigger row deletion.
- Summary drift (judge): Refine the LLM judge rubric to explicitly cross-check summary claims against the structured change log.
Every failure discovered also feeds back into the agent eval dataset as a regression test. Error analysis feeds all three components simultaneously. It is not circular; it is a spiral where each pass improves the agent, the judge, and the datasets together.
When multiple people are annotating: Draft working definitions of failure/success with pass/fail examples. Have each annotator label a common set of 20-50 traces independently, then measure inter-annotator agreement (Cohen’s kappa). Low agreement signals ambiguous rubrics. Run alignment sessions to discuss disagreements and clarify. One strong domain expert matters more than a committee. Find the principal domain expert whose judgment drives decisions.
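For reference, Cohen's kappa for two annotators can be computed with the standard library alone. This sketch assumes both annotators labeled the same traces in the same order; cohens_kappa is my own helper name.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Two annotators who agree on 3 of 4 traces can still land at a kappa of only 0.5, because part of that agreement is expected by chance.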
From Prototype to Production
The maturity lifecycle gives a rough roadmap. Two prerequisites should be in place before even thinking about formal evals:
Prerequisite: Observability. Every source I read emphasized this: add tracing and logging from the very first prototype. Capture inputs, tool calls, intermediate reasoning, and outputs. You cannot evaluate what you cannot observe, and meaningful error analysis is impossible without traces. This is the foundation everything else builds on. For the cleaning agent, that means logging every pandas_transform call, its parameters, and the result, not just the final output.
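As a rough picture of what "logging every tool call" means, here is a sketch of a decorator that records each call's arguments, result or error, and duration. It writes to an in-memory list purely for illustration; a real setup would send spans to a tracing backend such as OpenTelemetry. The names traced and TRACE are hypothetical, and the pandas_transform body is a stand-in for the real tool.

```python
import functools
import time

TRACE: list = []  # in-memory stand-in for a real tracing backend


def traced(tool):
    """Wrap a tool function so that every call is appended to TRACE."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        record = {"tool": tool.__name__, "args": args, "kwargs": kwargs}
        start = time.perf_counter()
        try:
            record["result"] = tool(*args, **kwargs)
            return record["result"]
        except Exception as exc:
            record["error"] = repr(exc)  # failures are recorded, then re-raised
            raise
        finally:
            record["seconds"] = time.perf_counter() - start
            TRACE.append(record)
    return wrapper


@traced
def pandas_transform(operation: str, column: str) -> str:
    # Stand-in body; the real tool would run the transform on a DataFrame.
    if not column:
        raise ValueError("column must not be empty")
    return f"applied {operation} to {column}"
```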
Prerequisite: Deterministic checks as you build. I found it helpful to think of deterministic graders as unit tests for agents. Just as I would not write a function without a test, I would not add agent capabilities without a corresponding check. When I add date normalization to the cleaning agent, I would write a RowCountEvaluator and a spot-check grader at the same time. These checks accumulate into the eval suite naturally. The point is not “adding evals later”; it is building them alongside the agent.
Whatever I am already testing manually during prototyping should go directly into the eval dataset. If I am copy-pasting a CSV into the agent and eyeballing the output, that CSV and my mental “looks correct” criteria are the first eval case. Formalizing them early creates a regression safety net as the agent evolves.
With that foundation in place, the maturity lifecycle for the cleaning agent might look like:
Prototype: Start with ~25 cases, seeded from whatever you have been testing by hand. Read every transcript manually. Deterministic graders only. Iterate rapidly on prompts and tool descriptions. Fix the most common failure modes first. Promote stable cases to regression evals as they consistently pass.
MVP: Add regression evals for things that work. Introduce the LLM judge for summary quality. Calibrate with a small dataset. Grow the eval dataset to 80-100 cases via synthetic generation. Start cost tracking. Run a second round of error analysis to find remaining edge cases. Add adversarial inputs.
Production: Watch for eval saturation (retire easy evals or increase difficulty). Add adversarial tasks. Production monitoring with sampled evaluation. Re-calibrate judges every 1-2 months. Measure with pass@k (at least one of k attempts correct) for capability. If reliability matters (and in production it usually does), track pass^k (all k attempts must succeed) instead. pass@k tells you what the agent can do; pass^k (a practitioner framing, not a formally standardized metric) tells you what it will do consistently.
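The gap between the two metrics is easy to see numerically. The sketch below assumes trials are independent and share one per-trial success rate p, estimated from observed results; under that assumption pass@k is 1 - (1 - p)^k and pass^k is p^k. The function names are my own.

```python
def success_rate(trial_results: list) -> float:
    """Estimate the per-trial success rate from a list of booleans."""
    return sum(trial_results) / len(trial_results)


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    return p ** k


# An agent that succeeds on 3 of 4 trials, evaluated at k = 3.
p = success_rate([True, True, True, False])
capability = pass_at_k(p, 3)    # close to 1: it can usually get there
reliability = pass_hat_k(p, 3)  # well below 0.5: it often fails at least once
```

As k grows, pass@k climbs toward 1 while pass^k falls toward 0, which is why the same agent can look excellent on one metric and unreliable on the other.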
For CI/CD: run fast deterministic evals on every commit (25 cases, sub-minute feedback), and comprehensive evals including LLM judges on PRs or nightly builds. In production, randomly sample traces (5%), run LLM judges on sampled traces to detect quality drift, and run the full error analysis workflow on fresh data every few weeks.
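One way to implement the production sampling is to hash the trace id instead of drawing a random number, so the same trace is always either in or out of the sample. That is a design choice of mine for reproducibility, not something the sources prescribe; should_sample and the trace id format are hypothetical.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of all traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


# Route only the sampled traces to the (more expensive) LLM judge.
sampled = [tid for tid in (f"trace-{i}" for i in range(1000)) if should_sample(tid)]
```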
Observability vs. Evaluation: These are complementary. Observability answers “what happened?” (traces, latency, errors). Evaluation answers “was it good?” (quality, correctness, safety). Observability provides the data; evaluation provides the judgment.
A note on pairwise evaluation. Everything above uses absolute evaluation (comparing agent output against a reference). For A/B testing model versions or prompt variations, pairwise (relative) evaluation is often more reliable: show two outputs side by side and ask “which is better?” Both humans and LLM judges find relative comparison easier than assigning absolute scores. Most practitioners I read about use absolute reference-based evaluation day-to-day and pairwise comparison only for A/B tests.
Frameworks and Tooling
For reference, here is a comparison of frameworks relevant to building this:
Eval Frameworks:
| Framework | Approach | Best for |
|---|---|---|
| pydantic-evals | Code-first Python, Dataset/Case/Evaluator, OpenTelemetry spans | Python-native agent eval with trajectory analysis |
| Inspect AI | Task = Dataset + Solver + Scorer, sandbox environments | Safety evaluation, standardized benchmarks |
| OpenAI Evals Platform | Hosted infrastructure, trace grading | OpenAI ecosystem, rapid iteration |
| Promptfoo | YAML-config-driven, CI/CD native, red-team testing | CI/CD integration, security testing |
| Braintrust | Experiment tracking, judge calibration | Experiment-driven development, A/B testing |
Observability and Monitoring:
| Framework | Approach | Best for |
|---|---|---|
| Arize Phoenix | Production monitoring, online evaluations | OSS, self-hosted, observability + evaluation |
| Pydantic Logfire | OpenTelemetry-native, pydantic-ai integration | Python/pydantic-ai ecosystem |
| Langfuse | Open-source LLM observability, tracing | Self-hosted, multi-framework support |
| LangSmith | Chain debugging, annotation queues | LangChain ecosystem |
From what I gathered, pydantic-evals and Promptfoo are good starting points for eval, depending on whether you prefer code-first or config-first. For observability, Langfuse and Arize stand out as open-source options with broad framework support.
Key Takeaways
These are the principles that came up most consistently across the sources I read. They are not rules I have battle-tested myself (yet), but the reasoning behind each one is compelling:
Observability, deterministic checks, then evals. In that order. You cannot evaluate what you cannot observe, and deterministic graders (your “unit tests for agents”) should be written alongside the agent from day one.
Start early. 20-50 failure-focused tasks is enough. Whatever you are already testing by hand is your first dataset.
Grade outputs first, then trajectories. There are often multiple valid paths to the right answer.
Read transcripts regularly. Scores compress away the details. You will find problems that no automated grader catches.
Binary pass/fail over Likert scales. 1-5 ratings are often a sign of a bad eval process.
Build evals for errors you discover, not errors you imagine.
One domain expert matters more than a committee.
You can never stop looking at data. There is no eval setup that lets you stop reviewing traces.
A 0% pass rate usually means your eval is broken, not that your agent is incapable.
Personal Notes
Before this deep dive, I thought evals were a “testing phase” you do after building the agent. That framing is backwards. Error analysis and evals are the development methodology. The agent, the judge, and the datasets all co-evolve through the same loop. This reframing was the single most valuable thing I took away.
The insight that judging is fundamentally easier than generating made the whole “using LLMs to evaluate LLMs” approach feel less circular than it sounds. A model that hallucinates freely during generation can still reliably answer “does this output match these criteria?” That distinction makes the whole approach viable.
Hamel Husain’s point about the judge being a “hack” resonated with me. The process of building the judge (examining outputs, writing rubrics, calibrating against human labels) forces you to deeply understand your data. The actual automated judge is almost a side effect. I think this is the most important insight in the whole space.
The error analysis triage step (agent problem vs. judge problem vs. dataset gap) seems like where the real judgment happens. Easy to describe in theory. I suspect it requires significant experience to do well in practice.
I went into this research because my own agent prototypes hit a reliability wall and I had no systematic way to tell whether changes were helping. I was eyeballing outputs and going on vibes. Everything I read confirmed that this is the default state for most people building with LLMs, and that the path forward is not more clever prompting but better measurement. I have not yet applied this full framework to my own agents, but the mental model has already changed how I think about building them.
Terminology Quick Reference
Different frameworks use different names for the same concepts. This tripped me up during research:
| Concept | Anthropic | pydantic-evals | OpenAI |
|---|---|---|---|
| Single test scenario | Task | Case | Test case |
| Collection of tasks | Dataset / Eval Suite | Dataset | Eval |
| One execution attempt | Trial | single run of evaluate | Run |
| Scoring logic | Grader | Evaluator | Grader |
| Execution record | Transcript / Trace | span_tree | Trace |
| Known correct answer | Reference Solution | expected_output | Expected |
| Run infrastructure | Eval Harness | dataset.evaluate_sync() | Evals Platform |
Key differences: Anthropic emphasizes running the same task multiple times (pass@k) and trial isolation. Pydantic-evals runs each case once by default and provides span_tree via OpenTelemetry for trajectory access. Isolation depends on your implementation in most frameworks (Inspect AI being the exception with built-in sandboxes).
References
Primary Sources
- Anthropic: Demystifying Evals for AI Agents - Flywheel concept, grader hierarchy, error analysis methodology
- Anthropic: Claude Testing & Evaluation Docs - Official eval framework, LLM-based grading terminology
- Hamel Husain: Your AI Product Needs Evals - Practitioner’s manifesto; “look at your data” philosophy
- Hamel Husain: LLM-as-a-Judge Complete Guide - Step-by-step judge building, TPR/TNR measurement
- Hamel Husain: LLM Evals FAQ - Same model for judge and agent? Likert vs binary
- Hamel Husain: A Field Guide to AI Agents - Error analysis as the central development loop
- Shreya Shankar & Hamel Husain: Evals for AI Engineers - Book (early release, O’Reilly), comprehensive error analysis methodology
- Lenny’s Newsletter: Evals, Error Analysis, and Better Prompts - Interview with Hamel on the error analysis cycle
Academic Papers
- Zheng et al.: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023) - Foundational LLM-as-Judge paper, GPT-4 >80% human agreement