nanoevals: What goes into LLM evals
[ llm evals ai-agents python ] · 7 min read

In my previous article, I tried to compile the theories around LLM and agent evaluations. It gave me a good understanding of the process, but there were still a few open questions and concerns around what it looks like in practice.

So as a follow-up, I wanted to build a minimal reference implementation of an end-to-end eval process to understand the core parts of evals in a bit more depth. nanoevals is my attempt at that.

It’s small and minimal yet covers (almost) all the moving parts in under ~700 lines of Python code (the core logic is ~300 lines; the rest is CLI and Streamlit app wiring).

This article is about the implementation, design choices, what I left out and why.

What purpose it serves

Well, the primary purpose it serves is to provide clarity on what exactly goes into the evaluation process and how all the individual parts fit together. Stripping everything down to the essentials gives a clear lens into what’s actually load-bearing and what’s just implementation detail.

Constraints

Before I started, I set a few constraints for myself:

Initially my plan was to stay under 500 lines. I failed, but with good reasons (see the Entrypoints section below).

In terms of external dependencies, I only used pydantic and pyyaml in the core logic. I think that’s fair: pydantic gives a few QoL improvements for managing types, and pyyaml makes it easier to handle the YAML files used for the golden datasets. Without it, I would have had to hand-roll file loading and parsing, which I didn’t want to do.
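
To give a sense of what pydantic buys here, this is a minimal sketch of the kind of models such a core might define. The field names are my own assumptions for illustration, not nanoevals’ actual schema:

from pydantic import BaseModel

class AgentTestCase(BaseModel):
    # Hypothetical fields; the real dataset schema may differ.
    id: str
    input: str
    expected_output: str | None = None

class EvalResult(BaseModel):
    # One metric's verdict for one test case.
    metric: str
    passed: bool
    details: str = ""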

One more thing that I didn’t plan initially but later adopted was a Streamlit app as an additional entrypoint alongside the CLI. As much as I like CLIs, I think the Streamlit app serves a critical need: dataset management and report overview. More on this in the Entrypoints section.

Architecture

If you recall the previous article, there are three layers of evaluation:

nanoevals covers all three. The architecture is conceptually very simple:

nanoevals architecture overview
Figure 1: Architecture Overview

Design choices

Entrypoints

nanoevals provides two entrypoints: a CLI and a Streamlit app. Both call the runner and present the results as a report.

nanoevals entrypoints
Figure 2: Entrypoints

My initial thoughts and implementation were CLI-only. It serves the purpose and is easy to run in CI. But a critical part of the eval process is dataset management, especially by non-technical SMEs/stakeholders. Streamlit shines here: it provides a visual interface to view reports and manage datasets. I think it’s pretty cool.

nanoevals streamlit dataset editor
Figure 3: Streamlit dataset editor
nanoevals streamlit reports
Figure 4: Streamlit reports

This comes with a cost, though. The Streamlit app is the heaviest part of the codebase at ~190 LoC. Without it, I would have stayed within my initial constraint of ~500 LoC. But I think it provides enough value to justify breaking the constraint.

That being said, the Streamlit implementation is very rough. But again, it’s for illustration; it’s not meant to be production ready.
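
For flavor, here is a tiny, hypothetical sketch of what a report view in Streamlit can look like. This is not the actual nanoevals app (which also handles dataset editing), and the data is made up; it just illustrates how little code a basic report page takes:

import pandas as pd
import streamlit as st

# Made-up results; in a real app these would come from the eval runner.
results = pd.DataFrame(
    [
        {"case": "refund-policy", "metric": "exact_match", "passed": True},
        {"case": "greeting", "metric": "llm_judge", "passed": False},
    ]
)

st.title("Eval report")
st.metric("Pass rate", f"{results['passed'].mean():.0%}")
st.dataframe(results)  # sortable, filterable table out of the box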

Judge calibration

The CLI provides a judge calibration endpoint. It computes TPR and TNR (true positive and true negative rates) and provides an overview of the judge’s efficacy. To be fair, this was an afterthought and could seriously be improved.

Currently, the whole thing revolves around the agent’s evaluation and not so much around judge calibration. But the implementation does have the necessary data structures for judge datasets, and I think it’s easily extensible.
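
For reference, that calibration boils down to comparing the judge’s verdicts against human-labeled ground truth. A minimal sketch, assuming boolean pass/fail verdicts (this is not the actual nanoevals implementation):

def calibrate_judge(judge_verdicts: list[bool], human_labels: list[bool]) -> dict[str, float]:
    # True positives: judge says pass where the human label is pass.
    tp = sum(1 for j, h in zip(judge_verdicts, human_labels) if j and h)
    # True negatives: judge says fail where the human label is fail.
    tn = sum(1 for j, h in zip(judge_verdicts, human_labels) if not j and not h)
    positives = sum(human_labels)
    negatives = len(human_labels) - positives
    return {
        "tpr": tp / positives if positives else 0.0,
        "tnr": tn / negatives if negatives else 0.0,
    }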

Contract Signatures

Provide your agent, judge, and custom metrics as functions:

def my_agent(input: str) -> Trace:
    ...

def my_judge(trace: Trace, test_case: AgentTestCase) -> list[EvalResult]:
    ...

def my_metric(trace: Trace, test_case: AgentTestCase) -> EvalResult:
    ...

dataset = load_agent_dataset("my_tests.yaml")
report = run_eval(
    dataset,
    agent_fn=my_agent,
    judge_fn=my_judge,
    extra_metrics=[my_metric],
)
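
To make the dataset side concrete, here is what a YAML golden dataset could look like and how pyyaml parses it. The schema below is my own guess for illustration; nanoevals’ actual format may differ:

import yaml

# Hypothetical contents of a file like my_tests.yaml.
example = """
cases:
  - id: refund-policy
    input: "What is the refund window for annual plans?"
    expected_output: "30 days"
  - id: greeting
    input: "Hello!"
    expected_output: "A short, friendly greeting"
"""

dataset = yaml.safe_load(example)
print(dataset["cases"][0]["input"])  # -> What is the refund window for annual plans?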

Async agents are supported transparently — just pass an async def agent and run_eval handles it automatically, running test cases concurrently with asyncio.gather:

async def my_agent(input: str) -> Trace:
    result = await call_llm(input)
    return Trace(output=result, ...)
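
The dispatch itself can be as simple as checking whether the agent is a coroutine function. This is a rough sketch of the pattern described above, not the actual run_eval internals:

import asyncio
import inspect

def run_agent_over_inputs(agent_fn, inputs: list[str]):
    # Async agents: fan out all calls concurrently with asyncio.gather.
    if inspect.iscoroutinefunction(agent_fn):
        async def _run_all():
            return await asyncio.gather(*(agent_fn(i) for i in inputs))
        return asyncio.run(_run_all())
    # Sync agents: plain sequential loop.
    return [agent_fn(i) for i in inputs]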

How to read the codebase

Core modules:

Supporting modules:

Limitations

There are some serious limitations, but they are deliberate omissions. Given that the goal is to optimize for clarity over completeness, I think it’s an acceptable tradeoff.

So in short, please don’t use it in production :P It’s not meant to be a library you adopt; rather, it’s something you try out and read the code of to understand the moving parts.

Personal Notes

Personally, I learned a lot from this exercise. A lot of the open questions and concerns from the previous article are now clearer.

I mentioned in the previous article that structured output should help with evaluation. This exercise bears that out: if the agent produces structured output, you can easily incorporate it into the dataset and check each field instead of fuzzy matching. (I did include a fuzzy reference-matching metric, which proves the point by negation: prose is harder to test against.)
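
To illustrate the difference with made-up helpers (not nanoevals’ actual metrics): checking structured output reduces to per-field equality, while comparing prose drags in a similarity measure and an arbitrary threshold:

from difflib import SequenceMatcher

def fieldwise_check(actual: dict, expected: dict) -> dict[str, bool]:
    # Structured output: compare each expected field directly.
    return {key: actual.get(key) == value for key, value in expected.items()}

def fuzzy_match(actual: str, expected: str, threshold: float = 0.8) -> bool:
    # Prose output: needs a similarity heuristic plus a threshold you have to pick.
    return SequenceMatcher(None, actual.lower(), expected.lower()).ratio() >= threshold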

The question around running evals in CI is much clearer, but running them cheaply remains open, since calling the agent/LLM is still a requirement. I need to figure this out.

The big revelation was of course dataset management with YAML. I think it’s pretty neat.

Obviously there is more to learn. But I am making good progress :D