The rise of AI evals
As AI products become mainstream, one skill is emerging as critical for product teams: writing AI evals (evaluations). Aman Khan, Director of Product at Arize AI, believes evals will be the most important skill for product teams moving forward.
What are AI evaluations?
Think of evals as the new version of A/B testing for AI systems. Traditional software runs on deterministic logic (“if this, then that”), while AI systems are statistical and probabilistic. That shift means we need new ways to measure whether our AI products are actually working.
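As a toy illustration (not from the interview, and using stand-in functions), the difference looks something like this: deterministic code can be checked with a single assert, while an AI system is better judged by a pass rate over many sampled responses.

```python
import random

# Deterministic software: one input maps to one expected output, so a single assert is enough.
assert sorted([3, 1, 2]) == [1, 2, 3]

# Probabilistic AI: the same prompt can produce different outputs,
# so we measure a pass rate over many samples instead.
def ask_model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return random.choice(["You can return it within 30 days.", "Sorry, no returns accepted."])

def is_acceptable(reply: str) -> bool:
    # Stand-in for a real eval check.
    return "30 days" in reply

samples = [ask_model("I want to return my laptop") for _ in range(100)]
pass_rate = sum(is_acceptable(s) for s in samples) / len(samples)
print(f"Pass rate: {pass_rate:.0%}")
```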
“A product is a box in the middle with an input and output. The input could be a customer trying to solve a problem, and the output is the solution. Evals are how you measure how good or bad that box in the middle — the product — actually is.”
In practice, evals test things like:
Correctness: is the AI following the rules?
Hallucination detection: is it making things up?
Tone and sentiment: does the response match your brand voice?
Tool usage: is it using the right APIs or functions?
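As a rough sketch, you could imagine those dimensions as a set of judge criteria, each of which gets dropped into an evaluation prompt. The wording and keys here are hypothetical, not a standard schema:

```python
# Hypothetical judge criteria, one per eval dimension.
EVAL_CRITERIA = {
    "correctness": "Does the answer follow the rules and policies it was given?",
    "hallucination": "Does the answer state anything not supported by the provided context?",
    "tone": "Is the answer polite and consistent with the brand voice?",
    "tool_usage": "Did the agent call the right APIs or functions before answering?",
}
```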
Why evals matter now
Traditional software was predictable. You could check if an API was called correctly or if code executed as expected. AI systems, on the other hand, introduce variability and subjectivity that make evaluation far more complex.
Take a customer support AI agent as an example. If a customer says, “I want to return my laptop,” the AI needs to:
Check if the return is within the window
Verify product condition requirements
Follow company policies
Keep the right tone
Any of these steps could go wrong in subtle ways that normal testing won’t catch.
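To make that concrete, here’s a minimal sketch of a unit-test-style eval for just the return-window step. The 30-day policy, the keyword check, and the function names are assumptions for illustration, not how any particular team does it.

```python
from datetime import date

RETURN_WINDOW_DAYS = 30  # assumed policy for this illustration

def eval_return_window(agent_reply: str, purchase_date: date, today: date) -> bool:
    """Pass if the agent's accept/decline decision matches the return policy."""
    within_window = (today - purchase_date).days <= RETURN_WINDOW_DAYS
    offered_return = "process your return" in agent_reply.lower()  # crude keyword check
    return offered_return == within_window

# A laptop bought 45 days ago is outside the window, so the eval passes
# only if the agent did NOT offer to process the return.
passed = eval_return_window(
    agent_reply="I'm sorry, that laptop is outside our 30-day return window.",
    purchase_date=date(2025, 1, 1),
    today=date(2025, 2, 15),
)
print(passed)  # True: the agent correctly declined
```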
The challenge: why evals are hard to write
Good evals force product teams to empathize with users at a much deeper level than traditional product work.
“Evals actually force you to get into the shoes of your user. You can no longer just hypothesize ‘I wonder if they’ll do this’ — you have to articulate your understanding of the user into text.”
That’s difficult because of a few factors:
High subjectivity: what counts as a “good” AI response depends on context
Non-determinism: the same input might give different outputs
Complex edge cases: users interact with AI in unpredictable ways
Modeling human behavior: you’re essentially testing human-like responses
How to write effective evals
Start with real data
Use actual user interactions and human-labeled responses, like past support conversations, company policies, and examples of good and bad answers.
Break down your system
Don’t test the whole system at once. Treat each piece like a unit test: retrieval, reasoning, output generation, and tool usage.
Create judge LLMs
Use AI to evaluate AI. Write prompts that check correctness with constrained answers like “CORRECT” or “INCORRECT” instead of long explanations.
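Here is a minimal judge-LLM sketch. The prompt wording, the gpt-4o-mini model choice, and the OpenAI client are assumptions; any model and client would work.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any LLM client would do

JUDGE_PROMPT = """You are grading a customer support answer.
Policy: {policy}
Question: {question}
Answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_correctness(policy: str, question: str, answer: str) -> str:
    """Ask a judge LLM for a constrained CORRECT/INCORRECT verdict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        temperature=0,        # keep the judge as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(policy=policy, question=question, answer=answer),
        }],
    )
    return response.choices[0].message.content.strip()

verdict = judge_correctness(
    policy="Returns accepted within 30 days of purchase.",
    question="I want to return my laptop",
    answer="Sure, since it's within 30 days I can process that return.",
)
```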
Add examples to your prompts
Show both correct and incorrect answers to improve the accuracy of your judge models.
Generate synthetic data
Once you have real examples, use AI to create variations and edge cases to expand your test set.
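One possible sketch, reusing the same assumed OpenAI client as above: prompt an LLM to turn one real example into paraphrases and edge cases.

```python
from openai import OpenAI

client = OpenAI()  # same assumed client as in the judge sketch

def generate_variations(real_example: str, n: int = 5) -> list[str]:
    """Ask an LLM to expand one real user message into paraphrases and edge cases."""
    prompt = (
        f"Here is a real customer message: {real_example!r}\n"
        f"Write {n} realistic variations, including tricky edge cases, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        temperature=0.9,      # higher temperature for more varied outputs
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

synthetic_cases = generate_variations("I want to return my laptop")
```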
Iterate based on performance
Compare eval results with real-world A/B tests and refine your criteria over time.
Practical tips to get started
Start simple: check for correctness before worrying about cost or speed
Use existing tools: open-source options like Arize’s Phoenix can save you from building infra yourself
Keep it fast: use multi-threading and efficient APIs so evals don’t slow you down (see the sketch after this list)
Constrain outputs: make sure judge LLMs return structured answers, not essays
Work in loops: treat evaluation as an iterative process that improves prompts, retrieval, and models over time
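For the “keep it fast” tip, one simple pattern is to fan judge calls out across threads, since they are I/O-bound. The run_evals helper below is a hypothetical sketch, not a prescribed setup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def run_evals(eval_fn: Callable[[dict], str], examples: Iterable[dict],
              max_workers: int = 8) -> list[str]:
    """Run eval calls in parallel threads; judge-LLM calls are I/O-bound, so threads help a lot."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(eval_fn, examples))

# Usage sketch: test_set is a list of dicts with "policy", "question", and "answer"
# keys (hypothetical shape), and judge_correctness is the earlier judge sketch.
# verdicts = run_evals(lambda ex: judge_correctness(**ex), test_set)
# accuracy = verdicts.count("CORRECT") / len(verdicts)
```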
“You can’t just ship with AI — you have to be able to measure what your system is doing,” Khan says. The old way of writing PRDs and handing them off is fading. In AI, product managers need tight iteration loops, constantly prototyping and evaluating.
Getting started today
The good news is you don’t need a PhD in machine learning to write evals.
“Try the tools yourself, join the right communities and be around people building AI products.”
Pick a simple use case in your product. Write basic evals for correctness and tone. Use the feedback to improve your system. And remember, AI adoption is still early, and there’s plenty of room to shape how this practice evolves.
As AI becomes as common as databases in software, the product teams who master evaluation will be the ones building products that truly work for users.