
An Introduction to Evals

Evaluations test model and agent outputs to ensure they meet the standards and requirements you specify.
Last updated October 7, 2025

Evaluations (or "evals") are systematic tests that measure how well AI models perform at specific tasks. Like traditional unit, integration, and end-to-end tests, evals ensure your code remains reliable and stable. However, they differ in one substantial way: the underlying system being tested is non-deterministic, because outputs can vary slightly or significantly between runs. Evals are designed specifically to test systems robustly when outputs aren't perfectly consistent.

This guide will walk you through the fundamentals of building and implementing effective evaluation frameworks for AI applications.

Evals complement your existing test suite.

For many AI apps, developers run a few examples and check whether the outputs "feel right" before shipping. It's all about the "vibes." Unfortunately, this vibe-based approach doesn't scale. When deploying AI models in production, you need confidence that they'll perform consistently and safely. Evals help you:

Compare models and prompts. Should you use GPT-5 or Claude? Will adding examples to your prompt improve accuracy? Without evals, these decisions rely on guesswork and anecdotal evidence. With a solid evaluation framework, you can run the same test suite across different configurations and make data-driven decisions about which approach actually performs better for your specific use case.

Catch regressions early. AI systems are constantly evolving. You might update your prompt, switch models, or modify your system architecture. Each change risks breaking functionality that previously worked. Evals act as a safety net, automatically flagging when changes degrade performance on known test cases, preventing you from shipping regressions to production.

Identify edge cases. Real-world usage always surfaces scenarios you didn't anticipate. A comprehensive eval suite helps you discover where your system breaks down, whether it's ambiguous inputs, unusual formatting, or edge cases in your domain. Once identified, these failures become test cases, ensuring you don't regress as you improve the system.

Every eval system consists of three main building blocks: the dataset, the runner, and the scorer.

The dataset is your collection of test cases. This includes the inputs and expected outputs that define what success looks like. A well-constructed dataset is the foundation of an effective eval system.

Your dataset should mirror real-world usage. Start with actual production examples, edge cases you've discovered, and failure modes you want to prevent. Quality beats quantity: 20 well-chosen test cases that cover your core use cases are more valuable than 200 random examples.

The runner (also called a "harness" or "executor") is the orchestration layer that executes your test cases. It feeds inputs to your AI system, collects outputs, and manages the evaluation workflow.

A good runner is model-agnostic, meaning you can swap out GPT-4 for Claude or change your prompt without rewriting your entire eval system. This flexibility is key for comparing different approaches.

The scorer (or "grader") evaluates how well the actual outputs match expected results. This is where evals diverge most significantly from traditional testing. Instead of exact matches, you need methods that account for the variability in AI outputs.

Common approaches include exact matching (for structured outputs like JSON), semantic similarity (checking if outputs mean the same thing even with different wording), LLM-as-judge (using another AI to evaluate quality), and custom metrics tailored to your domain.
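
As an illustration of the LLM-as-judge approach, here's a minimal sketch that reuses the generateObject pattern you'll see in the walkthrough below. The gradeResponse function, its GradeSchema, and the rubric prompt are illustrative assumptions, not part of the AI SDK itself.

import { generateObject } from "ai";
import { z } from "zod";

// Hypothetical grading schema: the judge model returns a 0–1 score plus a short rationale.
const GradeSchema = z.object({
  score: z.number().min(0).max(1),
  rationale: z.string()
});

interface GradeResponseInput {
  model: string;    // judge model identifier, e.g. "openai/gpt-5"
  input: string;    // the original task input
  expected: string; // a reference answer or rubric
  actual: string;   // the output being evaluated
}

// LLM-as-judge scorer: asks another model to compare the actual output
// against the reference and return a graded score instead of an exact match.
export async function gradeResponse({ model, input, expected, actual }: GradeResponseInput) {
  const { object } = await generateObject({
    model,
    schema: GradeSchema,
    prompt: `You are grading an AI system's answer.
Task input: ${input}
Reference answer: ${expected}
Candidate answer: ${actual}
Return a JSON object with "score" (0.0–1.0, where 1.0 means the candidate fully matches the reference in meaning) and a brief "rationale".`
  });
  return object;
}

A score threshold (for example, treating anything above 0.8 as a pass) then turns this graded judgment into the same pass/fail signal that exact matching produces.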

Given an AI agent that classifies customer support emails as “billing,” “technical,” or “general,” let’s walk through building an eval system.

We start with a dataset that defines what good performance looks like. Each test case includes an input, expected output, and metadata.

const dataset = [
  {
    input: "I was charged twice for my subscription this month.",
    expected: "billing",
    difficulty: "easy"
  },
  {
    input: "The app crashes when I export to CSV.",
    expected: "technical",
    difficulty: "easy"
  },
  {
    input: "Love your product! Thanks for the great service.",
    expected: "general",
    difficulty: "easy"
  },
  {
    input: "Payment failed but I see a pending charge. Is this a bug?",
    expected: "billing",
    difficulty: "hard"
  }
];

Notice the mix of straightforward cases and edge cases. The hard example mentions both payment issues and potential bugs, exactly the kind of ambiguity you'll encounter in production.

With the dataset in place, the runner executes your test cases and collects results. Here's the core logic:

import { generateObject } from "ai";
import { z } from "zod";

const ClassificationSchema = z.object({
  category: z.enum(["billing", "technical", "general"]),
  confidence: z.number().min(0).max(1)
});

interface ClassifyEmailInput {
  model: string;
  email: string;
}

export async function classifyEmail({ model, email }: ClassifyEmailInput) {
  const { object } = await generateObject({
    model,
    schema: ClassificationSchema,
    prompt: `Classify the following customer support email into one of three categories: "billing", "technical", or "general".
Return a JSON object with both "category" and "confidence" (0.0–1.0).
Email:
${email}
`
  });
  return object;
}

async function runEval(modelId: string) {
  const results = [];
  for (const testCase of dataset) {
    const { category, confidence } = await classifyEmail({
      model: modelId,
      email: testCase.input
    });
    results.push({
      input: testCase.input,
      expected: testCase.expected,
      actual: category,
      confidence: confidence,
      passed: category === testCase.expected
    });
  }
  return results;
}

In the code snippet above, we leverage the AI SDK to keep the runner model-agnostic, letting you swap out modelId to test GPT-5, Claude, or any other model without changing your evaluation logic.

Putting it all together, you can now compare different models systematically using the scorer:

interface EvalResult {
  input: string;
  expected: string;
  actual: string;
  confidence: number;
  passed: boolean;
}

interface Metrics {
  accuracy: number;
}

export function calculateMetrics(results: EvalResult[]): Metrics {
  const total = results.length;
  const passed = results.filter(r => r.passed).length;
  const accuracy = total > 0 ? passed / total : 0;
  return { accuracy };
}

const gpt5Results = await runEval('openai/gpt-5');
const gpt5Metrics = calculateMetrics(gpt5Results);

const claudeResults = await runEval('anthropic/claude-sonnet-4.5');
const claudeMetrics = calculateMetrics(claudeResults);

console.log('GPT-5 Accuracy:', gpt5Metrics.accuracy);
console.log('Claude Accuracy:', claudeMetrics.accuracy);

In this example, the calculateMetrics function leverages exact matching to grade the output from the runner.
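
Exact matching is only one option. Because each test case in the dataset carries a difficulty label, you could also break accuracy down by difficulty to see where a model struggles. The sketch below assumes the runner is extended so that each result carries that difficulty field; the accuracyByDifficulty helper is illustrative, not part of the AI SDK.

// Assumes each result also carries the originating test case's difficulty label.
interface LabeledResult extends EvalResult {
  difficulty: string;
}

// Groups results by difficulty and computes accuracy per group.
export function accuracyByDifficulty(results: LabeledResult[]): Record<string, number> {
  const groups: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    const group = (groups[r.difficulty] ??= { passed: 0, total: 0 });
    group.total += 1;
    if (r.passed) group.passed += 1;
  }
  return Object.fromEntries(
    Object.entries(groups).map(([difficulty, { passed, total }]) => [difficulty, passed / total])
  );
}

A breakdown like { easy: 1.0, hard: 0.5 } is often more actionable than a single aggregate number, since it tells you which kinds of inputs need more test cases and prompt work.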

With this foundation, you can systematically improve your AI system—measuring every change, catching regressions early, and making decisions based on evidence rather than intuition.
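
If you run evals in CI, one way to catch regressions automatically is to gate the pipeline on a minimum accuracy. The threshold value and the assertNoRegression helper below are illustrative assumptions, not an established convention.

// Hypothetical CI gate: fail the job if accuracy drops below a chosen baseline.
const ACCURACY_THRESHOLD = 0.9; // assumed baseline for this example

function assertNoRegression(accuracy: number, threshold = ACCURACY_THRESHOLD) {
  if (accuracy < threshold) {
    console.error(`Eval accuracy ${accuracy.toFixed(2)} fell below threshold ${threshold}`);
    process.exit(1); // non-zero exit marks the CI job as failed
  }
}

const ciResults = await runEval("openai/gpt-5");
assertNoRegression(calculateMetrics(ciResults).accuracy);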

  • Use the AI SDK to write eval code once and test across multiple models by simply changing the model identifier
  • Leverage Vercel AI Gateway for automatic caching, built-in observability, and provider fallbacks when running eval suites
  • Implement monitoring with tools like Braintrust with automatic scoring and experiment tracking
  • Learn from Eval-Driven Development for real-world patterns on integrating evals into your development process