How to build a durable AI code agent on Vercel

Build an AI agent that generates code, writes its own tests, and executes them in an isolated microVM with automatic retries.

Ben Akehurst
10 min read
Last updated May 13, 2026

AI agents that generate and run code need three things: durable orchestration that survives failures, isolated execution that protects your infrastructure, and observable model access you can swap without rewriting code. Vercel Workflows, Sandbox, and AI Gateway solve each of these.

This guide covers how to combine all three in a single Next.js app that generates code from a natural language prompt, writes its own tests, runs them in an isolated microVM, and retries automatically when tests fail.

In this guide, you'll learn how to:

  • Define a durable, multi-step workflow using the "use workflow" and "use step" compiler directives
  • Execute AI-generated code safely in a Sandbox microVM
  • Route LLM calls through AI Gateway with a single API key
  • Build a self-healing agentic loop that retries on failure
  • Observe every step, LLM call, and execution result in the Vercel Dashboard

A user submits a coding task in plain English. A Workflow orchestrates the entire process:

  • AI Gateway routes a request to Claude to generate code
  • A second AI Gateway call generates test cases
  • Sandbox spins up an isolated microVM to run the code and tests

If the tests fail, the Workflow loops back to the generation step with the error context, and the cycle repeats up to three times. Every step is durable, observable, and automatic.

flow diagram
User prompt
POST /api/evaluate → start(evaluateCode, [prompt])
evaluateCode() ← "use workflow"
├── generateCode() ← "use step" + AI Gateway
├── generateTests() ← "use step" + AI Gateway
├── executeInSandbox() ← "use step" + Sandbox
└── Tests pass?
    ├── Yes → return results
    └── No → loop back with error context (max 3 attempts)
Results returned

The three primitives each handle a distinct responsibility:

  • Workflows manages orchestration, retries, and state. If a step crashes, it replays deterministically from the last completed step.
  • Sandbox handles execution. AI-generated code runs in an ephemeral Firecracker microVM with its own filesystem and network. Nothing touches your server.
  • AI Gateway handles model access. One API key, any model, with spend tracking and observability built in.

Before you begin, make sure you have:

  • A Vercel account
  • A Vercel project linked to your local repository. AI Gateway and Sandbox both authenticate automatically via OIDC tokens when deployed on Vercel. For local development, vercel env pull pulls the token into your .env.local.
  • Basic knowledge of TypeScript and Next.js
  • The Vercel CLI installed

Create a new Next.js project and install the dependencies:

terminal
npx create-next-app@latest ai-code-evaluator --typescript --tailwind --app
cd ai-code-evaluator
npm install workflow @vercel/sandbox ai @ai-sdk/gateway

Link your project to Vercel and pull environment variables. Sandbox uses Vercel OIDC tokens to authenticate, so this step is required for local development:

terminal
vercel link
vercel env pull

OIDC tokens expire every 12 hours during local development. If your token expires, run vercel env pull again. For longer local sessions, you can create an AI Gateway API key in the dashboard and add it to .env.local as AI_GATEWAY_API_KEY instead.
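
For example, after creating a key in the dashboard, your .env.local would gain a line like the following (the value shown here is a placeholder for your actual key):

.env.local
AI_GATEWAY_API_KEY=your-gateway-api-key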

Wrap your next.config.ts with withWorkflow() to enable the "use workflow" and "use step" compiler directives. Without this wrapper, they're string literals with no effect.

next.config.ts
import { withWorkflow } from "workflow/next";
import type { NextConfig } from "next";
const nextConfig: NextConfig = {};
export default withWorkflow(nextConfig);

Create workflows/evaluate-code.ts. This file is the orchestrator: it coordinates the steps, handles the retry loop, and passes error context between iterations.

workflows/evaluate-code.ts
import { generateText } from "ai";
import { gateway } from "@ai-sdk/gateway";
import { Sandbox } from "@vercel/sandbox";

// LLMs often wrap output in markdown code fences even when told not to.
// Strip them so the raw code can be written directly to the sandbox.
function stripCodeFences(text: string): string {
  return text
    .replace(/^```[\w]*\n?/gm, "")
    .replace(/```\s*$/gm, "")
    .trim();
}

export async function evaluateCode(prompt: string) {
  "use workflow";

  const maxIterations = 3;
  let iteration = 0;
  let lastError: string | null = null;

  while (iteration < maxIterations) {
    iteration++;

    const code = await generateCode(prompt, lastError);
    const tests = await generateTests(prompt, code);
    const result = await executeInSandbox(code, tests);

    if (result.passed) {
      return {
        success: true,
        code,
        tests,
        output: result.output,
        iterations: iteration,
      };
    }

    lastError = result.error;
  }

  return {
    success: false,
    error: `Failed after ${maxIterations} iterations. Last error: ${lastError}`,
    iterations: maxIterations,
  };
}

The "use workflow" directive marks this function as durable. If the process crashes or redeploys mid-run, it replays from the last completed step rather than starting over. The while loop creates the self-healing behavior: when tests fail, the next iteration includes the error message so the model can correct its approach.

Each step is an async function with the "use step" directive. Steps get automatic retries on failure, run as separate requests, and their inputs and outputs are recorded for observability.

This step sends the coding task to Claude via AI Gateway. On retry iterations, it includes the previous error so the model can adjust its approach.

workflows/evaluate-code.ts
async function generateCode(prompt: string, previousError: string | null) {
  "use step";

  const systemPrompt = `You are a code generator. Write a single TypeScript file that solves the given task. Export the main function as a named export. Only output the code, no explanation.${
    previousError
      ? `\n\nYour previous attempt failed with this error:\n${previousError}\nFix the issue.`
      : ""
  }`;

  const { text } = await generateText({
    model: gateway("anthropic/claude-sonnet-4.6"),
    system: systemPrompt,
    prompt,
  });

  return stripCodeFences(text);
}

The model string format is provider/model-name. Because you're going through AI Gateway, swapping to a different model (for example, openai/gpt-4o) is a one-line change with no new SDK or API key required.
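
For instance, pointing the generateCode step at GPT-4o instead of Claude only changes the string passed to gateway(). A minimal sketch, assuming openai/gpt-4o is available on your Gateway account:

const { text } = await generateText({
  // Only the Gateway model string changes; no new SDK or API key is needed
  model: gateway("openai/gpt-4o"),
  system: systemPrompt,
  prompt,
});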

A second LLM call generates test cases for the code. The tests use Node's built-in assert module to avoid installing a test framework in the Sandbox.

workflows/evaluate-code.ts
async function generateTests(prompt: string, code: string) {
  "use step";

  const { text } = await generateText({
    model: gateway("anthropic/claude-sonnet-4.6"),
    system: `You are a test writer. Given a coding task and its implementation, write a test file using Node.js built-in assert module (no external test framework). The test file should import from './solution.ts' and run assertions. Include at least 3 test cases covering normal cases and edge cases. End the file with console.log("ALL TESTS PASSED") if all assertions pass. Only output the code.`,
    prompt: `Task: ${prompt}\n\nImplementation:\n${code}`,
  });

  return stripCodeFences(text);
}
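
For reference, a generated test file for the palindrome task used later in this guide might look roughly like this. This is a hypothetical example of model output; the actual file will differ from run to run:

import assert from "node:assert";
import { isPalindrome } from "./solution.ts";

// Normal cases
assert.strictEqual(isPalindrome("racecar"), true);
assert.strictEqual(isPalindrome("hello"), false);

// Edge case: ignore case and non-alphanumeric characters
assert.strictEqual(isPalindrome("A man, a plan, a canal: Panama!"), true);

console.log("ALL TESTS PASSED");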

Sandbox spins up an isolated Firecracker microVM, writes the generated code and tests into it, and runs them. The Sandbox is completely isolated, so even if the generated code does something destructive, it can't affect your application.

workflows/evaluate-code.ts
async function executeInSandbox(code: string, tests: string) {
  "use step";

  const sandbox = await Sandbox.create({ runtime: "node24" });

  try {
    // Write the generated files into the sandbox
    await sandbox.writeFiles([
      { path: "solution.ts", content: Buffer.from(code) },
      { path: "test.ts", content: Buffer.from(tests) },
    ]);

    // Install tsx so we can run TypeScript directly
    await sandbox.runCommand("npm", ["install", "-g", "tsx"]);

    // Run the tests
    const result = await sandbox.runCommand("npx", ["tsx", "test.ts"]);
    const stdout = await result.stdout();
    const stderr = await result.stderr();

    if (result.exitCode === 0 && stdout.includes("ALL TESTS PASSED")) {
      return { passed: true, output: stdout, error: null };
    }

    return {
      passed: false,
      output: stdout,
      error: stderr || stdout || "Tests failed with no output",
    };
  } finally {
    await sandbox.stop();
  }
}

The finally block ensures sandbox.stop() is called even if the step throws. Each Sandbox is ephemeral, so there's nothing to persist after capturing the output.

The route handler triggers the workflow and returns immediately. The workflow runs asynchronously on Vercel's infrastructure, backed by Vercel Queues.

app/api/evaluate/route.ts
import { start } from "workflow/api";
import { evaluateCode } from "@/workflows/evaluate-code";
import { NextResponse } from "next/server";

export async function POST(request: Request) {
  const { prompt } = await request.json();

  await start(evaluateCode, [prompt]);

  return NextResponse.json({
    message: "Evaluation workflow started",
  });
}

The start() function enqueues the workflow and returns a handle. The workflow itself runs durably in the background.

Start the development server:

npm run dev

Try a straightforward prompt first:

curl -X POST \
-H "Content-Type: application/json" \
-d '{"prompt":"Write a function called isPalindrome that checks if a string is a valid palindrome, ignoring case and non-alphanumeric characters"}' \
http://localhost:3000/api/evaluate

You should see a response confirming the workflow started:

{
  "message": "Evaluation workflow started"
}

Back in the terminal where your dev server is running, you'll see a series of POST requests to /.well-known/workflow/v1/flow and /.well-known/workflow/v1/step. The /step requests are your individual step functions executing (LLM calls and sandbox execution), and the /flow requests are the orchestrator progressing the workflow state between steps. The /step requests that take 2-6 seconds are typically the LLM calls through AI Gateway; the shorter /flow requests are the orchestrator advancing to the next step.

These logs confirm the workflow is running, but they don't show you the result. To see the full run with inputs, outputs, and timing for each step, use the local workflow inspector covered in step 7.

Now try something harder:

curl -X POST \
-H "Content-Type: application/json" \
-d '{"prompt":"Write a function called longestCommonSubsequence that takes two strings and returns the length of their longest common subsequence using dynamic programming"}' \
http://localhost:3000/api/evaluate

This prompt is more likely to trigger the retry loop. The model might produce a subtly incorrect implementation on the first pass; the tests catch it, and the workflow loops back with the error context. You'll see more /step and /flow requests in your terminal as the workflow iterates. That self-healing cycle is the core value of combining durable orchestration with safe execution.

The Workflow SDK includes a local web UI for inspecting runs during development. Launch it with:

npx workflow inspect runs --web

This opens a browser-based inspector that reads from .next/workflow-data/ in your project directory. You can see each run, its steps, inputs, outputs, and timing without deploying anything.

You can also inspect Sandbox activity during local development. Open the Vercel Dashboard, select your project, and navigate to Sandboxes in the left sidebar. Sandbox invocations from localhost are visible here because they authenticate through Vercel's OIDC tokens.

Deploy the app to see the full observability picture. Workflow runs are only recorded in the Vercel Dashboard for deployed applications, not from localhost.

Once deployed, open your Vercel Dashboard, select your project, and check three areas:

Workflows (left sidebar → Workflows): Each run is listed with every step, its inputs and outputs, timing, and status. If a retry happened, you can trace what failed and what the model did differently on the next attempt. No logging code is required for this; it's automatic.

Sandboxes (left sidebar → Sandboxes): Every Sandbox invocation is logged with its creation time, runtime, and duration. You can verify that sandboxes are being created and stopped correctly for each execution step.

AI Gateway (left sidebar → AI Gateway): Every LLM call shows the model used, token count, latency, and cost. This helps you understand how much each evaluation costs and whether a faster or cheaper model would work for certain steps.

LLMs frequently wrap code in markdown fences (```typescript ... ```) even when instructed not to. If those fences get written to the Sandbox as-is, the runtime tries to parse them as code and fails with a syntax error. The stripCodeFences helper in step 3 handles this, but it's worth remembering whenever you write LLM output to a file system.

Use a finally block to call sandbox.stop(). Sandboxes have a default 5-minute timeout, but stopping them explicitly keeps costs down and avoids hitting concurrency limits.

By default, steps retry on any thrown error. If you want a step to fail permanently (for example, invalid user input), throw a FatalError from the workflow package instead.
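
A minimal sketch of that pattern, assuming FatalError is exported from the workflow package root and applied to the generateCode step shown earlier:

import { FatalError } from "workflow";

async function generateCode(prompt: string, previousError: string | null) {
  "use step";

  // An empty prompt will never succeed on retry, so fail the step permanently
  if (!prompt.trim()) {
    throw new FatalError("Prompt must not be empty");
  }

  // ...the rest of the step as shown earlier
}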

If every run installs the same dependencies, snapshot a Sandbox after the install step and create new instances from that snapshot. This saves setup time on every subsequent run.

AI Gateway makes it straightforward to assign different models to different steps. You might use a faster, cheaper model for test generation and a more capable one for code generation, each specified independently.
