How to build a durable AI code agent on Vercel

Build an AI agent that generates code, writes its own tests, and executes them in an isolated microVM with automatic retries.

Ben Akehurst
10 min read
Last updated May 13, 2026

AI agents that generate and run code need three things: durable orchestration that survives failures, isolated execution that protects your infrastructure, and observable model access you can swap without rewriting code. Vercel Workflows, Sandbox, and AI Gateway solve each of these.

This guide covers how to combine all three in a single Next.js app that generates code from a natural language prompt, writes its own tests, runs them in an isolated microVM, and retries automatically when tests fail.

In this guide, you'll learn how to:

  • Define a durable, multi-step workflow using the "use workflow" and "use step" compiler directives
  • Execute AI-generated code safely in a Sandbox microVM
  • Route LLM calls through AI Gateway with a single API key
  • Build a self-healing agentic loop that retries on failure
  • Observe every step, LLM call, and execution result in the Vercel Dashboard

A user submits a coding task in plain English. A Workflow orchestrates the entire process:

  • AI Gateway routes a request to Claude to generate code
  • A second AI Gateway call generates test cases
  • Sandbox spins up an isolated microVM to run the code and tests

If the tests fail, the Workflow loops back to the generation step with the error context, and the cycle repeats up to three times. Every step is durable, observable, and automatic.

flow diagram
User prompt
POST /api/evaluate → start(evaluateCode, [prompt])
evaluateCode() ← "use workflow"
├── generateCode() ← "use step" + AI Gateway
├── generateTests() ← "use step" + AI Gateway
├── executeInSandbox() ← "use step" + Sandbox
└── Tests pass?
    ├── Yes → return results
    └── No → loop back with error context (max 3 attempts)
Results returned

The three primitives each handle a distinct responsibility:

  • Workflows manages orchestration, retries, and state. If a step crashes, it replays deterministically from the last completed step.
  • Sandbox handles execution. AI-generated code runs in an ephemeral Firecracker microVM with its own filesystem and network. Nothing touches your server.
  • AI Gateway handles model access. One API key, any model, with spend tracking and observability built in.

Before you begin, make sure you have:

  • A Vercel account
  • A Vercel project linked to your local repository. AI Gateway and Sandbox both authenticate automatically via OIDC tokens when deployed on Vercel. For local development, vercel env pull pulls the token into your .env.local.
  • Basic knowledge of TypeScript and Next.js
  • The Vercel CLI installed

Create a new Next.js project and install the dependencies:

terminal
npx create-next-app@latest ai-code-evaluator --typescript --tailwind --app
cd ai-code-evaluator
npm install workflow @vercel/sandbox ai @ai-sdk/gateway

Link your project to Vercel and pull environment variables. Sandbox uses Vercel OIDC tokens to authenticate, so this step is required for local development:

terminal
vercel link
vercel env pull

OIDC tokens expire every 12 hours during local development. If your token expires, run vercel env pull again. For longer local sessions, you can create an AI Gateway API key in the dashboard and add it to .env.local as AI_GATEWAY_API_KEY instead.
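
For example, after creating a key in the dashboard, your .env.local would gain a line like the following (the value shown here is a placeholder for your actual key):

.env.local
AI_GATEWAY_API_KEY=your-gateway-api-key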

Wrap your next.config.ts with withWorkflow() to enable the "use workflow" and "use step" compiler directives. Without this wrapper, they're string literals with no effect.

next.config.ts
import { withWorkflow } from "workflow/next";
import type { NextConfig } from "next";
const nextConfig: NextConfig = {};
export default withWorkflow(nextConfig);

Create workflows/evaluate-code.ts. This file is the orchestrator: it coordinates the steps, handles the retry loop, and passes error context between iterations.

workflows/evaluate-code.ts
import { generateText } from "ai";
import { gateway } from "@ai-sdk/gateway";
import { Sandbox } from "@vercel/sandbox";

// LLMs often wrap output in markdown code fences even when told not to.
// Strip them so the raw code can be written directly to the sandbox.
function stripCodeFences(text: string): string {
  return text
    .replace(/^```[\w]*\n?/gm, "")
    .replace(/```\s*$/gm, "")
    .trim();
}

export async function evaluateCode(prompt: string) {
  "use workflow";

  const maxIterations = 3;
  let iteration = 0;
  let lastError: string | null = null;

  while (iteration < maxIterations) {
    iteration++;

    const code = await generateCode(prompt, lastError);
    const tests = await generateTests(prompt, code);
    const result = await executeInSandbox(code, tests);

    if (result.passed) {
      return {
        success: true,
        code,
        tests,
        output: result.output,
        iterations: iteration,
      };
    }

    lastError = result.error;
  }

  return {
    success: false,
    error: `Failed after ${maxIterations} iterations. Last error: ${lastError}`,
    iterations: maxIterations,
  };
}

The "use workflow" directive marks this function as durable. If the process crashes or redeploys mid-run, it replays from the last completed step rather than starting over. The while loop creates the self-healing behavior: when tests fail, the next iteration includes the error message so the model can correct its approach.

Each step is an async function with the "use step" directive. Steps get automatic retries on failure, run as separate requests, and their inputs and outputs are recorded for observability.

This step sends the coding task to Claude via AI Gateway. On retry iterations, it includes the previous error so the model can adjust its approach.

workflows/evaluate-code.ts
async function generateCode(prompt: string, previousError: string | null) {
  "use step";

  const systemPrompt = `You are a code generator. Write a single TypeScript file that solves the given task. Export the main function as a named export. Only output the code, no explanation.${
    previousError
      ? `\n\nYour previous attempt failed with this error:\n${previousError}\nFix the issue.`
      : ""
  }`;

  const { text } = await generateText({
    model: gateway("anthropic/claude-sonnet-4.6"),
    system: systemPrompt,
    prompt,
  });

  return stripCodeFences(text);
}

The model string format is provider/model-name. Because you're going through AI Gateway, swapping to a different model (for example, openai/gpt-4o) is a one-line change with no new SDK or API key required.
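
For instance, pointing the generateCode step at GPT-4o instead of Claude only changes the string passed to gateway(). A minimal sketch, assuming openai/gpt-4o is available on your Gateway account:

const { text } = await generateText({
  // Only the Gateway model string changes; no new SDK or API key is needed
  model: gateway("openai/gpt-4o"),
  system: systemPrompt,
  prompt,
});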

A second LLM call generates test cases for the code. The tests use Node's built-in assert module to avoid installing a test framework in the Sandbox.

workflows/evaluate-code.ts
async function generateTests(prompt: string, code: string) {
  "use step";

  const { text } = await generateText({
    model: gateway("anthropic/claude-sonnet-4.6"),
    system: `You are a test writer. Given a coding task and its implementation, write a test file using Node.js built-in assert module (no external test framework). The test file should import from './solution.ts' and run assertions. Include at least 3 test cases covering normal cases and edge cases. End the file with console.log("ALL TESTS PASSED") if all assertions pass. Only output the code.`,
    prompt: `Task: ${prompt}\n\nImplementation:\n${code}`,
  });

  return stripCodeFences(text);
}
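
For reference, a generated test file for the palindrome task used later in this guide might look roughly like this. This is a hypothetical example of model output; the actual file will differ from run to run:

import assert from "node:assert";
import { isPalindrome } from "./solution.ts";

// Normal cases
assert.strictEqual(isPalindrome("racecar"), true);
assert.strictEqual(isPalindrome("hello"), false);

// Edge case: ignore case and non-alphanumeric characters
assert.strictEqual(isPalindrome("A man, a plan, a canal: Panama!"), true);

console.log("ALL TESTS PASSED");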

Sandbox spins up an isolated Firecracker microVM, writes the generated code and tests into it, and runs them. The Sandbox is completely isolated, so even if the generated code does something destructive, it can't affect your application.

workflows/evaluate-code.ts
async function executeInSandbox(code: string, tests: string) {
  "use step";

  const sandbox = await Sandbox.create({ runtime: "node24" });

  try {
    // Write the generated files into the sandbox
    await sandbox.writeFiles([
      { path: "solution.ts", content: Buffer.from(code) },
      { path: "test.ts", content: Buffer.from(tests) },
    ]);

    // Install tsx so we can run TypeScript directly
    await sandbox.runCommand("npm", ["install", "-g", "tsx"]);

    // Run the tests
    const result = await sandbox.runCommand("npx", ["tsx", "test.ts"]);
    const stdout = await result.stdout();
    const stderr = await result.stderr();

    if (result.exitCode === 0 && stdout.includes("ALL TESTS PASSED")) {
      return { passed: true, output: stdout, error: null };
    }

    return {
      passed: false,
      output: stdout,
      error: stderr || stdout || "Tests failed with no output",
    };
  } finally {
    await sandbox.stop();
  }
}

The finally block ensures sandbox.stop() is called even if the step throws. Each Sandbox is ephemeral, so there's nothing to persist after capturing the output.

The route handler triggers the workflow and returns immediately. The workflow runs asynchronously on Vercel's infrastructure, backed by Vercel Queues.

app/api/evaluate/route.ts
import { start } from "workflow/api";
import { evaluateCode } from "@/workflows/evaluate-code";
import { NextResponse } from "next/server";

export async function POST(request: Request) {
  const { prompt } = await request.json();

  await start(evaluateCode, [prompt]);

  return NextResponse.json({
    message: "Evaluation workflow started",
  });
}

The start() function enqueues the workflow and returns a handle. The workflow itself runs durably in the background.

Start the development server:

npm run dev

Try a straightforward prompt first:

curl -X POST \
-H "Content-Type: application/json" \
-d '{"prompt":"Write a function called isPalindrome that checks if a string is a valid palindrome, ignoring case and non-alphanumeric characters"}' \
http://localhost:3000/api/evaluate

You should see a response confirming the workflow started:

{
  "message": "Evaluation workflow started"
}

Back in the terminal where your dev server is running, you'll see a series of POST requests to /.well-known/workflow/v1/flow and /.well-known/workflow/v1/step. The /step requests are your individual step functions executing (LLM calls and sandbox execution), and the /flow requests are the orchestrator progressing the workflow state between steps. The /step requests that take 2-6 seconds are typically the LLM calls through AI Gateway; the shorter /flow requests are the orchestrator advancing to the next step.

These logs confirm the workflow is running, but they don't show you the result. To see the full run with inputs, outputs, and timing for each step, use the local workflow inspector covered in step 7.

Now try something harder:

curl -X POST \
-H "Content-Type: application/json" \
-d '{"prompt":"Write a function called longestCommonSubsequence that takes two strings and returns the length of their longest common subsequence using dynamic programming"}' \
http://localhost:3000/api/evaluate

This prompt is more likely to trigger the retry loop. The model might produce a subtly incorrect implementation on the first pass; the tests catch it, and the workflow loops back with the error context. You'll see more /step and /flow requests in your terminal as the workflow iterates. That self-healing cycle is the core value of combining durable orchestration with safe execution.

The Workflow SDK includes a local web UI for inspecting runs during development. Launch it with:

npx workflow inspect runs --web

This opens a browser-based inspector that reads from .next/workflow-data/ in your project directory. You can see each run, its steps, inputs, outputs, and timing without deploying anything.

You can also inspect Sandbox activity during local development. Open the Vercel Dashboard, select your project, and navigate to Sandboxes in the left sidebar. Sandbox invocations from localhost are visible here because they authenticate through Vercel's OIDC tokens.

Deploy the app to see the full observability picture. Workflow runs are only recorded in the Vercel Dashboard for deployed applications, not from localhost.

Once deployed, open your Vercel Dashboard, select your project, and check three areas:

Workflows (left sidebar → Workflows): Each run is listed with every step, its inputs and outputs, timing, and status. If a retry happened, you can trace what failed and what the model did differently on the next attempt. No logging code is required for this; it's automatic.

Sandboxes (left sidebar → Sandboxes): Every Sandbox invocation is logged with its creation time, runtime, and duration. You can verify that sandboxes are being created and stopped correctly for each execution step.

AI Gateway (left sidebar → AI Gateway): Every LLM call shows the model used, token count, latency, and cost. This helps you understand how much each evaluation costs and whether a faster or cheaper model would work for certain steps.

LLMs frequently wrap code in markdown fences (```typescript ... ```) even when instructed not to. If those fences get written to the Sandbox as-is, the runtime tries to parse them as code and fails with a syntax error. The stripCodeFences helper in step 3 handles this, but it's worth remembering whenever you write LLM output to a file system.

Use a finally block to call sandbox.stop(). Sandboxes have a default 5-minute timeout, but stopping them explicitly keeps costs down and avoids hitting concurrency limits.

By default, steps retry on any thrown error. If you want a step to fail permanently (for example, invalid user input), throw a FatalError from the workflow package instead.
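
A minimal sketch of that pattern, assuming FatalError is exported from the workflow package root and applied to the generateCode step shown earlier:

import { FatalError } from "workflow";

async function generateCode(prompt: string, previousError: string | null) {
  "use step";

  // An empty prompt will never succeed on retry, so fail the step permanently
  if (!prompt.trim()) {
    throw new FatalError("Prompt must not be empty");
  }

  // ...the rest of the step as shown earlier
}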

If every run installs the same dependencies, snapshot a Sandbox after the install step and create new instances from that snapshot. This saves setup time on every subsequent run.

AI Gateway makes it straightforward to assign different models to different steps. You might use a faster, cheaper model for test generation and a more capable one for code generation, each specified independently.
