AI agents that generate and run code need three things: durable orchestration that survives failures, isolated execution that protects your infrastructure, and observable model access you can swap without rewriting code. Vercel Workflows, Sandbox, and AI Gateway solve each of these.
This guide covers how to combine all three in a single Next.js app that generates code from a natural language prompt, writes its own tests, runs them in an isolated microVM, and retries automatically when tests fail.
In this guide, you'll learn how to:
- Define a durable, multi-step workflow using the "use workflow" and "use step" compiler directives
- Execute AI-generated code safely in a Sandbox microVM
- Route LLM calls through AI Gateway with a single API key
- Build a self-healing agentic loop that retries on failure
- Observe every step, LLM call, and execution result in the Vercel Dashboard
A user submits a coding task in plain English. A Workflow orchestrates the entire process:
- AI Gateway routes a request to Claude to generate code
- A second AI Gateway call generates test cases
- Sandbox spins up an isolated microVM to run the code and tests
If the tests fail, the Workflow loops back to the generation step with the error context, and the cycle repeats up to three times. Every step is durable, observable, and automatic.
The three primitives each handle a distinct responsibility:
- Workflows manages orchestration, retries, and state. If a step crashes, it replays deterministically from the last completed step.
- Sandbox handles execution. AI-generated code runs in an ephemeral Firecracker microVM with its own filesystem and network. Nothing touches your server.
- AI Gateway handles model access. One API key, any model, with spend tracking and observability built in.
Before you begin, make sure you have:
- A Vercel account
- A Vercel project linked to your local repository. AI Gateway and Sandbox both authenticate automatically via OIDC tokens when deployed on Vercel. For local development, vercel env pull pulls the token into your .env.local.
- Basic knowledge of TypeScript and Next.js
- The Vercel CLI installed
Create a new Next.js project and install the dependencies:
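A setup sketch (the project name is arbitrary, and the package names reflect the current SDKs — verify them against the docs if an install fails):

```shell
npx create-next-app@latest code-evaluator --typescript --app
cd code-evaluator

# Workflow SDK, Sandbox SDK, and the AI SDK for AI Gateway calls
npm install workflow @vercel/sandbox ai
```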
Link your project to Vercel and pull environment variables. Sandbox uses Vercel OIDC tokens to authenticate, so this step is required for local development:
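```shell
vercel link        # connect this directory to your Vercel project
vercel env pull    # writes the OIDC token into .env.local
```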
OIDC tokens expire every 12 hours during local development. If your token expires, run vercel env pull again. For longer local sessions, you can create an AI Gateway API key in the dashboard and add it to .env.local as AI_GATEWAY_API_KEY instead.
Wrap your next.config.ts with withWorkflow() to enable the "use workflow" and "use step" compiler directives. Without this wrapper, they're string literals with no effect.
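A minimal config sketch (the withWorkflow import path is from the Workflow SDK; confirm it against the current docs):

```typescript
// next.config.ts
import { withWorkflow } from "workflow/next";
import type { NextConfig } from "next";

const nextConfig: NextConfig = {};

export default withWorkflow(nextConfig);
```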
Create workflows/evaluate-code.ts. This file is the orchestrator: it coordinates the steps, handles the retry loop, and passes error context between iterations.
The "use workflow" directive marks this function as durable. If the process crashes or redeploys mid-run, it replays from the last completed step rather than starting over. The while loop creates the self-healing behavior: when tests fail, the next iteration includes the error message so the model can correct its approach.
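One way to sketch the orchestrator. Function and step names here are illustrative, and the three steps are stubbed inline so the loop itself runs as plain TypeScript (outside the compiler, "use workflow" and "use step" are inert string literals); the real step bodies call AI Gateway and Sandbox as described in the step sections of this guide.

```typescript
// workflows/evaluate-code.ts (sketch; step bodies are stubs for illustration)
const MAX_ATTEMPTS = 3;

async function generateCode(task: string, previousError?: string): Promise<string> {
  "use step";
  // Stub: the real step calls Claude via AI Gateway.
  return previousError ? "corrected code" : "first attempt";
}

async function generateTests(code: string): Promise<string> {
  "use step";
  // Stub: the real step asks the model for assert-based tests.
  return "tests for: " + code;
}

async function runInSandbox(code: string, tests: string): Promise<{ passed: boolean; output: string }> {
  "use step";
  // Stub: the real step executes code + tests in a Firecracker microVM.
  return code === "corrected code"
    ? { passed: true, output: "all tests passed" }
    : { passed: false, output: "AssertionError: expected ..." };
}

export async function evaluateCode(task: string) {
  "use workflow";
  let previousError: string | undefined;
  let attempt = 0;

  while (attempt < MAX_ATTEMPTS) {
    attempt++;
    const code = await generateCode(task, previousError);
    const tests = await generateTests(code);
    const result = await runInSandbox(code, tests);
    if (result.passed) {
      return { attempt, code, output: result.output };
    }
    // Feed the failure back so the next generation attempt can correct it.
    previousError = result.output;
  }
  throw new Error(`Tests still failing after ${MAX_ATTEMPTS} attempts`);
}
```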
Each step is an async function with the "use step" directive. Steps get automatic retries on failure, run as separate requests, and their inputs and outputs are recorded for observability.
This step sends the coding task to Claude via AI Gateway. On retry iterations, it includes the previous error so the model can adjust its approach.
The model string format is provider/model-name. Because you're going through AI Gateway, swapping to a different model (for example, openai/gpt-4o) is a one-line change with no new SDK or API key required.
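A sketch of this step using the AI SDK's generateText with a gateway model string. The model ID and prompt wording are examples, and stripCodeFences is the fence-stripping helper this guide describes in step 3; check the AI Gateway model list for current IDs.

```typescript
import { generateText } from "ai";

export async function generateCode(task: string, previousError?: string): Promise<string> {
  "use step";
  const { text } = await generateText({
    model: "anthropic/claude-sonnet-4.5", // provider/model-name, routed through AI Gateway
    prompt: [
      `Write a Node.js solution for: ${task}.`,
      `Return only code, no markdown fences or commentary.`,
      previousError ? `The previous attempt failed with:\n${previousError}\nFix the issue.` : "",
    ].join("\n"),
  });
  return stripCodeFences(text);
}
```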
A second LLM call generates test cases for the code. The tests use Node's built-in assert module to avoid installing a test framework in the Sandbox.
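Sketched the same way as the previous step (illustrative model ID and prompt; stripCodeFences as above):

```typescript
export async function generateTests(code: string): Promise<string> {
  "use step";
  const { text } = await generateText({
    model: "anthropic/claude-sonnet-4.5",
    prompt:
      `Write tests for the following code using only Node's built-in "node:assert" module.\n` +
      `Exit nonzero on failure. Return only code.\n\n${code}`,
  });
  return stripCodeFences(text);
}
```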
Sandbox spins up an isolated Firecracker microVM, writes the generated code and tests into it, and runs them. The Sandbox is completely isolated, so even if the generated code does something destructive, it can't affect your application.
The finally block ensures sandbox.stop() is called even if the step throws. Each Sandbox is ephemeral, so there's nothing to persist after capturing the output.
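A sketch of the execution step, assuming the @vercel/sandbox API surface (Sandbox.create, writeFiles, runCommand, stop); file names and option shapes are illustrative, so verify the method signatures against the Sandbox SDK reference:

```typescript
import { Sandbox } from "@vercel/sandbox";

export async function runInSandbox(code: string, tests: string) {
  "use step";
  const sandbox = await Sandbox.create();
  try {
    await sandbox.writeFiles([
      { path: "solution.js", content: Buffer.from(code) },
      { path: "solution.test.js", content: Buffer.from(tests) },
    ]);
    const result = await sandbox.runCommand({ cmd: "node", args: ["solution.test.js"] });
    return {
      passed: result.exitCode === 0,
      output: (await result.stdout()) + (await result.stderr()),
    };
  } finally {
    await sandbox.stop(); // always release the microVM, even if the step throws
  }
}
```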
The route handler triggers the workflow and returns immediately. The workflow runs asynchronously on Vercel's infrastructure, backed by Vercel Queues.
The start() function enqueues the workflow and returns a handle. The workflow itself runs durably in the background.
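A sketch of the route handler (the route path and response shape are illustrative; the start import path and run-handle fields should be checked against the Workflow SDK docs):

```typescript
// app/api/evaluate/route.ts
import { start } from "workflow/api";
import { evaluateCode } from "@/workflows/evaluate-code";

export async function POST(request: Request) {
  const { task } = await request.json();
  const run = await start(evaluateCode, [task]); // enqueue and return immediately
  return Response.json({ message: "Workflow started", runId: run.runId });
}
```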
Start the development server:
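```shell
npm run dev
```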
Try a straightforward prompt first:
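For example, assuming the route lives at /api/evaluate and accepts a JSON task field:

```shell
curl -X POST http://localhost:3000/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{"task": "Write a function that reverses a string"}'
```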
You should see a response confirming the workflow started:
Back in the terminal where your dev server is running, you'll see a series of POST requests to /.well-known/workflow/v1/flow and /.well-known/workflow/v1/step. The /step requests are your individual step functions executing (LLM calls and sandbox execution), and the /flow requests are the orchestrator progressing the workflow state between steps. The step requests taking 2-6 seconds are typically LLM calls through AI Gateway, while shorter ones are the orchestrator advancing to the next step.
These logs confirm the workflow is running, but they don't show you the result. To see the full run with inputs, outputs, and timing for each step, use the local workflow inspector covered in step 7.
Now try something harder:
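For example, again assuming an /api/evaluate route with a JSON task field:

```shell
curl -X POST http://localhost:3000/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{"task": "Implement a debounce function with leading and trailing invocation options"}'
```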
This prompt is more likely to trigger the retry loop. The model might produce a subtly incorrect implementation on the first pass, the tests catch it, and the workflow loops back with the error context. You'll see more /step and /flow requests in your terminal as the workflow iterates. That self-healing cycle is the core value of combining durable orchestration with safe execution.
The Workflow SDK includes a local web UI for inspecting runs during development. Launch it with:
This opens a browser-based inspector that reads from .next/workflow-data/ in your project directory. You can see each run, its steps, inputs, outputs, and timing without deploying anything.
You can also inspect Sandbox activity during local development. Open the Vercel Dashboard, select your project, and navigate to Sandboxes in the left sidebar. Sandbox invocations from localhost are visible here because they authenticate through Vercel's OIDC tokens.
Deploy the app to see the full observability picture. Workflow runs are only recorded in the Vercel Dashboard for deployed applications, not from localhost.
Once deployed, open your Vercel Dashboard, select your project, and check three areas:
Workflows (left sidebar → Workflows): Each run is listed with every step, its inputs and outputs, timing, and status. If a retry happened, you can trace what failed and what the model did differently on the next attempt. No logging code is required for this; it's automatic.
Sandboxes (left sidebar → Sandboxes): Every Sandbox invocation is logged with its creation time, runtime, and duration. You can verify that sandboxes are being created and stopped correctly for each execution step.
AI Gateway (left sidebar → AI Gateway): Every LLM call shows the model used, token count, latency, and cost. This helps you understand how much each evaluation costs and whether a faster or cheaper model would work for certain steps.
LLMs frequently wrap code in markdown fences (```typescript ... ```) even when instructed not to. If those fences get written to the Sandbox as-is, the runtime tries to execute them as code and throws a TypeError. The stripCodeFences helper in step 3 handles this, but it's worth remembering whenever you write LLM output to a file system.
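A minimal version of such a helper (a sketch, not necessarily the guide's exact implementation): it strips one outer fence, with or without a language tag, and leaves unfenced text untouched.

```typescript
// Remove a single wrapping markdown code fence (e.g. ```typescript ... ```)
// from LLM output before writing it to the Sandbox file system.
function stripCodeFences(text: string): string {
  const trimmed = text.trim();
  // Match an opening fence with an optional language tag, the body, and a closing fence.
  const match = trimmed.match(/^```[\w-]*\n([\s\S]*?)\n?```$/);
  return match ? match[1] : trimmed;
}
```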
Use a finally block to call sandbox.stop(). Sandboxes have a default 5-minute timeout, but stopping them explicitly keeps costs down and avoids hitting concurrency limits.
By default, steps retry on any thrown error. If you want a step to fail permanently (for example, invalid user input), throw a FatalError from the workflow package instead.
If every run installs the same dependencies, snapshot a Sandbox after the install step and create new instances from that snapshot. This saves setup time on every subsequent run.
AI Gateway makes it straightforward to assign different models to different steps. You might use a faster, cheaper model for test generation and a more capable one for code generation, each specified independently.
- Learn about Workflow concepts including sleeps, hooks, and skew protection
- Explore the full Sandbox SDK reference for file system operations, networking, and snapshots
- See AI Gateway models and providers for the full list of available models
- Read the Workflow SDK getting started guide for more on directives, error handling, and deployment
- Try the Safely running AI generated code guide for a single-primitive starting point