Workflow Error Handling

Right now, every failure in the workflow looks the same. The weather API times out? Step fails, retries three times, gives up. An alert references a resort called narnia? Step fails, retries three times with the same bad ID, gives up. Three wasted retries on something that was never going to work.

The Workflow DevKit gives you two error classes to fix this. FatalError says "stop trying, this is broken forever." RetryableError says "try again, but wait a bit first." And getStepMetadata() tells you which attempt you're on, so you can back off gracefully instead of hammering a struggling API.

Outcome

Add error classification to the evaluateResort step with FatalError for permanent failures, RetryableError with exponential backoff for transient failures, and getStepMetadata() for attempt-aware logic.

Fast Track

  1. Throw FatalError for invalid resort IDs (no point retrying)
  2. Catch weather API failures and throw RetryableError with a retryAfter duration
  3. Use getStepMetadata().attempt to calculate exponential backoff

Three Kinds of Failure

Resort not found → FatalError
  "narnia doesn't exist. Stop trying."

Weather API timeout → RetryableError
  "Open-Meteo is slow right now. Try again in 5 seconds."

Unknown error → Default retry
  "Something unexpected happened. Retry with default timing."

Error Type      | Behavior                                      | When to Use
FatalError      | Immediately fails the step, skips all retries | Bad data, missing resources, auth failures
RetryableError  | Retries after the specified delay             | API timeouts, rate limits, 503 errors
Unhandled error | Retries with default timing (up to 3 times)   | Unexpected failures

Without these classes, every error gets the same default retry behavior. That means three wasted retries against a resort that doesn't exist. With error classification, the first failure is the last.
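
One way to picture the table is as a small classification function that turns a raw failure into one of the three categories. The sketch below is an illustration only: the FatalError and RetryableError classes are local stand-ins (in the real step they come from workflow), and the HttpFailure shape and the specific status codes are assumptions chosen to match the "When to Use" column.

```typescript
// Local stand-ins so the sketch runs on its own. In the real step, import
// FatalError and RetryableError from 'workflow' instead.
class FatalError extends Error {}

class RetryableError extends Error {
  options: { retryAfter: number | string | Date };
  constructor(message: string, options: { retryAfter: number | string | Date }) {
    super(message);
    this.options = options;
  }
}

// Hypothetical failure shape: assumes the failing call exposes an HTTP status.
interface HttpFailure {
  status?: number;
  message: string;
}

// Map a raw failure onto the three categories from the table.
function classify(failure: HttpFailure): Error {
  const { status, message } = failure;

  // Bad data, missing resources, auth failures: retrying cannot help.
  if (status === 400 || status === 401 || status === 403 || status === 404) {
    return new FatalError(message);
  }

  // Timeouts, rate limits, temporary outages: wait, then try again.
  if (status === 408 || status === 429 || status === 503) {
    return new RetryableError(message, { retryAfter: '5s' });
  }

  // Anything else stays a plain Error and gets the default retry behavior.
  return new Error(message);
}
```

A step would then throw classify(failure) from its catch block, so the decision about retrying lives in one place.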

Hands-on exercise 3.3

Add error handling to the evaluateResort step:

Requirements:

  1. Import FatalError, RetryableError, and getStepMetadata from workflow
  2. Throw FatalError when getResort(resortId) returns nothing (permanent failure)
  3. Wrap fetchWeather() in try/catch and throw RetryableError on failure
  4. Use getStepMetadata().attempt to calculate exponential backoff for the retryAfter option
  5. Log the attempt number so you can track retries in server logs

Implementation hints:

  • FatalError and RetryableError are imported from workflow
  • RetryableError accepts an options object as its second argument, for example { retryAfter: '5s' }. The retryAfter value can be a duration string, a number of milliseconds, or a Date
  • getStepMetadata() returns { attempt, stepId }. The attempt count starts at 1
  • Exponential backoff formula: Math.min(1000 * 2 ** (attempt - 1), 30000), which caps the delay at 30 seconds. Use ** or Math.pow here, because ^ is bitwise XOR in JavaScript
  • You can set a custom retry limit on a step function: evaluateResort.maxRetries = 5 (6 total attempts)
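
The backoff hint can be checked in isolation before wiring it into the step. This is a minimal sketch; backoffMs and its baseMs and capMs parameters are illustrative names, not part of the Workflow DevKit.

```typescript
// Delay before the next attempt, in milliseconds. `attempt` starts at 1.
function backoffMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  // 2 ** n is exponentiation; 2 ^ n would be bitwise XOR.
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

const schedule = [1, 2, 3, 4, 5, 6, 7].map((attempt) => backoffMs(attempt));
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

With the default limit of 3 retries only the first few delays are ever used; the cap matters once maxRetries is raised.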

Try It

  1. Test with a valid resort (should work as before):

    $ curl -X POST http://localhost:5173/api/workflow \
      -H "Content-Type: application/json" \
      -d '{"alerts": [{"id": "a1", "resortId": "mammoth", "condition": {"type": "conditions", "match": "powder"}, "originalQuery": "test", "createdAt": "2025-01-01", "triggered": false}]}'

    No errors in the response. Workflow completes normally.

  2. Test with an invalid resort ID:

    $ curl -X POST http://localhost:5173/api/workflow \
      -H "Content-Type: application/json" \
      -d '{"alerts": [{"id": "a1", "resortId": "narnia", "condition": {"type": "conditions", "match": "powder"}, "originalQuery": "test", "createdAt": "2025-01-01", "triggered": false}, {"id": "a2", "resortId": "steamboat", "condition": {"type": "conditions", "match": "powder"}, "originalQuery": "test", "createdAt": "2025-01-01", "triggered": false}]}'

    Run npx workflow web to open the dashboard. The evaluateResort step for narnia should show as immediately failed with no retries. The steamboat step should succeed normally.

  3. Check server logs:

    [Evaluate] Fatal: Resort not found: narnia
    [Workflow] Round complete { round: 1, evaluated: 1, triggered: 0 }
    

    The fatal error is logged once. No retry attempts.

  4. Inspect in the dashboard:

    npx workflow web

    Click into the workflow run. You should see:

    • evaluateResort (narnia): failed, 0 retries, FatalError
    • evaluateResort (steamboat): completed successfully

Commit

git add -A
git commit -m "feat(workflow): add FatalError and RetryableError handling"
git push

Done-When

  • Invalid resort IDs throw FatalError and skip all retries
  • Weather API failures throw RetryableError with a retryAfter duration
  • getStepMetadata().attempt drives exponential backoff
  • npx workflow web shows fatal stops and retry attempts
  • Valid resorts still process successfully alongside failures

Solution

workflows/evaluate-alerts.ts
import { sleep, FatalError, RetryableError, getStepMetadata } from 'workflow';
import type { Alert } from '$lib/schemas/alert';
 
interface EvaluateInput {
  alerts: Alert[];
  recheckCount?: number;
}
 
interface AlertResult {
  alertId: string;
  resortId: string;
  triggered: boolean;
}
 
export default async function evaluateAlerts(
  { alerts, recheckCount = 0 }: EvaluateInput
) {
  "use workflow";
 
  const alertsByResort = Object.groupBy(alerts, (a) => a.resortId);
  const resortIds = Object.keys(alertsByResort);
 
  const results = await Promise.all(
    resortIds.map((resortId) =>
      evaluateResort(resortId, alertsByResort[resortId]!)
    )
  );
 
  const allResults = results.flat();
  const triggered = allResults.filter((r) => r.triggered);
 
  console.log('[Workflow] Round complete', {
    round: recheckCount + 1,
    evaluated: allResults.length,
    triggered: triggered.length
  });
 
  if (triggered.length === 0 && recheckCount < 3) {
    await sleep('30m');
    return evaluateAlerts({ alerts, recheckCount: recheckCount + 1 });
  }
 
  return {
    results: allResults,
    rounds: recheckCount + 1,
    triggered: triggered.length
  };
}
 
async function evaluateResort(
  resortId: string,
  alerts: Alert[]
): Promise<AlertResult[]> {
  "use step";
 
  const { attempt } = getStepMetadata();
  const { getResort } = await import('$lib/data/resorts');
  const { fetchWeather } = await import('$lib/services/weather');
  const { evaluateCondition } = await import('$lib/services/alerts');
 
  // Permanent failure: resort doesn't exist
  const resort = getResort(resortId);
  if (!resort) {
    console.error(`[Evaluate] Fatal: Resort not found: ${resortId}`);
    throw new FatalError(`Resort not found: ${resortId}`);
  }
 
  // Transient failure: weather API might be down
  let weather;
  try {
    weather = await fetchWeather(resort);
  } catch (error) {
    const backoff = Math.min(1000 * Math.pow(2, attempt - 1), 30000);
    console.warn(
      `[Evaluate] Weather fetch failed for ${resort.name}, attempt ${attempt}`,
      error
    );
    throw new RetryableError(
      `Weather API failed for ${resort.name}`,
      { retryAfter: backoff }
    );
  }
 
  return alerts.map((alert) => ({
    alertId: alert.id,
    resortId,
    triggered: evaluateCondition(alert.condition, weather)
  }));
}

Three changes from lesson 3.2:

FatalError for bad resort IDs. If getResort() returns nothing, there's no resort to evaluate. FatalError stops the step immediately with zero retries. In the dashboard, you'll see it marked as a permanent failure.

RetryableError for weather API failures. The fetchWeather() call is wrapped in try/catch. When it fails, we throw RetryableError with a retryAfter value. The Workflow DevKit waits that long before the next attempt. Since retryAfter accepts milliseconds, we can pass the backoff calculation directly.

Exponential backoff with getStepMetadata().attempt. The attempt number starts at 1. The formula 1000 * 2^(attempt-1) gives us 1s, 2s, 4s, 8s, 16s, capped at 30s. This prevents hammering a struggling API with rapid retries.

The workflow function itself doesn't change. It still uses Promise.all to dispatch parallel steps. FatalError and RetryableError only affect the individual step that threw them. Other steps continue independently.
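
One caveat about collecting results: Promise.all rejects as soon as any of its promises rejects, so a step that fails for good also rejects the combined promise. If a round should still report results for the healthy resorts, one option is Promise.allSettled, which waits for every promise and reports each outcome. The sketch below uses a stand-in evaluateResortStub (no "use step" directive, no workflow imports), so it illustrates only the collection pattern, not the course's workflow.

```typescript
interface AlertResult {
  alertId: string;
  resortId: string;
  triggered: boolean;
}

// Stand-in for the real step: fails permanently for one unknown resort.
async function evaluateResortStub(resortId: string): Promise<AlertResult[]> {
  if (resortId === 'narnia') {
    throw new Error(`Resort not found: ${resortId}`);
  }
  return [{ alertId: `alert-${resortId}`, resortId, triggered: false }];
}

// Keep results from steps that succeeded and record the ones that failed.
async function evaluateAll(resortIds: string[]) {
  const settled = await Promise.allSettled(
    resortIds.map((resortId) => evaluateResortStub(resortId))
  );

  const results: AlertResult[] = [];
  const failures: { resortId: string; reason: string }[] = [];

  settled.forEach((outcome, index) => {
    if (outcome.status === 'fulfilled') {
      results.push(...outcome.value);
    } else {
      failures.push({
        resortId: resortIds[index] ?? 'unknown',
        reason: String(outcome.reason)
      });
    }
  });

  return { results, failures };
}
```

Called with ['narnia', 'steamboat'], this returns one result for steamboat and one recorded failure for narnia instead of rejecting the whole round.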

Troubleshooting

FatalError doesn't stop retries

Make sure you're importing FatalError from workflow, not defining your own class. The Workflow DevKit checks the error prototype to determine behavior. A custom class with the same name won't work.
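
To see why a same-named class is not enough, assume the check behaves like instanceof, as the text above suggests. In this sketch the first FatalError is a local stand-in for the class exported by workflow, and LookalikeFatalError is a hypothetical class defined elsewhere with the same name.

```typescript
// Stand-in for the class exported by 'workflow' (not the real implementation).
class FatalError extends Error {}

// A look-alike defined elsewhere: same name, different class.
const LookalikeFatalError = class FatalError extends Error {};

const fromLibrary = new FatalError('Resort not found: narnia');
const fromLookalike = new LookalikeFatalError('Resort not found: narnia');

console.log(fromLibrary instanceof FatalError); // true
console.log(fromLookalike instanceof FatalError); // false
console.log(LookalikeFatalError.name === FatalError.name); // true
```

The names match, but the prototype chains do not, so the look-alike would be treated as an ordinary error and retried.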

RetryableError retries immediately instead of waiting

Check the retryAfter value. It accepts a duration string ('5s'), milliseconds as a number (5000), or a Date object. If you pass a string that isn't a valid duration format, the delay may be ignored.

Advanced: Custom Retry Limits

By default, steps retry 3 times (4 total attempts). You can customize this per step:

async function evaluateResort(resortId: string, alerts: Alert[]) {
  "use step";
  // ...step logic
}
 
// Allow more retries for flaky APIs
evaluateResort.maxRetries = 5; // 6 total attempts

Set maxRetries = 0 for steps that should never retry (one attempt only). Combine this with FatalError for steps where any failure is permanent.
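
The relationship between maxRetries and total attempts can be modeled with a toy runner. This is not how the DevKit executes steps (there are no delays and no persistence here); runStep and the local FatalError are stand-ins that only reproduce the attempt counting described above.

```typescript
// Local stand-in; the real class comes from 'workflow'.
class FatalError extends Error {}

interface StepOutcome<T> {
  value?: T;
  attempts: number;
  failed: boolean;
}

// Toy model: one initial attempt plus up to maxRetries retries.
function runStep<T>(step: (attempt: number) => T, maxRetries = 3): StepOutcome<T> {
  const totalAttempts = maxRetries + 1;
  for (let attempt = 1; attempt <= totalAttempts; attempt++) {
    try {
      return { value: step(attempt), attempts: attempt, failed: false };
    } catch (error) {
      // A fatal error ends the step on the spot, whatever maxRetries says.
      if (error instanceof FatalError) {
        return { attempts: attempt, failed: true };
      }
    }
  }
  return { attempts: totalAttempts, failed: true };
}

const alwaysFails = () => { throw new Error('API down'); };
const fatal = () => { throw new FatalError('Resort not found'); };

console.log(runStep(alwaysFails).attempts); // 4
console.log(runStep(alwaysFails, 5).attempts); // 6
console.log(runStep(alwaysFails, 0).attempts); // 1
console.log(runStep(fatal, 5).attempts); // 1
```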

Advanced: Idempotency Keys

getStepMetadata() also returns a stepId that's stable across retries. Use it as an idempotency key for external APIs:

async function sendNotification(userId: string, message: string) {
  "use step";
 
  const { stepId } = getStepMetadata();
 
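  const response = await fetch('https://api.notifications.example/send', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Idempotency-Key': stepId
    },
    body: JSON.stringify({ userId, message })
  });

  // fetch only rejects on network errors, so surface HTTP failures explicitly
  if (!response.ok) {
    throw new Error(`Notification request failed: ${response.status}`);
  }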
}

If the step retries, the same stepId is sent again. The external API sees the duplicate key and skips the second send. No double notifications, even with retries.
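
What the receiving side does with the key is up to that API. As a rough sketch, a receiver might remember the keys it has already handled; NotificationService, its in-memory Set, and the 'step_123' key are all hypothetical and stand in for whatever the external service actually uses.

```typescript
// Hypothetical receiver: remembers which idempotency keys it has handled.
class NotificationService {
  handledKeys = new Set<string>();
  sent: string[] = [];

  send(idempotencyKey: string, message: string): 'sent' | 'duplicate' {
    // A repeated key means a retry of the same step: skip the second send.
    if (this.handledKeys.has(idempotencyKey)) {
      return 'duplicate';
    }
    this.handledKeys.add(idempotencyKey);
    this.sent.push(message);
    return 'sent';
  }
}

const service = new NotificationService();
const firstTry = service.send('step_123', 'Powder day at Mammoth');
const retryOfSameStep = service.send('step_123', 'Powder day at Mammoth');

console.log(firstTry, retryOfSameStep, service.sent.length); // sent duplicate 1
```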