Workflow Error Handling
Right now, every failure in the workflow looks the same. The weather API times out? Step fails, retries three times, gives up. An alert references a resort called narnia? Step fails, retries three times with the same bad ID, gives up. Three wasted retries on something that was never going to work.
The Workflow DevKit gives you two error classes to fix this. FatalError says "stop trying, this is broken forever." RetryableError says "try again, but wait a bit first." And getStepMetadata() tells you which attempt you're on, so you can back off gracefully instead of hammering a struggling API.
Outcome
Add error classification to the evaluateResort step with FatalError for permanent failures, RetryableError with exponential backoff for transient failures, and getStepMetadata() for attempt-aware logic.
Fast Track
- Throw FatalError for invalid resort IDs (no point retrying)
- Catch weather API failures and throw RetryableError with a retryAfter duration
- Use getStepMetadata().attempt to calculate exponential backoff
Three Kinds of Failure
Resort not found → FatalError
"narnia doesn't exist. Stop trying."
Weather API timeout → RetryableError
"Open-Meteo is slow right now. Try again in 5 seconds."
Unknown error → Default retry
"Something unexpected happened. Retry with default timing."
| Error Type | Behavior | When to Use |
|---|---|---|
| FatalError | Immediately fails the step, skips all retries | Bad data, missing resources, auth failures |
| RetryableError | Retries after the specified delay | API timeouts, rate limits, 503 errors |
| Unhandled error | Retries with default timing (up to 3 times) | Unexpected failures |
Without these classes, every error gets the same default retry behavior. That means three more identical requests for a resort that doesn't exist. With error classification, the first failure is the last.
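The three categories in the table can be sketched as a small classifier. This is a minimal illustration, not part of the Workflow DevKit: classifyStatus, FailureKind, and the choice of which HTTP status codes land in each bucket are assumptions made for the example.

```typescript
// Hypothetical helper: sorts an HTTP status into one of the three failure
// kinds from the table. The status groupings are an illustrative assumption.
type FailureKind = 'fatal' | 'retryable' | 'default';

function classifyStatus(status: number): FailureKind {
  switch (status) {
    // Bad data, auth failures, missing resources: retrying cannot help.
    case 400:
    case 401:
    case 403:
    case 404:
      return 'fatal';
    // Timeouts and rate limits are usually transient.
    case 408:
    case 429:
      return 'retryable';
    default:
      // Server-side errors (5xx) are treated as transient too.
      if (status >= 500 && status <= 599) return 'retryable';
      // Anything else is left to the default retry behavior.
      return 'default';
  }
}
```

Inside a step, a 'fatal' result would map to throwing FatalError, a 'retryable' result to throwing RetryableError, and 'default' to rethrowing the original error unchanged.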
Hands-on exercise 3.3
Add error handling to the evaluateResort step:
Requirements:
- Import FatalError, RetryableError, and getStepMetadata from workflow
- Throw FatalError when getResort(resortId) returns nothing (permanent failure)
- Wrap fetchWeather() in try/catch and throw RetryableError on failure
- Use getStepMetadata().attempt to calculate exponential backoff for the retryAfter option
- Log the attempt number so you can track retries in server logs
Implementation hints:
- FatalError and RetryableError are imported from workflow
- RetryableError accepts a second argument: { retryAfter: '5s' } with a duration string, milliseconds as a number, or a Date
- getStepMetadata() returns { attempt, stepId }. The attempt count starts at 1
- Exponential backoff formula: Math.min(1000 * 2 ** (attempt - 1), 30000) caps at 30 seconds (in JavaScript, ^ is bitwise XOR, so use ** or Math.pow for the exponent)
- You can set a custom retry limit on a step function: evaluateResort.maxRetries = 5 (6 total attempts)
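To see what the backoff formula from the hints produces, here is a minimal sketch. The helper name backoffMs and its baseMs and capMs parameters are hypothetical. Only the formula and the 1-based attempt count come from the lesson.

```typescript
// Hypothetical helper (not part of the Workflow DevKit): capped exponential backoff.
// `attempt` is 1-based, matching the attempt count described in the hints.
function backoffMs(attempt: number, baseMs: number = 1000, capMs: number = 30000): number {
  // `**` is exponentiation; `^` would be bitwise XOR in JavaScript.
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

// Delays for attempts 1 through 6, in milliseconds.
const schedule: number[] = [1, 2, 3, 4, 5, 6].map((attempt) => backoffMs(attempt));
// schedule is [1000, 2000, 4000, 8000, 16000, 30000]
```

The sixth attempt would compute 32000 ms, so the cap is what keeps it at 30 seconds.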
Try It
- Test with a valid resort (should work as before):
  $ curl -X POST http://localhost:5173/api/workflow \
      -H "Content-Type: application/json" \
      -d '{"alerts": [{"id": "a1", "resortId": "mammoth", "condition": {"type": "conditions", "match": "powder"}, "originalQuery": "test", "createdAt": "2025-01-01", "triggered": false}]}'
  No errors in the response. Workflow completes normally.
- Test with an invalid resort ID:
  $ curl -X POST http://localhost:5173/api/workflow \
      -H "Content-Type: application/json" \
      -d '{"alerts": [{"id": "a1", "resortId": "narnia", "condition": {"type": "conditions", "match": "powder"}, "originalQuery": "test", "createdAt": "2025-01-01", "triggered": false}, {"id": "a2", "resortId": "steamboat", "condition": {"type": "conditions", "match": "powder"}, "originalQuery": "test", "createdAt": "2025-01-01", "triggered": false}]}'
  Open npx workflow web. The evaluateResort step for narnia should show as immediately failed with no retries. The steamboat step should succeed normally.
- Check server logs:
  [Evaluate] Fatal: Resort not found: narnia
  [Workflow] Round complete { round: 1, evaluated: 1, triggered: 0 }
  The fatal error is logged once. No retry attempts.
- Inspect in the dashboard:
  npx workflow web
  Click into the workflow run. You should see:
  - evaluateResort (narnia): failed, 0 retries, FatalError
  - evaluateResort (steamboat): completed successfully
Commit
git add -A
git commit -m "feat(workflow): add FatalError and RetryableError handling"
git push
Done-When
- Invalid resort IDs throw FatalError and skip all retries
- Weather API failures throw RetryableError with a retryAfter duration
- getStepMetadata().attempt drives exponential backoff
- npx workflow web shows fatal stops and retry attempts
- Valid resorts still process successfully alongside failures
Solution
import { sleep, FatalError, RetryableError, getStepMetadata } from 'workflow';
import type { Alert } from '$lib/schemas/alert';

interface EvaluateInput {
  alerts: Alert[];
  recheckCount?: number;
}

interface AlertResult {
  alertId: string;
  resortId: string;
  triggered: boolean;
}

export default async function evaluateAlerts(
  { alerts, recheckCount = 0 }: EvaluateInput
) {
  "use workflow";

  const alertsByResort = Object.groupBy(alerts, (a) => a.resortId);
  const resortIds = Object.keys(alertsByResort);

  const results = await Promise.all(
    resortIds.map((resortId) =>
      evaluateResort(resortId, alertsByResort[resortId]!)
    )
  );

  const allResults = results.flat();
  const triggered = allResults.filter((r) => r.triggered);

  console.log('[Workflow] Round complete', {
    round: recheckCount + 1,
    evaluated: allResults.length,
    triggered: triggered.length
  });

  if (triggered.length === 0 && recheckCount < 3) {
    await sleep('30m');
    return evaluateAlerts({ alerts, recheckCount: recheckCount + 1 });
  }

  return {
    results: allResults,
    rounds: recheckCount + 1,
    triggered: triggered.length
  };
}

async function evaluateResort(
  resortId: string,
  alerts: Alert[]
): Promise<AlertResult[]> {
  "use step";

  const { attempt } = getStepMetadata();
  const { getResort } = await import('$lib/data/resorts');
  const { fetchWeather } = await import('$lib/services/weather');
  const { evaluateCondition } = await import('$lib/services/alerts');

  // Permanent failure: resort doesn't exist
  const resort = getResort(resortId);
  if (!resort) {
    console.error(`[Evaluate] Fatal: Resort not found: ${resortId}`);
    throw new FatalError(`Resort not found: ${resortId}`);
  }

  // Transient failure: weather API might be down
  let weather;
  try {
    weather = await fetchWeather(resort);
  } catch (error) {
    const backoff = Math.min(1000 * Math.pow(2, attempt - 1), 30000);
    console.warn(
      `[Evaluate] Weather fetch failed for ${resort.name}, attempt ${attempt}`,
      error
    );
    throw new RetryableError(
      `Weather API failed for ${resort.name}`,
      { retryAfter: backoff }
    );
  }

  return alerts.map((alert) => ({
    alertId: alert.id,
    resortId,
    triggered: evaluateCondition(alert.condition, weather)
  }));
}
Three changes from lesson 3.2:
FatalError for bad resort IDs. If getResort() returns nothing, there's no resort to evaluate. FatalError stops the step immediately with zero retries. In the dashboard, you'll see it marked as a permanent failure.
RetryableError for weather API failures. The fetchWeather() call is wrapped in try/catch. When it fails, we throw RetryableError with a retryAfter value. The Workflow DevKit waits that long before the next attempt. Since retryAfter accepts milliseconds, we can pass the backoff calculation directly.
Exponential backoff with getStepMetadata().attempt. The attempt number starts at 1. The formula 1000 * 2^(attempt-1) gives delays of 1s, 2s, 4s, 8s, and 16s, then caps at 30s from the sixth attempt on. This prevents hammering a struggling API with rapid retries.
The workflow function itself doesn't change. It still uses Promise.all to dispatch parallel steps. FatalError and RetryableError only affect the individual step that threw them. Other steps continue independently.
Troubleshooting
Make sure you're importing FatalError from workflow, not defining your own class. The Workflow DevKit checks the error prototype to determine behavior. A custom class with the same name won't work.
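Why a same-named class fails can be shown in plain TypeScript. This is a minimal sketch that assumes the check behaves like instanceof. The lib and local objects and the isFatal function are hypothetical stand-ins, not the DevKit's actual classes or code.

```typescript
// Two unrelated classes that happen to share the name "FatalError".
// `lib` stands in for the package export; `local` for a hand-rolled copy.
const lib = (() => {
  class FatalError extends Error {
    constructor(message: string) {
      super(message);
      // Keeps the prototype chain intact even when compiled to older targets.
      Object.setPrototypeOf(this, new.target.prototype);
    }
  }
  return { FatalError };
})();

const local = (() => {
  class FatalError extends Error {
    constructor(message: string) {
      super(message);
      Object.setPrototypeOf(this, new.target.prototype);
    }
  }
  return { FatalError };
})();

// Stand-in for a prototype-based check (an assumption about how such a check works).
function isFatal(err: unknown): boolean {
  return err instanceof lib.FatalError;
}

const fromLibrary = isFatal(new lib.FatalError('Resort not found')); // true
const fromLookalike = isFatal(new local.FatalError('Resort not found')); // false
```

Both classes report the same name, but only instances of the imported one share its prototype, so the lookalike falls through to default retry handling.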
Check the retryAfter value. It accepts a duration string ('5s'), milliseconds as a number (5000), or a Date object. If you pass a string that isn't a valid duration format, the delay may be ignored.
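If you want to sanity-check a retryAfter value before throwing, the three accepted forms can be normalized to milliseconds. This is a rough sketch, not the DevKit's own parsing: retryAfterToMs is a hypothetical helper, and it assumes duration strings have the simple number-plus-unit shape seen in this lesson ('5s', '30m'). The DevKit may accept other formats.

```typescript
// Hypothetical helper: converts the three retryAfter forms to milliseconds.
// Returns null for strings that do not match the assumed "<number><unit>" shape.
type RetryAfter = string | number | Date;

function retryAfterToMs(retryAfter: RetryAfter, now: Date = new Date()): number | null {
  // A number is already milliseconds.
  if (typeof retryAfter === 'number') return retryAfter;

  // A Date means "not before this moment"; past dates become zero delay.
  if (retryAfter instanceof Date) {
    return Math.max(0, retryAfter.getTime() - now.getTime());
  }

  // Assumed string format: digits followed by ms, s, m, or h.
  const match = /^(\d+)(ms|s|m|h)$/.exec(retryAfter.trim());
  if (!match) return null;

  const amount = Number(match[1]);
  switch (match[2]) {
    case 'ms': return amount;
    case 's': return amount * 1000;
    case 'm': return amount * 60 * 1000;
    case 'h': return amount * 60 * 60 * 1000;
    default: return null;
  }
}
```

A null result is a hint that the string is worth double-checking. Passing the number of milliseconds directly, as the solution does, avoids the question entirely.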
Advanced: Custom Retry Limits
By default, steps retry 3 times (4 total attempts). You can customize this per step:
async function evaluateResort(resortId: string, alerts: Alert[]) {
  "use step";
  // ...step logic
}
// Allow more retries for flaky APIs
evaluateResort.maxRetries = 5; // 6 total attempts
Set maxRetries = 0 for steps that should never retry (one attempt only). Combine this with FatalError for steps where any failure is permanent.
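The relationship between maxRetries and total attempts is easy to get off by one. The following is a small simulation, not the DevKit runtime. countAttempts and its succeedsOnAttempt parameter are hypothetical and exist only to make the counting concrete.

```typescript
// Simulation only: counts how many attempts a step gets for a given maxRetries.
// `succeedsOnAttempt` is the first 1-based attempt that would succeed
// (use Infinity for a step that always fails).
function countAttempts(
  maxRetries: number,
  succeedsOnAttempt: number
): { attempts: number; succeeded: boolean } {
  const totalAllowed = maxRetries + 1; // the first attempt plus the retries
  for (let attempt = 1; attempt <= totalAllowed; attempt++) {
    if (attempt >= succeedsOnAttempt) {
      return { attempts: attempt, succeeded: true };
    }
  }
  return { attempts: totalAllowed, succeeded: false };
}

// An always-failing step: default of 3 retries, raised to 5, and disabled.
const defaultLimit = countAttempts(3, Infinity); // { attempts: 4, succeeded: false }
const raisedLimit = countAttempts(5, Infinity); // { attempts: 6, succeeded: false }
const noRetries = countAttempts(0, Infinity); // { attempts: 1, succeeded: false }
```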
Advanced: Idempotency Keys
getStepMetadata() also returns a stepId that's stable across retries. Use it as an idempotency key for external APIs:
async function sendNotification(userId: string, message: string) {
  "use step";
  const { stepId } = getStepMetadata();
  await fetch('https://api.notifications.example/send', {
    method: 'POST',
    headers: { 'Idempotency-Key': stepId },
    body: JSON.stringify({ userId, message })
  });
}
If the step retries, the same stepId is sent again. The external API sees the duplicate key and skips the second send. No double notifications, even with retries.
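The deduplication happens on the receiving side. As a rough illustration of what such an API might do with the key, here is an in-memory sketch. NotificationService is entirely hypothetical, and a real service would persist keys and expire them rather than hold them in a Map.

```typescript
// Hypothetical receiver: remembers idempotency keys and ignores repeats.
class NotificationService {
  private seenKeys = new Map<string, string>();
  public delivered: string[] = [];

  send(idempotencyKey: string, message: string): 'sent' | 'duplicate' {
    // A key seen before means this is a retry of an earlier request.
    if (this.seenKeys.has(idempotencyKey)) return 'duplicate';
    this.seenKeys.set(idempotencyKey, message);
    this.delivered.push(message);
    return 'sent';
  }
}

const service = new NotificationService();
const firstTry = service.send('step-123', 'Powder alert'); // 'sent'
const retry = service.send('step-123', 'Powder alert'); // 'duplicate'
```

Because the key stays the same across retries, the second call is recognized as a repeat and only one notification is delivered.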