Build resilient AI that survives rate limits, timeouts, and model failures

At 3 AM during an incident, your bot hits OpenAI rate limits. Without retries and fallbacks, it dies exactly when your team needs it most. While Vercel AI Gateway handles provider-level failures, your bot still needs application-level resilience for Slack APIs, network issues, and model-specific fallbacks. Production bots don't get to fail gracefully—they get to not fail.

Outcome

Implement retry logic with exponential backoff, model fallbacks, and graceful degradation for production reliability.

Core Concept

First Try → Fails (rate limit/timeout)
    ↓
Wait 1 second → Retry
    ↓
Still fails? → Wait 2 seconds → Retry
    ↓
Still fails? → Try backup model (gpt-3.5)
    ↓
All failed? → User-friendly error message

Retry Flow Diagram

┌─────────────────────────────────────────────────────────────────┐
│                 Exponential Backoff & Fallback Flow            │
└─────────────────────────────────────────────────────────────────┘

Timeline: 0s ── 0.5s ── 2s ── 5s ── 10s ── 11s ── 13s

Request arrives (t=0)
    │
    ├─[Attempt 1: GPT-4o-mini]──X (429 rate limit @ 0.5s)
    │                            │
    │                      [Wait 1s]
    │                            │
    ├─[Attempt 2: GPT-4o-mini]──X (429 rate limit @ 2s)
    │                            │
    │                      [Wait 2s]
    │                            │
    ├─[Attempt 3: GPT-4o-mini]──X (429 rate limit @ 5s)
    │                            │
    │                      [Wait 4s]
    │                            │
    ├─[Attempt 4: GPT-4o-mini]──X (Still rate limited @ 10s)
    │                            │
    │                   [Fallback triggered]
    │                            │
    ├─[Attempt 1: GPT-3.5-turbo]─X (Network error @ 11s)
    │                            │
    │                      [Wait 1s]
    │                            │
    └─[Attempt 2: GPT-3.5-turbo]─✓ Success! (@ 13s)
                                 │
                           [Response sent]

Retry Strategy (see the sketch after this list):
- Max attempts per model: 4
- Backoff multiplier: 2x
- Max wait time: 8 seconds
- Fallback chain: gpt-4o-mini → gpt-3.5-turbo → error message
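
The schedule in that strategy box follows a single formula. A minimal sketch, treating the 8-second "max wait time" as a cap on the doubling delay (the exercise solution below omits the cap for simplicity):

// Backoff delay for a 1-indexed attempt: 1s, 2s, 4s, ... doubling
// each time and capped at 8s, per the strategy above.
function backoffDelayMs(
  attempt: number,
  initialDelayMs = 1000,
  maxDelayMs = 8000
): number {
  return Math.min(initialDelayMs * 2 ** (attempt - 1), maxDelayMs);
}

// backoffDelayMs(1) === 1000, backoffDelayMs(2) === 2000,
// backoffDelayMs(3) === 4000, backoffDelayMs(4) === 8000 (capped from here on)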

Fast Track

  1. Create basic retry wrapper with exponential backoff
  2. Add model fallback chain to AI responses
  3. Test with /test-resilience command

Hands-On Exercise 4.4

Build a retry wrapper that makes your bot resilient to API failures:

Requirements:

  1. Create /slack-agent/server/lib/ai/retry-wrapper.ts with basic retry logic
  2. Implement exponential backoff (1s → 2s → 4s → 8s)
  3. Add model fallback in respond-to-message.ts (gpt-4o-mini → gpt-3.5-turbo)
  4. Return a friendly error message if all retries fail

Implementation hints:

  • Start simple: just count attempts and increase delay
  • Check if error status is 429 (rate limit) to know when to retry
  • Use setTimeout wrapped in a Promise for delays (see the sketch after this list)
  • Keep the existing system prompt when falling back to another model
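
For the delay hint, a minimal sleep helper is enough, using nothing beyond standard Node APIs:

// Promise-wrapped setTimeout, as suggested in the hints above
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Inside a retry loop:
// await sleep(1000 * 2 ** (attempt - 1));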

Manifest update for test command:

{
  "slash_commands": [
    {
      "command": "/test-resilience",
      "url": "https://your-ngrok-url/api/slack/events",
      "description": "Test bot resilience with simulated failures",
      "should_escape": false
    }
  ]
}

Try It

  1. Test the resilience command:

    /test-resilience
    
  2. Watch the logs to see retry behavior with correlation-style tracking:

    [INFO] Simulating error failure (attempt 1) { correlationId: 'retry-1757720494-a3b2c1' }
    [INFO] Attempt 1 failed, retrying in 1000ms { correlationId: 'retry-1757720494-a3b2c1' }
    [INFO] Simulating error failure (attempt 2) { correlationId: 'retry-1757720494-a3b2c1' }
    [INFO] Attempt 2 failed, retrying in 2000ms { correlationId: 'retry-1757720494-a3b2c1' }
    [INFO] Attempting with model: openai/gpt-4o-mini { correlationId: 'retry-1757720494-a3b2c1' }
    
  3. If first model fails completely, see fallback:

    [ERROR] Model openai/gpt-4o-mini failed after retries { correlationId: 'retry-1757720494-a3b2c1' }
    [INFO] Attempting with model: openai/gpt-3.5-turbo { correlationId: 'retry-1757720494-a3b2c1' }
    ✅ Resilience test completed
    

These logs use an operation-level `correlationId` generated inside the retry wrapper. In a full implementation, you'd also include `...context.correlation` from the Bolt Middleware lesson at the call site so retries can be tied back to the original Slack event.
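
One hedged way to wire that up; the correlationId field on RetryOptions is hypothetical here (the solution below mints its own ID instead), and the shape of context.correlation is assumed from the Bolt Middleware lesson:

// Hypothetical extension: let callers pass a correlation ID into withRetry
// so retry logs share the ID minted by the Bolt middleware.
interface RetryOptions {
  maxRetries?: number;
  initialDelayMs?: number;
  correlationId?: string; // hypothetical - not part of the solution below
}

// At the call site, reuse the middleware's correlation context:
// await withRetry(() => generateText(/* ... */), {
//   maxRetries: 3,
//   ...context.correlation, // assumed to carry { correlationId }
// });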

Commit

git add -A
git commit -m "feat(ai): add retry logic with exponential backoff and model fallbacks"

Done-When

  • Failed API calls retry with exponential backoff
  • Rate limits respect retry-after header
  • Model fallbacks activate on primary failure
  • Users receive helpful messages during degradation
  • All retries logged with correlation IDs

Solution

Create /slack-agent/server/lib/ai/retry-wrapper.ts:

/slack-agent/server/lib/ai/retry-wrapper.ts
import { app } from "~/app";
 
interface RetryOptions {
  maxRetries?: number;
  initialDelayMs?: number;
}
 
// Type guard for HTTP errors
function isHttpError(error: unknown): error is {
  status: number;
  headers?: Record<string, string>;
  retryAfter?: number;
} {
  return error instanceof Error && 'status' in error;
}
 
export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const {
    maxRetries = 3,
    initialDelayMs = 1000,
  } = options;
 
  let lastError: unknown;
  const correlationId = `retry-${Date.now()}-${Math.random().toString(36).slice(2, 9)}`;
 
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
 
      // CONTROL DECISION: Don't retry client errors (except rate limits)
      // Rationale: Bad requests won't succeed on retry, fail fast to save cost and time
      if (isHttpError(error)) {
        if (error.status >= 400 && error.status < 500 && error.status !== 429) {
          app.logger.warn('Client error detected, failing fast (no retry)', {
            correlationId,
            status: error.status,
            reason: 'Client errors are permanent, retrying wastes time and money'
          });
          throw error; // Fail fast - no retry will fix this
        }
      }
 
      // CONTROL DECISION: Last attempt? Stop retrying
      // Rationale: We've exhausted retries, propagate failure to caller for graceful degradation
      if (attempt === maxRetries) {
        app.logger.error(`All ${maxRetries} attempts failed`, {
          correlationId,
          attempts: maxRetries,
          error: error instanceof Error ? error.message : String(error),
          outcome: 'Switching to fallback model or graceful degradation'
        });
        throw error; // Exhausted retries, let caller handle graceful degradation
      }
 
      // CONTROL DECISION: Calculate backoff with rate limit awareness
      // Rationale: Respect service's retry-after directive to avoid ban
      let delayMs = initialDelayMs * Math.pow(2, attempt - 1);
 
      // Check for an explicit retry-after directive from the service
      if (isHttpError(error) && error.status === 429) {
        const retryAfterSeconds = error.retryAfter ?? Number(error.headers?.['retry-after']);

        // Only override the exponential delay when the directive is present
        // and parseable; otherwise Number(undefined) would yield NaN
        if (Number.isFinite(retryAfterSeconds)) {
          delayMs = retryAfterSeconds * 1000;

          app.logger.info('Rate limited, using service-requested delay', {
            correlationId,
            requestedDelayMs: delayMs,
            reason: 'Service told us exactly when to retry - respect it to avoid a ban'
          });
        }
      }
 
      app.logger.info(`Attempt ${attempt} failed, retrying in ${delayMs}ms`, {
        correlationId,
        nextAttempt: attempt + 1,
        strategy: 'exponential_backoff'
      });
 
      // Execute backoff delay before next attempt
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
 
  throw lastError;
}
 
// Test helper for simulating failures
export function simulateFailure(type: 'error'): void {
  if (process.env.SIMULATE_FAILURES !== 'true') return;
  
  const attemptKey = `__test_attempts_${type}`;
  const attempts = (globalThis as any)[attemptKey] || 0;
  (globalThis as any)[attemptKey] = attempts + 1;
  
  // Fail first 2 attempts, succeed on 3rd
  if (attempts < 2) {
    app.logger.info(`Simulating ${type} failure (attempt ${attempts + 1})`);
    throw new Error('Simulated service error');
  }
  
  // Reset counter after success
  delete (globalThis as any)[attemptKey];
}
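
The wrapper is generic, so the same policy can protect Slack Web API calls. A quick sketch, assuming a Bolt client is in scope; note that whether Slack SDK errors match the isHttpError guard (status vs. code fields) depends on your @slack/web-api version, so adjust the guard if needed:

// Example: retry a Slack Web API call with the same backoff policy
const result = await withRetry(
  () => client.chat.postMessage({ channel, text: "Deploy finished ✅" }),
  { maxRetries: 3, initialDelayMs: 1000 }
);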

Create /slack-agent/server/listeners/commands/test-resilience.ts:

/slack-agent/server/listeners/commands/test-resilience.ts
import type { AllMiddlewareArgs, SlackCommandMiddlewareArgs } from "@slack/bolt";
import { respondToMessage } from "~/lib/ai/respond-to-message";
 
export const testResilienceCallback = async ({
  ack,
  command,
  client,
  logger,
}: AllMiddlewareArgs & SlackCommandMiddlewareArgs) => {
  await ack();
  
  const { user_id, channel_id } = command;
  
  try {
    // Enable failure simulation
    process.env.SIMULATE_FAILURES = 'true';
    
    const response = await client.chat.postMessage({
      channel: channel_id,
      text: `🧪 Testing resilience...`,
    });
    
    // Test the AI response with simulated failures
    const aiResponse = await respondToMessage({
      messages: [{ 
        role: 'user', 
        content: 'Test message for resilience' 
      }],
      event: {
        type: 'message',
        text: 'Test message',
        user: user_id,
        ts: response.ts!,
        channel: channel_id,
        channel_type: 'channel',
      } as any,
      channel: channel_id,
      thread_ts: response.ts,
      botId: undefined,
    });
    
    await client.chat.postMessage({
      channel: channel_id,
      thread_ts: response.ts,
      text: `✅ Resilience test completed:\n${aiResponse}`,
    });
    
  } catch (error) {
    logger.error('Test resilience failed:', error);
    await client.chat.postEphemeral({
      channel: channel_id,
      user: user_id,
      text: `❌ Test failed: ${error}`,
    });
  } finally {
    // Disable failure simulation
    delete process.env.SIMULATE_FAILURES;
  }
};

About the `as any` cast

For the event in this test command we use as any to avoid dragging a full set of Slack event types into the example. In your own code, prefer reusing the typed helpers and payload types from the Bolt middleware lesson instead of broad casts—this keeps your handlers fully type-safe while following the same retry patterns.
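
For reference, a hedged sketch of the typed alternative, using GenericMessageEvent as re-exported by @slack/bolt (verify the required fields against your installed version):

import type { GenericMessageEvent } from "@slack/bolt";

// Typed test event instead of a broad `as any` cast
const testEvent: GenericMessageEvent = {
  type: "message",
  subtype: undefined,
  text: "Test message",
  user: user_id,
  ts: response.ts!,
  event_ts: response.ts!,
  channel: channel_id,
  channel_type: "channel",
};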

Register in /slack-agent/server/listeners/commands/index.ts:

/slack-agent/server/listeners/commands/index.ts
import type { App } from "@slack/bolt";
import { echoCallback } from "./echo";
import { sampleCommandCallback } from "./sample-command";
import { testResilienceCallback } from "./test-resilience";
 
const register = (app: App) => {
  app.command("/sample-command", sampleCommandCallback);
  app.command("/echo", echoCallback);
  app.command("/test-resilience", testResilienceCallback);
};
 
export default { register };

If you already have additional commands (like /compare-context) from other lessons, register them here as well. This snippet only shows the commands relevant to the resilience test.

Update /slack-agent/server/lib/ai/respond-to-message.ts:

/slack-agent/server/lib/ai/respond-to-message.ts
import type { KnownEventFromType } from "@slack/bolt";
import { generateText, type ModelMessage, stepCountIs } from "ai";
import { withRetry, simulateFailure } from "./retry-wrapper";
import { app } from "~/app";
// ... existing imports ...
 
// Share a single system prompt builder between createTextStream and respondToMessage.
// In your code, copy the ENTIRE system prompt string from createTextStream into this
// function (including the channel_type-specific prefix) so both flows stay in sync.
const buildSystemPrompt = (
  event: KnownEventFromType<"message"> | KnownEventFromType<"app_mention">
) => `You are Slack Agent, a helpful assistant in Slack.
// ... same full system prompt as createTextStream above ...
`;
 
export const respondToMessage = async ({
  messages,
  event,
  channel,
  thread_ts,
  botId,
}: RespondToMessageOptions) => {
  // CONTROL STRATEGY: Explicit model fallback chain
  // Primary: gpt-4o-mini (fast, cheap, good quality)
  // Fallback: gpt-3.5-turbo (even cheaper, more reliable availability)
  // Rationale: If primary fails, degrade to cheaper/simpler model rather than total failure
  const models = [
    "openai/gpt-4o-mini",
    "openai/gpt-3.5-turbo",
  ];
 
  let lastError: unknown;
 
  // CONTROL FLOW: Try each model in sequence until one succeeds
  // Strategy: Fail forward through cheaper models, only fail completely as last resort
  for (const model of models) {
    try {
      // Wrap AI call with retry logic (inner control layer)
      // Outer loop: model fallback, Inner loop: network retries
      const { text, usage } = await withRetry(
        async () => {
          // Test helper: simulate failures on first model
          if (process.env.SIMULATE_FAILURES === 'true' && model === models[0]) {
            simulateFailure('error');
          }
 
          app.logger.info(`Attempting with model: ${model}`, {
            position: `${models.indexOf(model) + 1}/${models.length}`,
            reason: model === models[0] ? 'Primary model (optimal quality)' : 'Fallback model (degraded but reliable)'
          });
 
          return await generateText({
            model,
            // Reuse the SAME full system prompt you implemented in createTextStream.
            // We truncate buildSystemPrompt in this snippet for brevity, but in your
            // code both createTextStream and respondToMessage should import/use it.
            system: buildSystemPrompt(event),
            messages,
            stopWhen: stepCountIs(5),
            tools: {
              updateChatTitleTool,
              getThreadMessagesTool,
              getChannelMessagesTool,
              updateAgentStatusTool,
              reactToMessageTool,
            },
            experimental_context: {
              channel,
              thread_ts: thread_ts || event.ts,
              botId,
            } as ExperimentalContext,
            prepareStep: () => ({
              activeTools: getActiveTools(event),
            }),
            onStepFinish: ({ toolCalls }) => {
              if (toolCalls?.length) {
                app.logger.debug("tool calls:", toolCalls.map((c) => c.input));
              }
            },
          });
        },
        {
          maxRetries: 3,
          initialDelayMs: 1000,
        }
      );
 
      // CONTROL DECISION: Success - return immediately
      // Rationale: No need to try remaining models, we got a good response
      app.logger.info('AI request succeeded', {
        model,
        usage,
        outcome: 'Returning response to user'
      });
 
      return text;
    } catch (error) {
      lastError = error;
      app.logger.error(`Model ${model} failed after retries`, {
        model,
        error: error instanceof Error ? error.message : String(error),
        remainingModels: models.length - models.indexOf(model) - 1
      });
 
      // CONTROL DECISION: Last model in chain?
      // Rationale: All models exhausted - degrade gracefully with user-friendly message
      if (model === models[models.length - 1]) {
        app.logger.error('All models exhausted, returning graceful degradation message', {
          attemptedModels: models,
          outcome: 'User-friendly error message instead of raw exception'
        });
 
        // Graceful degradation: helpful message instead of stack trace
        return "I'm experiencing high demand right now. Please try again in a few moments.";
      }
 
      // Not last model - continue to next in fallback chain
      app.logger.info('Falling back to next model', {
        failed: model,
        next: models[models.indexOf(model) + 1],
        strategy: 'degraded_quality_over_no_response'
      });
    }
  }
 
  // Should never reach here due to graceful degradation above
  throw lastError;
};

Building on Previous Lessons

This lesson leverages stateless architecture for resilient operations:

  • From Bolt Middleware: Correlation IDs track retry attempts and model fallbacks across the full operation chain
  • From Repository Flyover: Context utilities (getThreadMessages, getChannelMessages) benefit from retry protection against transient Slack API failures
  • From system prompts, AI tools, and status communication: AI components all flow through retry wrappers
  • Production reasoning: Stateless handlers enable safe retries - each attempt is idempotent because we don't hold mutable state
  • Graceful degradation: Fallback to cheaper models (gpt-4o-mini → gpt-3.5-turbo) or cached context when primary systems fail (see the sketch after this list)
  • Sets up Deploy to Vercel: Production deployment relies on this resilience to handle real-world rate limits and network issues
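
A minimal sketch of that cached-context fallback, assuming an injected helper shaped like the Repository Flyover context utilities (the cache and the helper signature here are hypothetical):

import type { ModelMessage } from "ai";
import { withRetry } from "~/lib/ai/retry-wrapper";

// Hypothetical in-memory cache of the last context we successfully fetched
const contextCache = new Map<string, ModelMessage[]>();

export async function getThreadContextResilient(
  channel: string,
  thread_ts: string,
  // Inject your actual getThreadMessages helper here (signature assumed)
  fetchThread: (channel: string, thread_ts: string) => Promise<ModelMessage[]>
): Promise<ModelMessage[]> {
  const key = `${channel}:${thread_ts}`;
  try {
    const messages = await withRetry(() => fetchThread(channel, thread_ts), {
      maxRetries: 2,
    });
    contextCache.set(key, messages); // refresh cache on success
    return messages;
  } catch {
    // Slack API still failing after retries: serve the last context we saw
    // rather than failing the whole response.
    return contextCache.get(key) ?? [];
  }
}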

Vercel AI Gateway Integration

If using Vercel AI Gateway, you get provider-level fallback automatically (e.g., OpenAI → Anthropic). This lesson's patterns still apply for:

  • Model-level fallbacks within a provider (gpt-4o-mini → gpt-3.5-turbo)
  • Slack API resilience (not covered by Gateway)
  • Application-specific retry logic and testing (see the test sketch below)
  • Correlation tracking for debugging
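
To exercise the retry logic without Slack in the loop, a minimal unit-test sketch, assuming vitest is set up in the repo (the "~/app" mock only needs to satisfy the logger calls):

import { describe, expect, it, vi } from "vitest";

// Stub the Slack app so importing the wrapper doesn't boot Bolt
vi.mock("~/app", () => ({ app: { logger: console } }));

import { withRetry } from "~/lib/ai/retry-wrapper";

describe("withRetry", () => {
  it("retries transient failures and eventually succeeds", async () => {
    let calls = 0;
    const flaky = async () => {
      calls += 1;
      if (calls < 3) throw new Error("transient");
      return "ok";
    };
    // 1ms initial delay keeps the test fast while still hitting the backoff path
    await expect(
      withRetry(flaky, { maxRetries: 3, initialDelayMs: 1 })
    ).resolves.toBe("ok");
    expect(calls).toBe(3);
  });

  it("fails fast on non-429 client errors", async () => {
    const badRequest = Object.assign(new Error("bad request"), { status: 400 });
    let calls = 0;
    const broken = async () => {
      calls += 1;
      throw badRequest;
    };
    await expect(
      withRetry(broken, { maxRetries: 3, initialDelayMs: 1 })
    ).rejects.toBe(badRequest);
    expect(calls).toBe(1); // no retry: the client error failed fast
  });
});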