Build Your Operations Runbook So Anyone Can Fix Outages

Your bot just crashed during the quarterly board meeting demo. The on-call engineer doesn't know TypeScript. They need a step-by-step guide to diagnose, mitigate, and resolve the incident. Without a runbook, they're guessing. With one, they're following a proven playbook that gets the bot back online in minutes, not hours.

Outcome

Create a comprehensive RUNBOOK.md with SLOs and incident procedures, then simulate a production incident to validate your response process.

Fast Track

  1. Create RUNBOOK.md with sections: Setup, Secrets, Deploy, Incidents, Rollback
  2. Define SLOs: ack < 3s (p99), response < 15s (p95), error rate < 1%
  3. Simulate a rate limit incident and follow the runbook to resolution

Building on Previous Lessons

Your runbook leverages everything we've built in the previous lessons.

Hands-On Exercise 5.3

Create an operations runbook and validate it with incident simulation:

Requirements:

  1. Create RUNBOOK.md with all operational procedures
  2. Define SLOs with specific thresholds
  3. Document incident response flowchart
  4. Include rollback procedures with verification steps
  5. Simulate a rate limit incident and resolve using the runbook

Implementation hints:

  • Use actual Vercel commands and log queries
  • Include correlation ID search examples
  • Add specific error patterns to look for
  • Create a decision tree for common failures
  • Test the runbook by following it exactly
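The decision-tree hint can be captured as a small lookup that maps an error pattern from a log entry to the next runbook step, mirroring the branches of the incident response flowchart. This is a minimal sketch, not the course's implementation: the `LogEntry` shape, the `classifyError` name, and the step labels are hypothetical, and the pattern strings should be replaced with the ones your own logs emit.

```typescript
// Hypothetical log shape; adjust field names to match your structured logs.
interface LogEntry {
  error?: string;      // e.g. 'rate_limited'
  statusCode?: number; // HTTP status from the upstream call, if any
}

type NextStep =
  | 'apply-backoff'
  | 'check-openai-status'
  | 'update-manifest'
  | 'escalate';

// Maps an error pattern to the runbook action from the incident flowchart.
function classifyError(entry: LogEntry): NextStep {
  const error = entry.error ?? '';
  const status = entry.statusCode ?? 0;

  if (status === 429 || error.includes('rate_limited')) return 'apply-backoff';
  if (status >= 500 || error.includes('timeout')) return 'check-openai-status';
  if (error.includes('missing_scope')) return 'update-manifest';
  return 'escalate'; // unknown pattern: hand off to a senior engineer
}
```

Keeping the mapping in one function makes it easy to add a branch each time a real incident reveals a new pattern.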

SLOs to define:

slos:
  acknowledgment:
    target: 99%
    threshold: 3000ms
    measurement: "Time to ack() Slack events"
  
  response_time:
    target: 95%
    threshold: 15000ms
    measurement: "Time from event to final response"
  
  error_rate:
    target: < 1%
    measurement: "Percentage of failed responses"
  
  availability:
    target: 99.9%
    measurement: "Bot responding to mentions"
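To make these targets checkable, the first three SLOs can be computed from per-request records pulled out of your structured logs. The sketch below assumes a hypothetical `RequestRecord` shape (`ackMs`, `responseMs`, `failed`) and hypothetical function names; it is one way to evaluate the thresholds above, not a prescribed implementation. Availability is left out because it needs an external probe rather than request logs.

```typescript
// Hypothetical per-request record derived from structured logs.
interface RequestRecord {
  ackMs: number;      // time to ack() the Slack event
  responseMs: number; // time from event to final response
  failed: boolean;    // true if the final response failed
}

// Fraction of values strictly below the threshold.
// An empty list counts as fully compliant (no requests, no violations).
function fractionWithin(values: number[], thresholdMs: number): number {
  if (values.length === 0) return 1;
  return values.filter((v) => v < thresholdMs).length / values.length;
}

interface SloReport {
  ackOk: boolean;
  responseOk: boolean;
  errorRateOk: boolean;
}

// Checks the log-derived SLOs: 99% of acks under 3s,
// 95% of responses under 15s, and fewer than 1% failed responses.
function evaluateSlos(records: RequestRecord[]): SloReport {
  const ackFraction = fractionWithin(records.map((r) => r.ackMs), 3000);
  const responseFraction = fractionWithin(records.map((r) => r.responseMs), 15000);
  const errorRate =
    records.length === 0 ? 0 : records.filter((r) => r.failed).length / records.length;

  return {
    ackOk: ackFraction >= 0.99,
    responseOk: responseFraction >= 0.95,
    errorRateOk: errorRate < 0.01,
  };
}
```

Running this over a fixed window (for example, the last hour of logs) gives a pass/fail per SLO that can feed the alerts listed in the runbook.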

Try It

  1. Create comprehensive runbook:

    /slack-agent/RUNBOOK.md
    # Slack Bot Operations Runbook
     
    ## Quick Reference
    - **Production URL**: https://slack-bot-prod.vercel.app
    - **Health Check**: https://slack-bot-prod.vercel.app/health
    - **Logs**: https://vercel.com/team/slack-bot-prod/functions
    - **On-call**: @oncall-slack-bot (PagerDuty)
     
    ## SLOs (Service Level Objectives)
     
    | Metric | Target | Threshold | Alert |
    |--------|--------|-----------|-------|
    | Event Acknowledgment | 99% | < 3s | PagerDuty High |
    | Response Time (p95) | 95% | < 15s | PagerDuty Low |
    | Error Rate | < 1% | - | PagerDuty Medium |
    | Availability | 99.9% | - | PagerDuty High |
     
    ## Common Issues Quick Fix
     
    ### Bot Not Responding
    1. Check health endpoint: `curl https://slack-bot-prod.vercel.app/health`
    2. Verify in Vercel dashboard: Functions tab → Check for errors
    3. Check Slack App config: Event Subscriptions → URL verified?
    4. If URL not verified: Redeploy with `pnpm dlx vercel --prod --force`
     
    ### Rate Limit Errors (429)
    1. Check logs for `rateLimitWaitMs` > 0
    2. Verify retry logic: `grep "retryAttempt" logs | tail -20`
    3. Temporary mitigation: Scale down concurrent requests
    4. Long-term: Implement request queuing
     
    ## Deployment Procedures
     
    ### Normal Deploy
    ```bash
    git pull origin main
    pnpm test
    pnpm dlx vercel --prod
    # Verify: curl https://slack-bot-prod.vercel.app/health
    ```

    ### Emergency Rollback
    ```bash
    # List recent deployments and pick the last known-good one
    pnpm dlx vercel ls

    # Roll back production to that deployment (pass its URL or ID)
    pnpm dlx vercel rollback <deployment-url-or-id>

    # Verify rollback
    curl https://slack-bot-prod.vercel.app/health
    # Check logs for normal operation
    ```

    ## Incident Response Flowchart

    ```
    ALERT FIRED
        ↓
    [Check Health] → Failed → [Check Vercel Status]
        ↓ OK                      ↓
    [Check Logs]              [Await Resolution]
        ↓
    [Correlation Search]
        ↓
    Error Pattern?
      ├─ 429/Rate Limit → [Apply Backoff]
      ├─ 5xx/Timeout → [Check OpenAI Status]
      ├─ Missing Scope → [Update Manifest]
      └─ Unknown → [Escalate to Senior]
    ```

    ## Log Investigation Commands

    ### Find Recent Errors
    ```
    Filter: level:50 OR level:40
    Time: Last 1 hour
    ```

    ### Track Specific Request
    ```
    Filter: correlationId:"EVENT_ID_TIMESTAMP"
    Shows: Full request lifecycle
    ```

    ### Check AI Performance
    ```
    Filter: operation:respondToMessage
    Aggregate: AVG(latencyMs), MAX(retryAttempt)
    ```

    ## Secret Rotation

    1. Generate new token in Slack App Config
    2. Remove the old value in Vercel: `pnpm dlx vercel env rm SLACK_BOT_TOKEN`
    3. Add new: `pnpm dlx vercel env add SLACK_BOT_TOKEN`
    4. Redeploy: `pnpm dlx vercel --prod --force`
    5. Verify: Test bot mention in Slack

    ## Monitoring Setup

    ### Vercel Monitoring
    - Enable Monitoring in project settings
    - Set alert for Function errors > 1%
    - Set alert for Function duration > 10s (p95)

    ### Custom Health Checks
    - Endpoint: /health
    - Frequency: Every 60 seconds
    - Alert: 2 consecutive failures

    ## Incident Communication

    ### Status Updates
    - Initial: "#incidents - Investigating bot responsiveness issues"
    - Update: "#incidents - Identified rate limiting, applying fixes"
    - Resolution: "#incidents - Resolved, bot operating normally"

    ### Post-Mortem Template
    - **Duration**: Start time - End time
    - **Impact**: % of requests affected
    - **Root Cause**: Specific technical issue
    - **Resolution**: Steps taken
    - **Prevention**: Long-term fixes
  2. Simulate rate limit incident:

    /slack-agent/scripts/simulate-incident.ts
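    // Trigger multiple rapid requests to hit the rate limit.
    // Assumes SLACK_BOT_TOKEN is set and C_TEST_CHANNEL is replaced with a real test channel ID.
    import { WebClient } from '@slack/web-api';

    const client = new WebClient(process.env.SLACK_BOT_TOKEN);

    for (let i = 0; i < 50; i++) {
      await client.chat.postMessage({
        channel: 'C_TEST_CHANNEL',
        text: `@bot test message ${i}`
      });
    }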

    Expected logs:

    [WARN] bolt-app {
      correlationId: 'ev_1234_1733456789',
      operation: 'respondToMessage',
      error: 'rate_limited',
      retryAfter: 30000,
      retryAttempt: 1
    } Rate limited, waiting 30s
    
    [INFO] bolt-app {
      correlationId: 'ev_1234_1733456789',
      operation: 'respondToMessage', 
      retryAttempt: 2,
      rateLimitWaitMs: 30000,
      model: 'gpt-4o-mini'
    } Retry successful after backoff
    
  3. Follow runbook to resolve:

    # 1. Identify issue in logs
    pnpm dlx vercel logs --follow | grep "rate_limited"
     
    # 2. Check retry metrics
    # Filter: retryAttempt:>0
    # Shows: 47 requests with retries
     
    # 3. Verify backoff working
    # Filter: rateLimitWaitMs:>0
    # Shows: Proper exponential backoff applied
     
    # 4. Confirm resolution
    # Recent logs show normal operation
  4. Update incident log:

    /slack-agent/INCIDENTS.md
    ## 2024-12-06: Rate Limit Event
     
    **Duration**: 14:30 - 14:45 UTC (15 minutes)
    **Impact**: 47 messages delayed by 5-30 seconds
    **Root Cause**: Burst of 50 messages exceeded OpenAI rate limit
     
    **Timeline**:
    - 14:30 - Alert: Error rate spike to 94%
    - 14:31 - Identified rate_limited errors in logs
    - 14:32 - Confirmed retry logic engaging
    - 14:35 - Backoff successful, queue clearing
    - 14:45 - Normal operation restored
     
    **Resolution**: 
    - Automatic retry with exponential backoff handled issue
    - No manual intervention required
     
    **Action Items**:
    - [ ] Implement request queue to smooth bursts
    - [ ] Add pre-emptive rate limit monitoring
    - [ ] Update alerts to distinguish retriable errors
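The incident log credits automatic retry with exponential backoff for the recovery. As a rough illustration of how such a retry loop chooses its wait time and emits the `retryAttempt` and `rateLimitWaitMs` fields seen in the expected logs, here is a minimal sketch. The names `RateLimitError`, `backoffDelayMs`, and `withRetry`, the 1s base delay, and the 60s cap are all assumptions for illustration; your bot's actual retry code from earlier lessons may differ.

```typescript
// Hypothetical error type carrying the server's retry hint, if one was sent.
class RateLimitError extends Error {
  constructor(public retryAfterMs?: number) {
    super('rate_limited');
  }
}

// Delay before the next attempt: the server's hint when present,
// otherwise exponential backoff (1s, 2s, 4s, ...), both capped at maxMs.
function backoffDelayMs(attempt: number, retryAfterMs?: number, maxMs = 60_000): number {
  if (retryAfterMs !== undefined) return Math.min(retryAfterMs, maxMs);
  return Math.min(1000 * 2 ** (attempt - 1), maxMs);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries fn on RateLimitError up to maxAttempts; other errors propagate immediately.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!(err instanceof RateLimitError) || attempt >= maxAttempts) throw err;
      const waitMs = backoffDelayMs(attempt, err.retryAfterMs);
      // Structured fields match the ones the runbook tells responders to filter on.
      console.warn({ error: 'rate_limited', retryAttempt: attempt, rateLimitWaitMs: waitMs });
      await sleep(waitMs);
    }
  }
}
```

Logging the attempt number and wait time on every retry is what makes the "Check retry metrics" and "Verify backoff working" steps possible during an incident.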

Troubleshooting

Runbook Not Helping:

  • Ensure it covers actual production scenarios
  • Add more specific error patterns from real incidents
  • Include actual commands that work, not theoretical ones

SLO Measurement Issues:

  • Use structured logs to calculate metrics accurately
  • Set up Vercel Analytics for automatic tracking
  • Consider external monitoring (Datadog, New Relic)

Incident Simulation Too Disruptive:

Use a staging environment:

# Deploy to staging (without --prod, the Vercel CLI creates a preview deployment)
pnpm dlx vercel
 
# Run tests against the staging URL
SLACK_BOT_URL=https://staging.vercel.app pnpm run test:incident

Commit

git add -A
git commit -m "feat(ops): comprehensive operations runbook with incident procedures
 
- Create RUNBOOK.md with deployment and rollback procedures
- Define SLOs: 3s ack (p99), 15s response (p95), <1% errors
- Document incident response flowchart
- Add log investigation queries and commands
- Include secret rotation and monitoring setup
- Validate with simulated rate limit incident"

Done-When

  • RUNBOOK.md exists with all sections
  • SLOs defined with specific thresholds
  • Incident response flowchart documented
  • Rollback procedure tested and verified
  • Rate limit incident simulated and resolved

Solution

The complete runbook structure includes:

  1. Quick reference section with all critical URLs and contacts
  2. SLO definitions with specific metrics and thresholds
  3. Common issues with step-by-step fixes
  4. Deployment procedures including rollback
  5. Incident flowchart for systematic response
  6. Log queries for investigation
  7. Secret rotation procedures
  8. Monitoring setup instructions
  9. Communication templates for incidents

Key operational commands:

# Health check
curl https://slack-bot-prod.vercel.app/health
 
# View logs
pnpm dlx vercel logs --follow
 
# List deployments
pnpm dlx vercel ls
 
# Rollback (pass the URL or ID of a known-good deployment from the list above)
pnpm dlx vercel rollback <deployment-url-or-id>
 
# Force redeploy
pnpm dlx vercel --prod --force
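The runbook's custom health check (poll `/health` every 60 seconds, alert after 2 consecutive failures) can be automated rather than run by hand with `curl`. The sketch below keeps the failure counting as a pure function so it is easy to test; `HealthState`, `recordProbe`, and `startHealthMonitor` are hypothetical names, and the probe and alert callbacks are injected so the sketch does not assume a particular HTTP client or paging integration.

```typescript
interface HealthState {
  consecutiveFailures: number;
  alerting: boolean;
}

const ALERT_AFTER_FAILURES = 2; // from the runbook: alert on 2 consecutive failures

// Returns the next state after one probe result. A success resets the counter.
function recordProbe(state: HealthState, ok: boolean): HealthState {
  const consecutiveFailures = ok ? 0 : state.consecutiveFailures + 1;
  return { consecutiveFailures, alerting: consecutiveFailures >= ALERT_AFTER_FAILURES };
}

// Polls a caller-supplied probe on an interval and calls onAlert when alerting starts.
// `probe` should resolve true when GET /health succeeds.
function startHealthMonitor(
  probe: () => Promise<boolean>,
  onAlert: () => void,
  intervalMs = 60_000,
): () => void {
  let state: HealthState = { consecutiveFailures: 0, alerting: false };
  const timer = setInterval(async () => {
    const ok = await probe().catch(() => false); // a thrown probe counts as a failure
    const next = recordProbe(state, ok);
    if (next.alerting && !state.alerting) onAlert(); // fire once per outage
    state = next;
  }, intervalMs);
  return () => clearInterval(timer); // call to stop monitoring
}
```

Requiring two failures in a row filters out single transient blips, at the cost of up to one extra polling interval before the alert fires.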

Key Takeaways

  • Runbooks must be tested regularly or they become fiction
  • SLOs should be measurable from your existing logs
  • Incident response is about systematic investigation, not heroics
  • Rollback capability is mandatory for production systems
  • Chaos engineering finds problems before your users do