Build Your Operations Runbook So Anyone Can Fix Outages

Your bot just crashed during the quarterly board meeting demo. The on-call engineer doesn't know TypeScript. They need a step-by-step guide to diagnose, mitigate, and resolve the incident. Without a runbook, they're guessing. With one, they're following a proven playbook that gets the bot back online in minutes, not hours.

Outcome

Create a comprehensive RUNBOOK.md with SLOs and incident procedures, then simulate a production incident to validate your response process.

Fast Track

Create RUNBOOK.md with sections: Setup, Secrets, Deploy, Incidents, Rollback
Define SLOs: ack < 3s (p99), response < 15s (p95), error rate < 1%
Simulate a rate limit incident and follow the runbook to resolution

Building on Previous Lessons

Your runbook leverages everything we've built:

From error handling: Retry logic and rate limit handling procedures
From deploy to Vercel: Deployment and rollback commands
From structured logs: Correlation IDs and log fields for incident investigation

Hands-On Exercise 5.3

Create an operations runbook and validate it with incident simulation:

Requirements:

Create RUNBOOK.md with all operational procedures
Define SLOs with specific thresholds
Document incident response flowchart
Include rollback procedures with verification steps
Simulate a rate limit incident and resolve using the runbook

Implementation hints:

Use actual Vercel commands and log queries
Include correlation ID search examples
Add specific error patterns to look for
Create a decision tree for common failures
Test the runbook by following it exactly

SLOs to define:

slos:
  acknowledgment:
    target: 99%
    threshold: 3000ms
    measurement: "Time to ack() Slack events"
  
  response_time:
    target: 95%
    threshold: 15000ms
    measurement: "Time from event to final response"
  
  error_rate:
    target: < 1%
    measurement: "Percentage of failed responses"
  
  availability:
    target: 99.9%
    measurement: "Bot responding to mentions"
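To check these targets against real traffic, each SLO can be computed from structured log entries. The sketch below is one minimal way to do that; the `RequestLog` shape and its `ackMs` and `failed` fields are assumptions for illustration (only `latencyMs` appears in this lesson's log queries), so map them to whatever your logger actually emits.

```typescript
// Hypothetical shape of one structured log entry per handled Slack event.
interface RequestLog {
  ackMs: number; // time to ack() the Slack event
  latencyMs: number; // time from event to final response
  failed: boolean; // true if the final response failed
}

interface SloReport {
  ackWithinThreshold: number; // fraction of events acked in under 3000 ms
  responseWithinThreshold: number; // fraction of responses in under 15000 ms
  errorRate: number; // fraction of failed responses
  met: { acknowledgment: boolean; responseTime: boolean; errorRate: boolean };
}

// Fraction of items matching the predicate; an empty window counts as compliant.
function fraction<T>(items: T[], predicate: (item: T) => boolean): number {
  if (items.length === 0) return 1;
  return items.filter(predicate).length / items.length;
}

function evaluateSlos(logs: RequestLog[]): SloReport {
  const ackWithinThreshold = fraction(logs, (l) => l.ackMs < 3000);
  const responseWithinThreshold = fraction(logs, (l) => l.latencyMs < 15000);
  const errorRate = logs.length === 0 ? 0 : fraction(logs, (l) => l.failed);
  return {
    ackWithinThreshold,
    responseWithinThreshold,
    errorRate,
    met: {
      acknowledgment: ackWithinThreshold >= 0.99, // 99% target
      responseTime: responseWithinThreshold >= 0.95, // 95% target
      errorRate: errorRate < 0.01, // must stay below 1%
    },
  };
}
```

Running this over a fixed window (for example, the last hour of logs) gives a pass/fail per SLO that can feed the alert levels in the runbook.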

Try It

Create comprehensive runbook:

/slack-agent/RUNBOOK.md

# Slack Bot Operations Runbook
 
## Quick Reference
- **Production URL**: https://slack-bot-prod.vercel.app
- **Health Check**: https://slack-bot-prod.vercel.app/health
- **Logs**: https://vercel.com/team/slack-bot-prod/functions
- **On-call**: @oncall-slack-bot (PagerDuty)
 
## SLOs (Service Level Objectives)
 
| Metric | Target | Threshold | Alert |
|--------|--------|-----------|-------|
| Event Acknowledgment | 99% | < 3s | PagerDuty High |
| Response Time (p95) | 95% | < 15s | PagerDuty Low |
| Error Rate | < 1% | - | PagerDuty Medium |
| Availability | 99.9% | - | PagerDuty High |
 
## Common Issues Quick Fix
 
### Bot Not Responding
1. Check health endpoint: `curl https://slack-bot-prod.vercel.app/health`
2. Verify in Vercel dashboard: Functions tab → Check for errors
3. Check Slack App config: Event Subscriptions → URL verified?
4. If URL not verified: Redeploy with `pnpm dlx vercel --prod --force`
 
### Rate Limit Errors (429)
1. Check logs for `rateLimitWaitMs` > 0
2. Verify retry logic: `grep "retryAttempt" logs | tail -20`
3. Temporary mitigation: Scale down concurrent requests
4. Long-term: Implement request queuing
 
## Deployment Procedures
 
### Normal Deploy
```bash
git pull origin main
pnpm test
pnpm dlx vercel --prod
# Verify: curl https://slack-bot-prod.vercel.app/health
```

### Emergency Rollback
```bash
# List recent deployments and note the URL of the last known-good one
pnpm dlx vercel ls

# Roll back production to that deployment
pnpm dlx vercel rollback <previous-deployment-url>

# Verify rollback
curl https://slack-bot-prod.vercel.app/health
# Check logs for normal operation
```

## Incident Response Flowchart

ALERT FIRED
    ↓
[Check Health] → Failed → [Check Vercel Status]
    ↓ OK                      ↓
[Check Logs]              [Await Resolution]
    ↓
[Correlation Search] 
    ↓
Error Pattern?
  ├─ 429/Rate Limit → [Apply Backoff]
  ├─ 5xx/Timeout → [Check OpenAI Status]
  ├─ Missing Scope → [Update Manifest]
  └─ Unknown → [Escalate to Senior]

## Log Investigation Commands

### Find Recent Errors
- **Filter**: `level:50 OR level:40` (error and warn, assuming pino-style numeric levels)
- **Time**: Last 1 hour

### Track Specific Request
- **Filter**: `correlationId:"EVENT_ID_TIMESTAMP"`
- **Shows**: Full request lifecycle

### Check AI Performance
- **Filter**: `operation:respondToMessage`
- **Aggregate**: `AVG(latencyMs)`, `MAX(retryAttempt)`

## Secret Rotation

1. Generate a new token in the Slack App Config
2. Remove the old value in Vercel: `pnpm dlx vercel env rm SLACK_BOT_TOKEN`
3. Add the new value: `pnpm dlx vercel env add SLACK_BOT_TOKEN`
4. Redeploy: `pnpm dlx vercel --prod --force`
5. Verify: Mention the bot in Slack and confirm it responds

## Monitoring Setup

### Vercel Monitoring
- Enable Monitoring in project settings
- Set alert for Function errors > 1%
- Set alert for Function duration > 10s (p95)

### Custom Health Checks
- **Endpoint**: /health
- **Frequency**: Every 60 seconds
- **Alert**: 2 consecutive failures

## Incident Communication

### Status Updates
- **Initial**: "#incidents - Investigating bot responsiveness issues"
- **Update**: "#incidents - Identified rate limiting, applying fixes"
- **Resolution**: "#incidents - Resolved, bot operating normally"

### Post-Mortem Template
- **Duration**: Start time - End time
- **Impact**: % of requests affected
- **Root Cause**: Specific technical issue
- **Resolution**: Steps taken
- **Prevention**: Long-term fixes
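The error-pattern branch of the runbook's incident response flowchart can also be mirrored in code, which is handy for labeling alerts with the next runbook step. This is a minimal sketch, not part of the course code: the `ErrorLog` shape is hypothetical (`error` matches the sample logs in this lesson, `status` is an assumed field), and the action names are made up for illustration.

```typescript
// Next step to take, mirroring the branches of the flowchart.
type RunbookAction =
  | "apply-backoff"
  | "check-openai-status"
  | "update-manifest"
  | "escalate";

// Hypothetical log fields; adjust to your own structured log schema.
interface ErrorLog {
  error?: string; // e.g. "rate_limited", "missing_scope"
  status?: number; // HTTP status of the failed call, if known
}

function triage(log: ErrorLog): RunbookAction {
  const message = (log.error ?? "").toLowerCase();

  // 429 / rate limit -> apply backoff
  if (log.status === 429 || message.includes("rate_limited")) {
    return "apply-backoff";
  }
  // 5xx / timeout -> check the upstream provider's status page
  if ((log.status !== undefined && log.status >= 500) || message.includes("timeout")) {
    return "check-openai-status";
  }
  // Missing Slack scope -> update the app manifest
  if (message.includes("missing_scope")) {
    return "update-manifest";
  }
  // Anything unrecognized goes to a senior engineer
  return "escalate";
}
```

Keeping the default branch as `escalate` means a new, unclassified failure is never silently treated as a known one.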

Simulate rate limit incident:

/slack-agent/scripts/simulate-incident.ts

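import { WebClient } from '@slack/web-api';

// Assumption: SIMULATOR_TOKEN belongs to a user or app other than the bot,
// since Bolt ignores the bot's own messages by default
const client = new WebClient(process.env.SIMULATOR_TOKEN);

// Trigger multiple rapid requests to hit rate limit
for (let i = 0; i < 50; i++) {
  await client.chat.postMessage({
    channel: 'C_TEST_CHANNEL',
    // Mentions need <@USER_ID> syntax; plain "@bot" text does not notify the bot
    text: `<@${process.env.BOT_USER_ID}> test message ${i}`
  });
}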

Expected logs:

[WARN] bolt-app {
  correlationId: 'ev_1234_1733456789',
  operation: 'respondToMessage',
  error: 'rate_limited',
  retryAfter: 30000,
  retryAttempt: 1
} Rate limited, waiting 30s

[INFO] bolt-app {
  correlationId: 'ev_1234_1733456789',
  operation: 'respondToMessage', 
  retryAttempt: 2,
  rateLimitWaitMs: 30000,
  model: 'gpt-4o-mini'
} Retry successful after backoff
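Log lines like these come from a retry wrapper around the rate-limited call. The sketch below shows one way such a wrapper could produce the `retryAttempt` and `rateLimitWaitMs` fields; it is an illustration under stated assumptions, not the course's actual error-handling code. `RateLimitError`, `withRateLimitRetry`, and the injected `sleep` and `log` options are hypothetical names.

```typescript
// Hypothetical error shape: a rate-limit error carrying the server's wait hint.
interface RateLimitError extends Error {
  retryAfterMs?: number;
}

interface RetryLog {
  retryAttempt: number;
  rateLimitWaitMs: number;
}

interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
  log?: (entry: RetryLog) => void;
}

// Prefer the server's hint; otherwise double the delay on each attempt.
function backoffDelayMs(attempt: number, baseDelayMs: number, retryAfterMs?: number): number {
  return retryAfterMs ?? baseDelayMs * 2 ** (attempt - 1);
}

async function withRateLimitRetry<T>(
  operation: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 3;
  const baseDelayMs = options.baseDelayMs ?? 1000;
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const log = options.log ?? (() => {});

  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const rateLimited = err instanceof Error && err.message === "rate_limited";
      // Only rate-limit errors are retried, and only up to maxAttempts.
      if (!rateLimited || attempt >= maxAttempts) throw err;
      const waitMs = backoffDelayMs(attempt, baseDelayMs, (err as RateLimitError).retryAfterMs);
      log({ retryAttempt: attempt, rateLimitWaitMs: waitMs });
      await sleep(waitMs);
    }
  }
}
```

During an incident, a steadily growing `rateLimitWaitMs` with eventual success indicates the backoff is doing its job; repeated failures at `maxAttempts` indicate it is not.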

Follow runbook to resolve:

# 1. Identify issue in logs
pnpm dlx vercel logs https://slack-bot-prod.vercel.app --follow | grep "rate_limited"
 
# 2. Check retry metrics
# Filter: retryAttempt:>0
# Shows: 47 requests with retries
 
# 3. Verify backoff working
# Filter: rateLimitWaitMs:>0
# Shows: Proper exponential backoff applied
 
# 4. Confirm resolution
# Recent logs show normal operation
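Steps 2 and 3 above can be scripted if the logs are exported as JSON, one entry per line. The following is a rough sketch under that assumption; `summarizeRetries` is a hypothetical helper, and the field names are taken from the expected logs shown earlier.

```typescript
// Fields used from each log entry; other fields are ignored.
interface LogEntry {
  correlationId: string;
  retryAttempt?: number;
  rateLimitWaitMs?: number;
}

interface RetrySummary {
  retriedRequests: number; // distinct correlation IDs with at least one retry
  maxWaitMs: number; // longest rate-limit wait observed
}

function summarizeRetries(jsonLines: string[]): RetrySummary {
  const retried = new Set<string>();
  let maxWaitMs = 0;

  for (const line of jsonLines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON (build output, blank lines)
    }
    if (typeof parsed !== "object" || parsed === null) continue;

    const entry = parsed as LogEntry;
    // Count each request once, however many times it retried.
    if ((entry.retryAttempt ?? 0) > 0) retried.add(entry.correlationId);
    maxWaitMs = Math.max(maxWaitMs, entry.rateLimitWaitMs ?? 0);
  }

  return { retriedRequests: retried.size, maxWaitMs };
}
```

Counting distinct correlation IDs rather than raw log lines keeps the number comparable to the "requests with retries" figure recorded in the incident log.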

Update incident log:

/slack-agent/INCIDENTS.md

## 2024-12-06: Rate Limit Event
 
**Duration**: 14:30 - 14:45 UTC (15 minutes)
**Impact**: 47 messages delayed by 5-30 seconds
**Root Cause**: Burst of 50 messages exceeded OpenAI rate limit
 
**Timeline**:
- 14:30 - Alert: Error rate spike to 94%
- 14:31 - Identified rate_limited errors in logs
- 14:32 - Confirmed retry logic engaging
- 14:35 - Backoff successful, queue clearing
- 14:45 - Normal operation restored
 
**Resolution**: 
- Automatic retry with exponential backoff handled issue
- No manual intervention required
 
**Action Items**:
- [ ] Implement request queue to smooth bursts
- [ ] Add pre-emptive rate limit monitoring
- [ ] Update alerts to distinguish retriable errors

Troubleshooting

Runbook Not Helping:

Ensure it covers actual production scenarios
Add more specific error patterns from real incidents
Include actual commands that work, not theoretical ones

SLO Measurement Issues:

Use structured logs to calculate metrics accurately
Set up Vercel Analytics for automatic tracking
Consider external monitoring (Datadog, New Relic)

Incident Simulation Too Disruptive:

Use a staging (preview) deployment instead of production:

# Deploy to a preview environment (omitting --prod creates a preview deployment)
pnpm dlx vercel
 
# Run tests against the staging URL
SLACK_BOT_URL=https://staging.vercel.app pnpm run test:incident

Commit

git add -A
git commit -m "feat(ops): comprehensive operations runbook with incident procedures
 
- Create RUNBOOK.md with deployment and rollback procedures
- Define SLOs: 3s ack (p99), 15s response (p95), <1% errors
- Document incident response flowchart
- Add log investigation queries and commands
- Include secret rotation and monitoring setup
- Validate with simulated rate limit incident"

Done-When

RUNBOOK.md exists with all sections
SLOs defined with specific thresholds
Incident response flowchart documented
Rollback procedure tested and verified
Rate limit incident simulated and resolved

Solution

The complete runbook structure includes:

Quick reference section with all critical URLs and contacts
SLO definitions with specific metrics and thresholds
Common issues with step-by-step fixes
Deployment procedures including rollback
Incident flowchart for systematic response
Log queries for investigation
Secret rotation procedures
Monitoring setup instructions
Communication templates for incidents

Key operational commands:

# Health check
curl https://slack-bot-prod.vercel.app/health
 
# View logs
pnpm dlx vercel logs https://slack-bot-prod.vercel.app --follow
 
# List deployments
pnpm dlx vercel ls
 
# Rollback (pass the URL of a known-good deployment from the list above)
pnpm dlx vercel rollback <previous-deployment-url>
 
# Force redeploy
pnpm dlx vercel --prod --force
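The runbook's custom health check rule (poll `/health` every 60 seconds, alert after 2 consecutive failures) can be automated with a small tracker. This is a sketch only; `createFailureTracker` is a hypothetical helper, and the probe that actually calls the endpoint is left to the caller.

```typescript
// Returns a function that records one health-check result and reports
// whether the alert threshold of consecutive failures has been reached.
function createFailureTracker(alertAfter = 2): (healthy: boolean) => boolean {
  let consecutiveFailures = 0;
  return (healthy: boolean): boolean => {
    // Any success resets the streak; each failure extends it.
    consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
    return consecutiveFailures >= alertAfter;
  };
}

// Runs a probe on an interval and calls onAlert when the threshold is hit.
// Returns a function that stops the polling.
function monitor(
  probe: () => Promise<boolean>,
  onAlert: () => void,
  intervalMs = 60_000,
): () => void {
  const shouldAlert = createFailureTracker(2);
  const timer = setInterval(async () => {
    // A probe that throws (network error, timeout) counts as a failure.
    const healthy = await probe().catch(() => false);
    if (shouldAlert(healthy)) onAlert();
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A probe could be as simple as `() => fetch(healthUrl).then((r) => r.ok)`; requiring two failures in a row avoids paging on a single transient network blip.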

Key Takeaways

Runbooks must be tested regularly or they become fiction
SLOs should be measurable from your existing logs
Incident response is about systematic investigation, not heroes
Rollback capability is mandatory for production systems
Simulated incidents, a lightweight form of chaos engineering, find problems before your users do