Build Your Operations Runbook So Anyone Can Fix Outages
Your bot just crashed during the quarterly board meeting demo. The on-call engineer doesn't know TypeScript. They need a step-by-step guide to diagnose, mitigate, and resolve the incident. Without a runbook, they're guessing. With one, they're following a proven playbook that gets the bot back online in minutes, not hours.
Outcome
Create a comprehensive RUNBOOK.md with SLOs and incident procedures, then simulate a production incident to validate your response process.
Fast Track
- Create RUNBOOK.md with sections: Setup, Secrets, Deploy, Incidents, Rollback
- Define SLOs: ack < 3s (p99), response < 15s (p95), error rate < 1%
- Simulate a rate limit incident and follow the runbook to resolution
Building on Previous Lessons
Your runbook leverages everything we've built:
- From error handling: Retry logic and rate limit handling procedures
- From deploy to Vercel: Deployment and rollback commands
- From structured logs: Structured logs for incident investigation
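The retry and rate limit handling that the runbook relies on usually comes down to one decision: how long to wait before the next attempt. Here is a minimal sketch of that decision, assuming a hypothetical `nextDelayMs` helper. The name, the defaults, and the cap are illustrative and are not the implementation from the error handling lesson.

```typescript
// Delay before the next retry. A server-provided hint (for example the
// retryAfter value logged on a rate-limited response) takes priority;
// otherwise fall back to capped exponential backoff.
function nextDelayMs(
  retryAttempt: number, // 1 for the first retry
  retryAfterMs?: number, // hint from the rate-limited response, if any
  baseMs = 1000, // assumed base delay
  maxMs = 60000, // assumed upper bound
): number {
  if (retryAfterMs !== undefined && retryAfterMs > 0) {
    return Math.min(retryAfterMs, maxMs);
  }
  return Math.min(baseMs * 2 ** (retryAttempt - 1), maxMs);
}

console.log(nextDelayMs(1, 30000)); // 30000
console.log(nextDelayMs(3)); // 4000
console.log(nextDelayMs(10)); // 60000
```

Honoring the server hint first matters during an incident: a fixed exponential schedule can retry too early and extend the rate limit window.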
Hands-On Exercise 5.3
Create an operations runbook and validate it with incident simulation:
Requirements:
- Create RUNBOOK.md with all operational procedures
- Define SLOs with specific thresholds
- Document incident response flowchart
- Include rollback procedures with verification steps
- Simulate a rate limit incident and resolve using the runbook
Implementation hints:
- Use actual Vercel commands and log queries
- Include correlation ID search examples
- Add specific error patterns to look for
- Create a decision tree for common failures
- Test the runbook by following it exactly
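The decision tree hint can be sketched as a small classifier that maps a log entry to the next runbook action. This is only an illustration: the `LogEntry` shape, the `classifyIncident` name, and the action labels are assumptions modeled on the error patterns this lesson's runbook lists (429 rate limits, 5xx or timeouts, missing scopes). They are not part of the bot's actual code.

```typescript
// Hypothetical log entry shape; field names are assumptions for illustration.
interface LogEntry {
  correlationId: string;
  statusCode?: number;
  error?: string;
}

type RunbookAction =
  | "apply-backoff"
  | "check-openai-status"
  | "update-manifest"
  | "escalate";

// Walk the decision tree top to bottom; the first matching pattern wins.
function classifyIncident(entry: LogEntry): RunbookAction {
  const message = (entry.error ?? "").toLowerCase();

  if (entry.statusCode === 429 || message.includes("rate_limited")) {
    return "apply-backoff";
  }
  if ((entry.statusCode ?? 0) >= 500 || message.includes("timeout")) {
    return "check-openai-status";
  }
  if (message.includes("missing_scope")) {
    return "update-manifest";
  }
  return "escalate"; // unknown pattern: hand off to a senior engineer
}

const sample: LogEntry = {
  correlationId: "ev_1234_1733456789",
  error: "rate_limited",
};
console.log(classifyIncident(sample)); // "apply-backoff"
```

Writing the tree as code forces every branch to end in a concrete action, which is the property an on-call engineer needs from the written runbook as well.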
SLOs to define:

    slos:
      acknowledgment:
        target: 99%
        threshold: 3000ms
        measurement: "Time to ack() Slack events"
      response_time:
        target: 95%
        threshold: 15000ms
        measurement: "Time from event to final response"
      error_rate:
        target: < 1%
        measurement: "Percentage of failed responses"
      availability:
        target: 99.9%
        measurement: "Bot responding to mentions"

Try It
- Create comprehensive runbook:

  /slack-agent/RUNBOOK.md

      # Slack Bot Operations Runbook

      ## Quick Reference
      - **Production URL**: https://slack-bot-prod.vercel.app
      - **Health Check**: https://slack-bot-prod.vercel.app/health
      - **Logs**: https://vercel.com/team/slack-bot-prod/functions
      - **On-call**: @oncall-slack-bot (PagerDuty)

      ## SLOs (Service Level Objectives)
      | Metric | Target | Threshold | Alert |
      |--------|--------|-----------|-------|
      | Event Acknowledgment | 99% | < 3s | PagerDuty High |
      | Response Time (p95) | 95% | < 15s | PagerDuty Low |
      | Error Rate | < 1% | - | PagerDuty Medium |
      | Availability | 99.9% | - | PagerDuty High |

      ## Common Issues Quick Fix

      ### Bot Not Responding
      1. Check health endpoint: `curl https://slack-bot-prod.vercel.app/health`
      2. Verify in Vercel dashboard: Functions tab → Check for errors
      3. Check Slack App config: Event Subscriptions → URL verified?
      4. If URL not verified: Redeploy with `pnpm dlx vercel --prod --force`

      ### Rate Limit Errors (429)
      1. Check logs for `rateLimitWaitMs` > 0
      2. Verify retry logic: `grep "retryAttempt" logs | tail -20`
      3. Temporary mitigation: Scale down concurrent requests
      4. Long-term: Implement request queuing

      ## Deployment Procedures

      ### Normal Deploy
      ```bash
      git pull origin main
      pnpm test
      pnpm dlx vercel --prod
      # Verify:
      curl https://slack-bot-prod.vercel.app/health
      ```

      ### Emergency Rollback
      ```bash
      # List recent deployments
      pnpm dlx vercel ls
      # Rollback to previous version
      pnpm dlx vercel rollback
      # Verify rollback
      curl https://slack-bot-prod.vercel.app/health
      # Check logs for normal operation
      ```

      ## Incident Response Flowchart
      ```
      ALERT FIRED
          ↓
      [Check Health] → Failed → [Check Vercel Status]
          ↓ OK                          ↓
      [Check Logs]               [Await Resolution]
          ↓
      [Correlation Search]
          ↓
      Error Pattern?
          ├─ 429/Rate Limit → [Apply Backoff]
          ├─ 5xx/Timeout → [Check OpenAI Status]
          ├─ Missing Scope → [Update Manifest]
          └─ Unknown → [Escalate to Senior]
      ```

      ## Log Investigation Commands

      ### Find Recent Errors
      ```
      Filter: level:50 OR level:40
      Time: Last 1 hour
      ```

      ### Track Specific Request
      ```
      Filter: correlationId:"EVENT_ID_TIMESTAMP"
      Shows: Full request lifecycle
      ```

      ### Check AI Performance
      ```
      Filter: operation:respondToMessage
      Aggregate: AVG(latencyMs), MAX(retryAttempt)
      ```

      ## Secret Rotation
      1. Generate new token in Slack App Config
      2. Update in Vercel: `pnpm dlx vercel env rm SLACK_BOT_TOKEN`
      3. Add new: `pnpm dlx vercel env add SLACK_BOT_TOKEN`
      4. Redeploy: `pnpm dlx vercel --prod --force`
      5. Verify: Test bot mention in Slack

      ## Monitoring Setup

      ### Vercel Monitoring
      - Enable Monitoring in project settings
      - Set alert for Function errors > 1%
      - Set alert for Function duration > 10s (p95)

      ### Custom Health Checks
      - Endpoint: `/health`
      - Frequency: Every 60 seconds
      - Alert: 2 consecutive failures

      ## Incident Communication

      ### Status Updates
      - Initial: "#incidents - Investigating bot responsiveness issues"
      - Update: "#incidents - Identified rate limiting, applying fixes"
      - Resolution: "#incidents - Resolved, bot operating normally"

      ### Post-Mortem Template
      - Duration: Start time - End time
      - Impact: % of requests affected
      - Root Cause: Specific technical issue
      - Resolution: Steps taken
      - Prevention: Long-term fixes

- Simulate rate limit incident:

  /slack-agent/scripts/simulate-incident.ts

      import { WebClient } from '@slack/web-api';

      // Assumes SLACK_BOT_TOKEN is set in the environment
      const client = new WebClient(process.env.SLACK_BOT_TOKEN);

      // Trigger multiple rapid requests to hit rate limit
      for (let i = 0; i < 50; i++) {
        await client.chat.postMessage({
          channel: 'C_TEST_CHANNEL',
          text: `@bot test message ${i}`
        });
      }

  Expected logs:

      [WARN] bolt-app {
        correlationId: 'ev_1234_1733456789',
        operation: 'respondToMessage',
        error: 'rate_limited',
        retryAfter: 30000,
        retryAttempt: 1
      }
      Rate limited, waiting 30s
      [INFO] bolt-app {
        correlationId: 'ev_1234_1733456789',
        operation: 'respondToMessage',
        retryAttempt: 2,
        rateLimitWaitMs: 30000,
        model: 'gpt-4o-mini'
      }
      Retry successful after backoff

- Follow runbook to resolve:

      # 1. Identify issue in logs
      pnpm dlx vercel logs --follow | grep "rate_limited"

      # 2. Check retry metrics
      # Filter: retryAttempt:>0
      # Shows: 47 requests with retries

      # 3. Verify backoff working
      # Filter: rateLimitWaitMs:>0
      # Shows: Proper exponential backoff applied

      # 4. Confirm resolution
      # Recent logs show normal operation

- Update incident log:

  /slack-agent/INCIDENTS.md

      ## 2024-12-06: Rate Limit Event

      **Duration**: 14:30 - 14:45 UTC (15 minutes)
      **Impact**: 47 messages delayed by 5-30 seconds
      **Root Cause**: Burst of 50 messages exceeded OpenAI rate limit

      **Timeline**:
      - 14:30 - Alert: Error rate spike to 94%
      - 14:31 - Identified rate_limited errors in logs
      - 14:32 - Confirmed retry logic engaging
      - 14:35 - Backoff successful, queue clearing
      - 14:45 - Normal operation restored

      **Resolution**:
      - Automatic retry with exponential backoff handled issue
      - No manual intervention required

      **Action Items**:
      - [ ] Implement request queue to smooth bursts
      - [ ] Add pre-emptive rate limit monitoring
      - [ ] Update alerts to distinguish retriable errors

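The 94% spike in the timeline comes from counting every rate-limited first attempt as a failure: 47 of 50 requests is 94%, even though every message was eventually delivered. The last action item asks alerts to separate those two cases. A minimal sketch of that distinction follows, assuming a hypothetical `RequestOutcome` record summarized from the structured logs. The `attempts` and `succeeded` fields are illustrative names.

```typescript
// Hypothetical per-request summary derived from logs; not an existing type in the bot.
interface RequestOutcome {
  correlationId: string;
  attempts: number; // 1 means no retry was needed
  succeeded: boolean; // final result after all retries
}

interface ErrorRates {
  firstAttemptFailureRate: number; // what a naive alert sees
  finalFailureRate: number; // what users actually experienced
}

function computeErrorRates(outcomes: RequestOutcome[]): ErrorRates {
  if (outcomes.length === 0) {
    return { firstAttemptFailureRate: 0, finalFailureRate: 0 };
  }
  // A request failed on its first attempt if it needed a retry or never succeeded.
  const firstAttemptFailed = outcomes.filter(
    (o) => o.attempts > 1 || !o.succeeded,
  ).length;
  const finalFailed = outcomes.filter((o) => !o.succeeded).length;
  return {
    firstAttemptFailureRate: firstAttemptFailed / outcomes.length,
    finalFailureRate: finalFailed / outcomes.length,
  };
}

// Recreate the simulated incident: 50 requests, 47 needed a retry, all succeeded in the end.
const incidentOutcomes: RequestOutcome[] = Array.from({ length: 50 }, (_, i) => ({
  correlationId: `ev_${i}`,
  attempts: i < 47 ? 2 : 1,
  succeeded: true,
}));

const incidentRates = computeErrorRates(incidentOutcomes);
console.log(incidentRates.firstAttemptFailureRate); // 0.94
console.log(incidentRates.finalFailureRate); // 0
```

Paging on the final failure rate while tracking the first-attempt rate as a warning would have kept this incident from firing a high-severity alert.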
Troubleshooting
Runbook Not Helping:
- Ensure it covers actual production scenarios
- Add more specific error patterns from real incidents
- Include actual commands that work, not theoretical ones
SLO Measurement Issues:
- Use structured logs to calculate metrics accurately
- Set up Vercel Analytics for automatic tracking
- Consider external monitoring (Datadog, New Relic)
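To make the first point concrete, the sketch below computes a nearest-rank percentile over latencies taken from structured logs and compares it with an SLO threshold such as the 15000 ms p95 response target defined earlier. The `percentile` and `meetsSlo` helpers and the sample numbers are assumptions for illustration. A real setup would read `latencyMs` values from your log source.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("percentile() needs at least one sample");
  }
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// An SLO such as "p95 under 15000 ms" holds when the percentile is below the threshold.
function meetsSlo(latenciesMs: number[], p: number, thresholdMs: number): boolean {
  return percentile(latenciesMs, p) < thresholdMs;
}

// Illustrative sample: 19 fast responses and one slow outlier (made-up numbers).
const responseLatenciesMs: number[] = [...new Array<number>(19).fill(4000), 32000];

console.log(percentile(responseLatenciesMs, 95)); // 4000
console.log(meetsSlo(responseLatenciesMs, 95, 15000)); // true
console.log(meetsSlo(responseLatenciesMs, 99, 15000)); // false
```

The same sample passes at p95 and fails at p99, which is why each SLO needs its percentile stated alongside its threshold.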
Incident Simulation Too Disruptive: Use a staging environment:

    # Deploy to staging (a deploy without --prod creates a preview deployment)
    pnpm dlx vercel
    # Run tests against staging URL
    SLACK_BOT_URL=https://staging.vercel.app pnpm run test:incident

Commit

    git add -A
    git commit -m "feat(ops): comprehensive operations runbook with incident procedures

    - Create RUNBOOK.md with deployment and rollback procedures
    - Define SLOs: 3s ack (p99), 15s response (p95), <1% errors
    - Document incident response flowchart
    - Add log investigation queries and commands
    - Include secret rotation and monitoring setup
    - Validate with simulated rate limit incident"

Done-When
- RUNBOOK.md exists with all sections
- SLOs defined with specific thresholds
- Incident response flowchart documented
- Rollback procedure tested and verified
- Rate limit incident simulated and resolved
Solution
The complete runbook structure includes:
- Quick reference section with all critical URLs and contacts
- SLO definitions with specific metrics and thresholds
- Common issues with step-by-step fixes
- Deployment procedures including rollback
- Incident flowchart for systematic response
- Log queries for investigation
- Secret rotation procedures
- Monitoring setup instructions
- Communication templates for incidents
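The runbook's health check rule, alert after 2 consecutive failures, can be sketched as a pure function over recent check results. The `shouldAlert` name and the boolean history format are assumptions for illustration. Polling `/health` every 60 seconds would be done by whatever monitor you use.

```typescript
// Each entry is one health check result, oldest first: true = healthy, false = failed.
function shouldAlert(history: boolean[], consecutiveFailures = 2): boolean {
  if (history.length < consecutiveFailures) {
    return false;
  }
  // Only the most recent N checks matter for the alert rule.
  return history.slice(-consecutiveFailures).every((ok) => !ok);
}

console.log(shouldAlert([true, false])); // false: a single blip
console.log(shouldAlert([true, false, false])); // true: two failures in a row
console.log(shouldAlert([false, false, true])); // false: already recovered
```

Requiring consecutive failures trades about one extra minute of detection time for far fewer pages caused by a single cold start or network blip.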
Key operational commands:

    # Health check
    curl https://slack-bot-prod.vercel.app/health
    # View logs
    pnpm dlx vercel logs --follow
    # List deployments
    pnpm dlx vercel ls
    # Rollback
    pnpm dlx vercel rollback
    # Force redeploy
    pnpm dlx vercel --prod --force

Key Takeaways
- Runbooks must be tested regularly or they become fiction
- SLOs should be measurable from your existing logs
- Incident response is about systematic investigation, not heroes
- Rollback capability is mandatory for production systems
- Chaos engineering finds problems before your users do