
Prevent runaway AI costs with pre-flight checks and token budgeting

After this lesson, you'll:

  • Reject requests costing >$0.10 BEFORE making them (watch the logs as expensive queries get blocked)
  • See exact cost estimates: "Request would cost $0.1234" in logs
  • Switch to cheaper models automatically based on your budget thresholds
# Cost-aware request handling in logs:
[INFO] Pre-flight cost check {
  inputTokens: 2341,
  estimatedOutputTokens: 500,
  estimatedCost: 0.0007,
  model: 'gpt-4o-mini',
  status: 'approved'
}
 
[INFO] AI request completed {
  model: 'gpt-4o-mini',
  actualCost: 0.0006436,
  estimatedCost: 0.0007,
  estimationAccuracy: '92%'
}
 
# When requests are too expensive:
[WARN] Request rejected - cost limit exceeded {
  estimatedCost: 0.1234,
  limit: 0.10,
  inputTokens: 15234,
  message: 'Request too large - break into smaller questions'
}

What you'll build: Pre-flight cost checks + token estimation + automatic model switching + manual spend review.

Your bot crashes at 3 AM with "rate limit exceeded." By morning, you've burned through $500 in retries because nobody was watching. The solution isn't magic metrics—it's preventing expensive requests before they happen. This lesson teaches you to build guardrails that save money while you sleep.

Outcome

Implement cost controls that prevent expensive AI requests before they burn money, using token estimation and pre-flight checks.

What AI Gateway Actually Provides

Before building cost controls, understand what Gateway IS versus what it isn't:

AI Gateway is:

  • A unified API endpoint routing to 100+ models across providers
  • Automatic provider fallback when primary fails
  • Request/response passthrough with no markup on BYOK pricing
  • A dashboard showing: Requests by Model, TTFT (time to first token), Token Counts, Spend (manual viewing only)

AI Gateway is NOT:

  • A metrics API you can query programmatically
  • A real-time alerting system
  • A budget enforcement tool
  • A way to get spend data into your app

What this means for production: You can't query Gateway metrics from code. But you CAN control costs by estimating token counts and rejecting expensive requests BEFORE making them. The AI SDK's usage field gives you actual token counts after each request; compute and store the costs yourself if you need historical tracking.

The production pattern: Pre-flight checks + your own tracking > hoping for magic metrics.
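
To make the "your own tracking" half of that pattern concrete, here is a minimal sketch of capturing the usage field after each request and storing it yourself. It is hypothetical: the in-memory array stands in for your database, trackedGenerate is an illustrative name, the model is passed as a plain string exactly as in this lesson's solution code, and the usage field names (promptTokens/completionTokens) follow this lesson's examples and may differ in your AI SDK version.

import { generateText } from "ai";

type UsageRecord = {
  timestamp: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
};

// Stand-in for a database table; swap for real persistence in production.
const usageLog: UsageRecord[] = [];

export async function trackedGenerate(model: string, prompt: string) {
  const result = await generateText({ model, prompt });

  // Store actual usage from the response so you can aggregate spend later.
  usageLog.push({
    timestamp: new Date().toISOString(),
    model,
    promptTokens: result.usage.promptTokens,
    completionTokens: result.usage.completionTokens,
  });

  return result.text;
}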

Fast Track

  1. Build token estimation for pre-flight cost checks
  2. Reject requests exceeding cost thresholds BEFORE making them
  3. Log actual vs estimated costs to tune your estimates
  4. Review Gateway dashboard manually for spend trends

Hands-On Exercise 4.5

Implement production cost controls without relying on non-existent metrics APIs:

Requirements:

  1. Token Estimation for pre-flight cost checks:

    • Estimate input tokens from message array
    • Assume conservative output token count
    • Calculate cost based on model pricing
  2. Request Rejection based on estimated cost:

    • Set per-request cost limit (e.g., $0.10)
    • Reject before making expensive API calls
    • Return user-friendly error messages
  3. Model Switching based on manual budget thresholds:

    • Environment variable for "cheap mode" toggle
    • Use gpt-3.5-turbo when over budget
    • Log when switching occurs
  4. Cost Logging for post-request analysis:

    • Log estimated vs actual costs
    • Track accuracy of estimates over time (see the sketch after this list)
    • Use for manual dashboard correlation
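
The solution later in this lesson logs per-request estimation accuracy; tracking it over time (requirement 4) is left to you. A minimal sketch, assuming in-memory samples (swap for your database) and illustrative function names:

// Hypothetical helpers for tuning the flat output-token assumption over time.
const accuracySamples: number[] = [];

export function recordEstimateAccuracy(actualCost: number, estimatedCost: number): void {
  if (estimatedCost > 0) {
    accuracySamples.push(actualCost / estimatedCost);
  }
}

export function averageEstimateAccuracy(): number {
  if (accuracySamples.length === 0) return 1;
  return accuracySamples.reduce((sum, ratio) => sum + ratio, 0) / accuracySamples.length;
}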

Implementation hints:

  • Open the AI settings via team selector: https://vercel.com/d?to=/[team]/~/ai/api-keys, then navigate to the AI Gateway section for your project
  • Gateway dashboard shows: Requests by Model, TTFT, Token Counts, Spend (no API access)
  • January 2025 pricing: gpt-4o-mini $0.00015 input, $0.0006 output per 1K tokens (see the worked example after this list)
  • AI SDK returns usage field with actual token counts after request completes
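
To see how the pricing hint turns into an estimate, take the numbers from the example logs: 2,341 input tokens plus an assumed 500 output tokens on gpt-4o-mini.

(2341 / 1000) × $0.00015 = $0.000351   (input)
(500 / 1000)  × $0.0006  = $0.000300   (output)
Total ≈ $0.000651

Rounded to four decimal places, that is the 0.0007 shown in the example logs.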

Try It

  1. Test normal request with cost logging:

    • Ask the bot a simple question in a short thread
    • Check logs for pre-flight and post-request cost tracking:
    [INFO] Pre-flight cost check {
      inputTokens: 2341,
      estimatedOutputTokens: 500,
      estimatedCost: 0.0007,
      model: 'gpt-4o-mini',
      status: 'approved'
    }
    
    [INFO] AI request completed {
      model: 'gpt-4o-mini',
      usage: {
        promptTokens: 2341,
        completionTokens: 487,
        totalTokens: 2828
      },
      actualCost: 0.0006436,
      estimatedCost: 0.0007,
      estimationAccuracy: '92%'
    }
    
  2. Test cost rejection with large context:

    • Create a thread with 50+ messages
    • Ask a question that would include all context
    • Watch the pre-flight check reject it BEFORE making the request:
    [WARN] Request rejected - cost limit exceeded {
      estimatedCost: 0.1234,
      limit: 0.10,
      inputTokens: 15234,
      estimatedOutputTokens: 500,
      model: 'gpt-4o-mini'
    }
    
    • Bot responds: "Your request is too large. Please break it into smaller questions. (Estimated cost: $0.1234)"
  3. Test model switching with budget threshold:

    • Set FORCE_CHEAP_MODEL=true in .env to simulate over-budget state
    • Ask a question
    • Watch logs show gpt-3.5-turbo selection:
    [WARN] Budget threshold triggered - using cheap model {
      reason: 'FORCE_CHEAP_MODEL environment variable set',
      selectedModel: 'gpt-3.5-turbo',
      normalModel: 'gpt-4o-mini'
    }
    
  4. Review Gateway dashboard manually:

    • Navigate to https://vercel.com/[team]/[project]/ai-gateway
    • Check "Requests by Model" chart
    • Review "Spend" over time
    • Compare dashboard spend with your logged costs
    • Note: You're viewing this manually—no API access exists
Gateway Dashboard Access

The Gateway dashboard is read-only and manual. You can't query these metrics from code, but you can review them periodically to:

  • Verify your cost estimates are accurate
  • Spot unusual spending patterns
  • Compare different models' actual costs
  • Track TTFT trends over time

Commit

git add -A
git commit -m "feat(cost-control): add pre-flight checks and token budgeting for AI requests"

Done-When

  • Pre-flight cost checks reject expensive requests before making them
  • Token estimation calculates cost from message array
  • Model switching based on budget threshold (environment variable)
  • Actual vs estimated costs logged for accuracy tracking
  • User-friendly error messages when requests rejected

Solution

Create /slack-agent/server/lib/ai/cost-control.ts:

/slack-agent/server/lib/ai/cost-control.ts
import type { ModelMessage } from "ai";
 
/**
 * Estimate tokens for messages array using rough character-to-token ratio
 * Use this for pre-flight cost estimation
 *
 * Note: This is an approximation. Actual tokens depend on tokenizer.
 * Rule of thumb: 1 token ≈ 4 characters for English text.
 * In production, use a real tokenizer like OpenAI's tiktoken for more accurate estimates:
 * https://github.com/openai/tiktoken
 */
export function estimateTokens(messages: ModelMessage[]): number {
  return messages.reduce((sum, msg) => {
    const content = typeof msg.content === 'string'
      ? msg.content
      : JSON.stringify(msg.content);
    // Rough estimate: 1 token per 4 characters
    return sum + Math.ceil(content.length / 4);
  }, 0);
}
 
/**
 * Calculate request cost before making it
 * Use for budget decisions and request rejection
 *
 * January 2025 pricing (per 1K tokens):
 * - gpt-4o-mini: $0.00015 input, $0.0006 output
 * - gpt-3.5-turbo: $0.0005 input, $0.0015 output
 * - gpt-4o: $0.0025 input, $0.01 output
 */
export function estimateRequestCost(
  inputTokens: number,
  outputTokens: number,
  model: string
): number {
  const costs: Record<string, { input: number; output: number }> = {
    'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
    'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
    'gpt-4o': { input: 0.0025, output: 0.01 },
  };
 
  const modelCost = costs[model] || costs['gpt-3.5-turbo'];
  return (
    (inputTokens / 1000) * modelCost.input +
    (outputTokens / 1000) * modelCost.output
  );
}
 
/**
 * Calculate actual cost from AI SDK usage response
 * Use to compare estimated vs actual and tune your estimates
 */
export function calculateActualCost(
  promptTokens: number,
  completionTokens: number,
  model: string
): number {
  return estimateRequestCost(promptTokens, completionTokens, model);
}
 
/**
 * Pre-flight check: Should we reject this request due to cost?
 * Reject obviously expensive requests BEFORE they burn money
 */
export function shouldRejectRequest(estimatedCost: number): {
  reject: boolean;
  reason?: string;
} {
  const MAX_REQUEST_COST = 0.10; // $0.10 per request cap
 
  if (estimatedCost > MAX_REQUEST_COST) {
    return {
      reject: true,
      reason: `Request would cost $${estimatedCost.toFixed(4)}, exceeds $${MAX_REQUEST_COST} limit`
    };
  }
 
  return { reject: false };
}
 
/**
 * Check if we should use cheap mode based on environment variable
 * In production, set this based on your own budget tracking
 */
export function shouldUseCheapModel(): boolean {
  return process.env.FORCE_CHEAP_MODEL === 'true';
}
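
Before wiring these helpers into the message handler, you can sanity-check them with a quick standalone script. This is hypothetical: the file location, message content, and expected numbers are illustrative.

import { estimateTokens, estimateRequestCost, shouldRejectRequest } from "./cost-control";

const messages = [
  { role: "user" as const, content: "Summarize the last deploy logs for me." },
];

const inputTokens = estimateTokens(messages); // ~10 tokens for this short message
const cost = estimateRequestCost(inputTokens, 500, "gpt-4o-mini");

console.log({ inputTokens, cost, ...shouldRejectRequest(cost) });
// → { inputTokens: 10, cost: 0.0003015, reject: false }
// The estimate is dominated by the assumed 500 output tokens.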

Update /slack-agent/server/lib/ai/respond-to-message.ts:

/slack-agent/server/lib/ai/respond-to-message.ts
import type { KnownEventFromType } from "@slack/bolt";
import { generateText, type ModelMessage, stepCountIs } from "ai";
import { app } from "~/app";
import {
  calculateActualCost,
  estimateRequestCost,
  estimateTokens,
  shouldRejectRequest,
  shouldUseCheapModel,
} from "./cost-control";
import { simulateFailure, withRetry } from "./retry-wrapper";
// ... rest of imports ...
 
export const respondToMessage = async ({
  messages,
  event,
  channel,
  thread_ts,
  botId,
  correlation,
}: RespondToMessageOptions) => {
  // Pre-flight cost estimation - catch expensive requests BEFORE they cost money
  const inputTokens = estimateTokens(messages);
  const estimatedOutputTokens = 500; // Conservative estimate for response
  const primaryModel = "gpt-4o-mini";
  const estimatedCost = estimateRequestCost(
    inputTokens,
    estimatedOutputTokens,
    primaryModel
  );
 
  app.logger.info("Pre-flight cost check", {
    ...correlation,
    inputTokens,
    estimatedOutputTokens,
    estimatedCost,
    model: primaryModel,
    status: estimatedCost > 0.10 ? "rejected" : "approved",
  });
 
  // Reject expensive requests BEFORE making them
  const rejectCheck = shouldRejectRequest(estimatedCost);
  if (rejectCheck.reject) {
    app.logger.warn("Request rejected - cost limit exceeded", {
      ...correlation,
      estimatedCost,
      limit: 0.1,
      inputTokens,
      estimatedOutputTokens,
      model: primaryModel,
    });
 
    // Return user-friendly error instead of making expensive request
    return `Your request is too large. Please break it into smaller questions or provide less context. (Estimated cost: $${estimatedCost.toFixed(4)})`;
  }
 
  // Model selection with budget awareness
  // In production, set FORCE_CHEAP_MODEL based on your own budget tracking
  const useCheapMode = shouldUseCheapModel();
  const models = useCheapMode
    ? ["gpt-3.5-turbo"] // Force cheap model only
    : ["gpt-4o-mini", "gpt-3.5-turbo"];
 
  if (useCheapMode) {
    app.logger.warn("Budget threshold triggered - using cheap model", {
      ...correlation,
      reason: "FORCE_CHEAP_MODEL environment variable set",
      selectedModel: "gpt-3.5-turbo",
      normalModel: "gpt-4o-mini",
    });
  }
 
  let lastError: unknown;
 
  for (const currentModel of models) {
    try {
      const result = await withRetry(
        async () => {
          app.logger.info("Attempting AI request", {
            ...correlation,
            model: currentModel,
            inputTokens,
            estimatedCost,
          });
 
          return await generateText({
            model: currentModel,
            system: `You are Slack Agent, a helpful assistant in Slack.
            // ... rest of system prompt ...
            `,
            messages,
            // ... rest of config ...
          });
        },
        {
          maxRetries: 3,
          initialDelayMs: 1000,
        }
      );
 
      // Log actual cost after request completes
      const actualCost = calculateActualCost(
        result.usage.promptTokens,
        result.usage.completionTokens,
        currentModel
      );
 
      const estimationAccuracy = ((actualCost / estimatedCost) * 100).toFixed(0);
 
      app.logger.info("AI request completed", {
        ...correlation,
        model: currentModel,
        usage: {
          promptTokens: result.usage.promptTokens,
          completionTokens: result.usage.completionTokens,
          totalTokens: result.usage.totalTokens,
        },
        actualCost,
        estimatedCost,
        estimationAccuracy: `${estimationAccuracy}%`,
      });
 
      return result.text;
    } catch (error) {
      // ... existing error handling ...
      lastError = error;
    }
  }
 
  throw lastError;
};

Screenshot placeholders:

[TODO: Add screenshot of Gateway dashboard showing "Requests by Model" chart]
[TODO: Add screenshot of Gateway dashboard showing "Spend" over time]
[TODO: Add screenshot of Gateway dashboard showing "TTFT" metrics]

Why No /gateway-status Command?

Earlier versions of AI Gateway documentation suggested programmatic metrics access. As of January 2025, Gateway provides a manual dashboard only—no API exists to query metrics from your code.

Building a /gateway-status command that shows real-time metrics would require:

  1. Storing usage data from every AI SDK response in your database
  2. Aggregating that data yourself
  3. Calculating costs based on stored token counts

This is a valid production pattern, but it's your own tracking system, not Gateway API integration. The lesson focuses on the more immediate value: preventing expensive requests before they happen.
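
If you do build that tracking system, the aggregation step might look like this minimal sketch. The names are hypothetical: a UsageRecord shape matching what you would store from the AI SDK's usage field, reusing estimateRequestCost from the solution's cost-control.ts as the pricing table.

import { estimateRequestCost } from "./cost-control";

type UsageRecord = {
  model: string;
  promptTokens: number;
  completionTokens: number;
};

// Summarize spend per model from your own stored records - this stands in
// for the Gateway metrics API that doesn't exist.
export function summarizeSpend(records: UsageRecord[]) {
  const byModel = new Map<string, { requests: number; cost: number }>();

  for (const record of records) {
    const entry = byModel.get(record.model) ?? { requests: 0, cost: 0 };
    entry.requests += 1;
    entry.cost += estimateRequestCost(
      record.promptTokens,
      record.completionTokens,
      record.model
    );
    byModel.set(record.model, entry);
  }

  return Object.fromEntries(
    [...byModel.entries()].map(([model, { requests, cost }]) => [
      model,
      { requests, cost: Number(cost.toFixed(4)) },
    ])
  );
}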

Building on Previous Lessons

This lesson applies cost awareness to everything we've built:

  • From Repository Flyover: You saw how the bot fetches context—now you'll optimize token usage for those contexts
  • From system prompts: System prompts stay the same across budget-aware model selection
  • From status communication: Status updates can inform users when using cheaper models
  • From error handling: Retry logic combined with cost checks prevents retry storms burning money
  • From Bolt Middleware: Correlation-style logging makes pre-flight and post-request cost logs queryable in production
  • Production reality: You can't query Gateway metrics, but you CAN prevent expensive requests—the more valuable pattern

What's Next

Section 5 covers deployment and production operations, taking your cost-controlled bot live in Deploy to Vercel.