When building an AI application, one of the biggest concerns is abuse: bad actors exploiting your API endpoints and running up excessive usage costs for your application.
This guide walks through setting up rate limiting using the Vercel AI SDK and Vercel KV, so you can build powerful AI experiences with peace of mind.
Rate limiting is a method used to regulate network traffic by defining a maximum number of requests that a client can send to a server within a given time frame.
- Maintain Service Availability: Rate limiting shields your services from being flooded with requests. Controlling request volume helps sustain performance and keeps the service available.
- Manage Costs Effectively: Rate limiting keeps your bill in check by preventing unexpected surges in usage. This is particularly important for services that bill per request.
- Safeguard Against Malicious Activities: Rate limiting is crucial when working with AI providers and Large Language Models (LLMs). It acts as a defense against malicious activity or misuse, such as DDoS attacks.
- Implement Usage Tiers Based on Subscription Plans: Rate limiting enables the establishment of different usage levels. For instance, free users may be restricted to a specific number of requests each day, whereas premium users may be granted a more generous limit.
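To make the idea concrete, here is a minimal in-memory sketch of a sliding-window limiter with per-plan limits. It is only an illustration of the concept, not how any particular library implements it: the names `SlidingWindowLimiter` and `PLAN_LIMITS` and the specific numbers are assumptions made up for this example, and a real deployment would keep the counters in a shared store rather than in process memory.

```typescript
type Plan = 'free' | 'premium'

interface Limit {
  max: number // maximum requests allowed per window
  windowMs: number // window length in milliseconds
}

// Hypothetical tiers; the numbers are illustrative only.
const PLAN_LIMITS: Record<Plan, Limit> = {
  free: { max: 5, windowMs: 10000 },
  premium: { max: 50, windowMs: 10000 }
}

class SlidingWindowLimiter {
  // Timestamps of accepted requests, per client
  private hits = new Map<string, number[]>()

  allow(clientId: string, limit: Limit, now: number = Date.now()): boolean {
    const cutoff = now - limit.windowMs
    // Keep only the requests that still fall inside the window
    const recent = (this.hits.get(clientId) ?? []).filter(t => t > cutoff)
    const allowed = recent.length < limit.max
    if (allowed) recent.push(now)
    this.hits.set(clientId, recent)
    return allowed
  }
}
```

Calling `allow(clientId, PLAN_LIMITS.free)` on each request returns `false` once the client has used up its allowance for the current window, at which point the server would typically respond with a 429 status.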
Vercel's frontend cloud gives developers frameworks, workflows, and infrastructure to build a faster, more personalized web.
We are the creators of Next.js, the React framework, and have zero-configuration support for all major frontend frameworks.
The Vercel AI SDK is an open-source library designed to help developers build conversational streaming user interfaces in JavaScript and TypeScript.
With the Vercel AI SDK, you can build beautiful streaming experiences similar to ChatGPT in just a few lines of code.
Vercel KV is a globally distributed, durable key-value store.
- Durable: It offers an easy-to-use, durable data store that ensures your data is stored safely.
- Low latency: Being globally distributed, it offers low latency reads from anywhere in the world.
If you would prefer to start from a template instead of manually adding rate limiting to your AI project, we have created a Next.js template that uses Vercel AI SDK and Vercel KV (for handling rate limits).
Continue following the guide to manually add rate limiting to your existing application.
Create a new Vercel KV instance from the Vercel dashboard. Choose your primary region and additional read regions if desired. You can follow our quickstart if you prefer.
Connect your new KV database to your Vercel project. This will automatically add the required environment variables to connect to your new durable Redis database.
To simplify implementing rate limiting, we recommend @upstash/ratelimit, a powerful HTTP-based rate limiting library with support for a variety of algorithms.
This library allows setting multiple rate limits based on your own logic, such as a user's plan. Further, while the Vercel Edge Middleware is "hot," it will intelligently cache results and reduce the number of calls to Vercel KV, helping prevent unnecessary usage of your database as well.
Check out the documentation to see all options for this library.
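The caching idea can be sketched roughly as follows: once a client is known to be blocked until some reset time, later requests from that client can be rejected from memory without another round trip to the store. This is a conceptual sketch only and does not reflect the library's actual internals; `withBlockCache`, `LimitCheck`, and `LimitResult` are hypothetical names, and the store check is modeled as a synchronous function so the example runs anywhere (a real KV lookup would return a promise).

```typescript
interface LimitResult {
  success: boolean
  reset: number // Unix timestamp (ms) at which the client may retry
}

// Stand-in for a store-backed check (hypothetical, synchronous for simplicity)
type LimitCheck = (id: string, now: number) => LimitResult

function withBlockCache(check: LimitCheck): LimitCheck {
  // Clients currently known to be blocked, mapped to their reset time
  const blockedUntil = new Map<string, number>()

  return (id, now) => {
    const until = blockedUntil.get(id)
    if (until !== undefined) {
      // Still blocked: answer from memory and skip the store entirely
      if (now < until) return { success: false, reset: until }
      blockedUntil.delete(id)
    }
    const result = check(id, now)
    if (!result.success) blockedUntil.set(id, result.reset)
    return result
  }
}
```

The cache only lives as long as the running instance, so it reduces store calls during bursts from an already-blocked client rather than replacing the store as the source of truth.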
The Vercel AI SDK provides you with a few hooks and utilities that you can use to implement streaming chat experiences for your AI application.
Inside your Next.js application, create a page.tsx file inside the App Router (app/) and add the following code:
'use client'

import { useChat } from 'ai/react'
import { toast } from 'sonner' // this can be any toast library of your choice

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    onError: err => {
      toast.error(err.message)
    }
  })

  return (
    <div className="flex flex-col w-full max-w-md py-24 mx-auto stretch">
      {messages.length > 0
        ? messages.map(m => (
            <div key={m.id} className="whitespace-pre-wrap">
              {m.role === 'user' ? 'User: ' : 'AI: '}
              {m.content}
            </div>
          ))
        : null}

      <form onSubmit={handleSubmit}>
        <input
          className="fixed bottom-0 w-full max-w-md p-2 mb-8 border border-gray-300 rounded shadow-xl"
          value={input}
          placeholder="Say something..."
          onChange={handleInputChange}
        />
      </form>
    </div>
  )
}
If you're using the Pages Router, you can follow the instructions here instead.
Then, create a Route Handler to stream in your chat response from OpenAI:
import { Configuration, OpenAIApi } from 'openai-edge'
import { OpenAIStream, StreamingTextResponse } from 'ai'
import { kv } from '@vercel/kv'
import { Ratelimit } from '@upstash/ratelimit'

// Create an OpenAI API client (that's edge friendly!)
const config = new Configuration({
  apiKey: process.env.OPENAI_API_KEY
})
const openai = new OpenAIApi(config)

// Set the runtime to edge to allow for Edge Streaming (https://vercel.fyi/streaming)
export const runtime = 'edge'

export async function POST(req: Request) {
  // Only rate limit when the Vercel KV environment variables are present
  if (process.env.KV_REST_API_URL && process.env.KV_REST_API_TOKEN) {
    const ip = req.headers.get('x-forwarded-for')
    const ratelimit = new Ratelimit({
      redis: kv,
      // rate limit to 5 requests per 10 seconds
      limiter: Ratelimit.slidingWindow(5, '10s')
    })

    const { success, limit, reset, remaining } = await ratelimit.limit(
      `ratelimit_${ip}`
    )

    if (!success) {
      return new Response('You have reached your request limit for the day.', {
        status: 429,
        headers: {
          'X-RateLimit-Limit': limit.toString(),
          'X-RateLimit-Remaining': remaining.toString(),
          'X-RateLimit-Reset': reset.toString()
        }
      })
    }
  } else {
    console.log(
      'KV_REST_API_URL and KV_REST_API_TOKEN env vars not found, not rate limiting...'
    )
  }

  // Extract the `messages` from the body of the request
  const { messages } = await req.json()

  // Ask OpenAI for a streaming chat completion given the messages
  const response = await openai.createChatCompletion({
    model: 'gpt-3.5-turbo',
    stream: true,
    messages: messages.map((message: any) => ({
      content: message.content,
      role: message.role
    }))
  })

  // Convert the response into a friendly text-stream
  const stream = OpenAIStream(response)
  // Respond with the stream
  return new StreamingTextResponse(stream)
}
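One detail worth considering in the POST handler above: the rate limit key is built directly from the x-forwarded-for header, which is null when the header is absent and can carry a comma-separated list of addresses when a request passes through proxies. A small helper along these lines could normalize it. This is a sketch under assumptions: `clientKey` and the `'anonymous'` fallback are hypothetical choices for illustration, not part of the guide's code or of any library.

```typescript
// Build a rate limit key from the raw x-forwarded-for header value.
function clientKey(
  forwardedFor: string | null,
  prefix: string = 'ratelimit_'
): string {
  // Use the first address in the list, which by convention is the client
  const first = forwardedFor?.split(',')[0]?.trim()
  // Assumption: requests without a usable address share one bucket
  return `${prefix}${first || 'anonymous'}`
}
```

In the handler this would be used as `ratelimit.limit(clientKey(req.headers.get('x-forwarded-for')))`. Sharing one bucket for unidentified requests is the stricter choice; whether that suits your application depends on how your traffic reaches the endpoint.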
In the rate-limiting block at the top of the POST handler above, we're using the @upstash/ratelimit library and Vercel KV to handle requests to the chat endpoint. If a client exceeds the predefined rate limit (5 requests every 10 seconds), they will get a 429 response with the error "You have reached your request limit for the day."
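On the client side, the rate limit headers sent with the 429 response can be turned into a more specific message. The sketch below assumes the X-RateLimit-Reset header holds a Unix timestamp in milliseconds, which is our understanding of the `reset` value returned by @upstash/ratelimit; verify this against the library documentation. The helper names and the `HeaderReader` interface are hypothetical, and how you obtain the response headers depends on your client code.

```typescript
// Minimal shape shared by the standard Headers object and test doubles
interface HeaderReader {
  get(name: string): string | null
}

// Seconds until the limit resets, or null if the header is missing or invalid
function secondsUntilReset(
  headers: HeaderReader,
  now: number = Date.now()
): number | null {
  const raw = headers.get('X-RateLimit-Reset')
  if (raw === null) return null
  const resetMs = Number(raw) // assumed: Unix timestamp in milliseconds
  if (!Number.isFinite(resetMs)) return null
  return Math.max(0, Math.ceil((resetMs - now) / 1000))
}

// User-facing message for a rate-limited response, or null for other statuses
function rateLimitMessage(
  status: number,
  headers: HeaderReader,
  now: number = Date.now()
): string | null {
  if (status !== 429) return null
  const wait = secondsUntilReset(headers, now)
  return wait === null
    ? 'Too many requests. Please try again shortly.'
    : `Too many requests. Try again in ${wait}s.`
}
```

A message like this could be passed to the toast in place of the raw error text, so users know how long to wait before retrying.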
Securing AI applications doesn't have to be a daunting task. By implementing rate limiting with the Vercel AI SDK and Vercel KV, developers can keep their services running smoothly and their costs in check. This guide has shown how to set up these safeguards, whether you're starting from scratch or adding to an existing project. With tools like @upstash/ratelimit and ready-made templates, you can build powerful AI experiences without losing sleep over potential abuse or unexpected bills. It's all about building smarter, not harder, and with these steps, you're well on your way.