What is Retrieval Augmented Generation (RAG)?

Learn how RAG enhances AI applications by providing Large Language Models (LLMs) with relevant context, enabling accurate responses to queries about proprietary data, recent information, and domain-specific knowledge.

Last updated October 27, 2025

Let's say you are a developer for AcmeGadgets, a fictitious company that sells smart home devices (smart thermostats, lights, and speakers). Through reviewing user paths, you find that customers often get lost in your documentation pages.

You have a few options to help customers find answers faster. You could reorganize your documentation structure, improve search functionality, or add more links between related articles. However, these solutions still require customers to locate the information themselves within the structure you've chosen.

Another approach is to bring the answers directly to your customers through an AI-powered support chatbot. This is where RAG becomes valuable: it allows LLMs to provide accurate, context-specific answers by retrieving relevant information from your documentation before generating a response.

In this guide, you'll learn how to build a support chatbot powered by RAG that searches through product documentation, retrieves the most relevant information, and delivers accurate answers based on your actual content.

From the chat UI, you could wrap the user question in a prompt and send it directly to your LLM of choice. The LLM may have a version of AcmeGadgets' documentation in its training data. However, that version would not be the most recent, so the returned answers can be inaccurate, or the LLM can hallucinate (generate information that sounds plausible but is incorrect) if it does not find an answer in this older version of the documentation.

User asks: My Acme Thermo X100 thermostat keeps disconnecting from Wi-Fi. What should I do?

Standard chatbot replies: Try resetting your thermostat by unplugging it for 30 seconds, then plugging it back in. Make sure it’s running the latest firmware. You can usually update it in the app. If the issue continues, it may be a problem with your internet provider or the thermostat’s internal antenna. You might need to replace the device.

Problems with the answer:

  • The instructions about "unplugging it for 30 seconds" and "updating the firmware" are not part of any existing documentation. The model extrapolated based on its general knowledge of this kind of troubleshooting.
  • It did not provide any reference to a guide or manual.
  • Mentioning "replacing the device" in the first answer may cause frustration.

You can provide the LLM with the most recent context by including the full text of AcmeGadgets' documentation in the prompt.

This will likely exceed the LLM's context window, which is the maximum amount of text an LLM can process in a single request. Even models with large context windows (like those supporting 128k tokens) struggle with complete documentation sets, which can easily contain millions of tokens.

Even if your documentation does fit within the context window, processing it for every user question becomes prohibitively expensive and slow.

With RAG, you add two steps before sending the prompt to the LLM:

You first search product manuals, FAQs, warranty docs, and troubleshooting guides. This uses semantic search to find the most relevant documentation chunks. For example, when a user asks about "Wi-Fi disconnection issues", the system searches your vector database and retrieves only the documentation sections that match the query's meaning. This surfaces conceptually related content, not just keyword matches.

The most widely used algorithm for determining relevancy is cosine similarity, which measures the angle between two vectors in high-dimensional space. A cosine similarity score close to 1 indicates high relevance, while scores near 0 suggest low relevance. Other common algorithms include Euclidean distance (measuring the straight-line distance between vectors) and dot product similarity (combining both magnitude and direction). Most vector databases use cosine similarity as the default because it handles varying document lengths well and focuses purely on semantic meaning.
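
To make the relevance calculation concrete, here is a minimal sketch of cosine similarity in TypeScript. The helper and file path are illustrative; in practice your vector database computes this score for you.

/lib/similarity.ts
// Hypothetical helper: your vector database normally computes this for you.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;

  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  // 1 means the vectors point in the same direction (high relevance),
  // 0 means they are unrelated, -1 means they point in opposite directions.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}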

Next, you craft an augmented prompt from the most relevant results and the user question. In this step, you combine the retrieved context with the user's original question into a structured prompt. This prompt instructs the LLM to answer based solely on the provided documentation, reducing hallucinations and ensuring accuracy. The LLM then generates a response grounded in your actual product documentation rather than its general training data.
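
For illustration, here is a minimal sketch of that augmentation step. The buildAugmentedPrompt helper, its wording, and the file path are hypothetical; the API route shown later achieves the same effect with a system prompt and a tool call instead of assembling the prompt by hand.

/lib/prompt.ts
// Hypothetical helper: combine retrieved chunks with the user's question
// into a single prompt that constrains the LLM to the provided context.
export function buildAugmentedPrompt(
  chunks: { content: string }[],
  question: string,
): string {
  const context = chunks.map((chunk) => chunk.content).join('\n\n');

  return `Answer the question using only the documentation below.
If the documentation does not contain the answer, say you don't have that information.

Documentation:
${context}

Question: ${question}`;
}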

User asks: My Acme Thermo X100 thermostat keeps disconnecting from Wi-Fi. What should I do?

RAG-powered chatbot replies:

According to the Acme Thermo X100 Setup Guide (page 12), Wi-Fi disconnections are often caused by weak signal strength.

  1. Try moving your router within 10 meters of the thermostat.
  2. Make sure your Wi-Fi is on 2.4 GHz (the X100 doesn’t support 5 GHz).
  3. Restart the thermostat by holding the side button for 5 seconds.

If the problem continues, see the full troubleshooting guide here. (This would be linked automatically to the related source)

The following outlines how you would implement RAG for the AcmeGadgets support chatbot.

  1. Collect data: Gather all AcmeGadgets documentation from different sources such as the Thermo X100 setup PDF, troubleshooting guides for smart lights, warranty documents, HTML FAQ pages, and speaker installation manuals.
  2. Chunk & embed: Break down the documentation into chunks that balance context and focus. Each chunk should contain enough information for the LLM to understand the topic without including multiple unrelated concepts. Chunks that are too small lack context, while chunks that are too large dilute search relevance by mixing different topics together. For example, the "Wi-Fi connectivity troubleshooting" section from the Thermo X100 manual becomes one chunk, while the "Initial setup" section becomes another. Convert each chunk to a vector embedding that captures its semantic meaning (see the embedding sketch after this list).
  3. Store in a vector DB: Save all AcmeGadgets documentation embeddings in a vector database. Set up an automated process that updates the database whenever the product team publishes new manuals or updates existing documentation.
  4. Build retrieval layer: When an AcmeGadgets customer asks "My Thermo X100 keeps disconnecting from Wi-Fi," the system searches the vector database and retrieves the top 3-5 most relevant chunks. It will pull from the Thermo X100 troubleshooting guide, Wi-Fi setup instructions, and related FAQ entries.
  5. Augment LLM prompt: Combine the retrieved documentation chunks with the customer's question into a structured prompt. Instruct the LLM to answer based only on the provided AcmeGadgets documentation, ensuring responses reference actual product manuals rather than generic troubleshooting advice.
  6. Respond to user: The chatbot delivers a clear, actionable answer with specific steps from the Thermo X100 manual, includes the source reference (page 12 of the setup guide), and provides a link to the full troubleshooting documentation.
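
Steps 2 and 3 could look roughly like the sketch below, which uses the AI SDK's embedMany function to embed all chunks in a single batched call. The file path, the embedAndStore helper, the insertRows callback, and the embedding model choice are assumptions; the write step depends on your vector database.

/lib/embed-documentation.ts
import { openai } from '@ai-sdk/openai';
import { embedMany } from 'ai';

type DocumentChunk = {
  content: string;
  metadata: { source: string; page?: number; section?: string };
};

// Embed every chunk in one batched call, then hand the rows to your
// vector database write function (insertRows is a placeholder for that step).
export async function embedAndStore(
  chunks: DocumentChunk[],
  insertRows: (rows: (DocumentChunk & { embedding: number[] })[]) => Promise<void>,
) {
  const { embeddings } = await embedMany({
    // Embedding model choice is an assumption; use whichever model you standardize on.
    model: openai.embedding('text-embedding-3-small'),
    values: chunks.map((chunk) => chunk.content),
  });

  await insertRows(chunks.map((chunk, i) => ({ ...chunk, embedding: embeddings[i] })));
}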

This is an example API route using the AI SDK with a tool call to a function called retrieveDocumentation, which finds the most relevant chunks from the documentation database based on the user's query.

app/api/chat/route.ts
import { retrieveDocumentation } from '@/lib/actions/documentation';
import { openai } from '@ai-sdk/openai';
import { convertToModelMessages, stepCountIs, streamText, tool } from 'ai';
import { z } from 'zod';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-5'),
    system: `You are a helpful AcmeGadgets support assistant.
Check the product documentation before answering any questions.
Only respond to questions using information from tool calls.
If no relevant information is found, respond "I don't have specific information about that in our documentation."`,
    messages: convertToModelMessages(messages),
    // Allow a follow-up generation step so the model can answer from the tool results.
    stopWhen: stepCountIs(3),
    tools: {
      retrieveDocumentation: tool({
        description:
          'Search AcmeGadgets product documentation for relevant information about setup, troubleshooting, and product features.',
        inputSchema: z.object({
          query: z.string().describe('the search query to find relevant documentation'),
        }),
        // Delegate to the server action that queries the vector database.
        execute: async ({ query }) => retrieveDocumentation({ query }),
      }),
    },
  });

  return result.toUIMessageStreamResponse();
}

When retrieving documentation chunks, filter out results that don't meet a minimum relevance score. This prevents the LLM from receiving unrelated content that could confuse the response.

/lib/actions/documentation.ts
const relevantChunks = searchResults.filter(
  (result) => result.similarity > 0.7, // Only include chunks with 70%+ similarity
);

For the AcmeGadgets chatbot, if a customer asks about the Thermo X100 but the highest similarity score is 0.4, the system should respond with "I don't have specific information about that in our documentation" rather than forcing an answer from irrelevant chunks.
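
Inside retrieveDocumentation, that fallback can be a simple guard after the similarity filter shown above. The exact return shape is up to you and your tool's contract; this is just a sketch.

/lib/actions/documentation.ts
// If nothing clears the similarity threshold, return an explicit "no match" result
// so the LLM gives its "I don't have specific information..." response instead of
// answering from loosely related chunks.
if (relevantChunks.length === 0) {
  return 'No relevant documentation found.';
}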

Retrieve only the top 3-5 most relevant chunks to keep responses focused and avoid overwhelming the LLM with too much information. More chunks don't always lead to better answers.

/lib/actions/documentation.ts
const topChunks = relevantChunks
  .sort((a, b) => b.similarity - a.similarity)
  .slice(0, 5); // Take only top 5 results

Before chunking your documentation, clean and structure it properly. Remove navigation elements, headers, footers, and outdated content. For AcmeGadgets, this means extracting only the instructional content from PDFs and removing marketing copy or deprecated product information.
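
A cleaning pass might look like the following sketch, which assumes the raw pages are HTML and uses simple string replacements; a production pipeline would more likely rely on a proper HTML parser. The helper name and file path are illustrative.

/lib/clean-documentation.ts
// Hypothetical cleanup for HTML pages: drop navigation, headers, and footers,
// then strip the remaining tags and collapse whitespace.
export function cleanDocumentationPage(html: string): string {
  return html
    .replace(/<nav[\s\S]*?<\/nav>/gi, '')
    .replace(/<header[\s\S]*?<\/header>/gi, '')
    .replace(/<footer[\s\S]*?<\/footer>/gi, '')
    .replace(/<[^>]+>/g, ' ') // strip remaining tags
    .replace(/\s+/g, ' ') // collapse whitespace
    .trim();
}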

Each chunk should be self-contained enough to be understood on its own. Include relevant context like product names and section headings within the chunk:

/lib/actions/documentation.ts
const chunk = {
  content:
    "Acme Thermo X100 - Wi-Fi Troubleshooting: If your thermostat disconnects frequently, check that your router uses 2.4 GHz...",
  metadata: {
    source: "thermo-x100-manual.pdf",
    page: 12,
    section: "Wi-Fi Troubleshooting",
  },
};

When splitting documentation, overlap chunks by 50-100 tokens to maintain context across boundaries. This prevents important information from being split awkwardly between chunks.

/lib/actions/documentation.ts
const chunkSize = 500;
const overlapSize = 50;
const chunks: string[][] = []; // assumes `tokens` is an array of token strings

// Step through the token array by chunkSize - overlapSize so that consecutive
// chunks share overlapSize tokens of context across the boundary.
for (let i = 0; i < tokens.length; i += chunkSize - overlapSize) {
  chunks.push(tokens.slice(i, i + chunkSize));
}

You can use other approaches to improve LLM responses. For example, LLM fine-tuning lets you train an existing LLM on a smaller, more specific dataset. Use this approach when:

  • You want to change how the model responds or behaves
  • You have a large enough dataset for training
  • You need the model to learn new patterns or styles
