A RAG chatbot starter for expert answers on university degrees. Built using Vercel AI SDK, LangChain, Upstash Vector, and OpenAI.
> [!NOTE]
> This project is a Community Project. The project is maintained and supported by the community. Upstash may contribute but does not officially support or assume responsibility for it.
DegreeGuru is a project designed to teach you how to build your own AI RAG chatbot on any custom data. This chatbot is trained on data from Stanford University as an example, but it is totally domain agnostic: we've created this project so you can turn it into a chatbot with your very own data by simply modifying the `crawler.yaml` file.
For local development, we recommend forking this project and cloning the forked repository to your local machine by running the following command:

```bash
git clone git@github.com:[YOUR_GITHUB_ACCOUNT]/DegreeGuru.git
```
This project contains two primary components: the crawler and the chatbot. First, we'll take a look at how the crawler extracts information from any website you point it to. This data is automatically stored in an Upstash Vector database. If you already have a vector database available, the crawling stage can be skipped.
The crawler is written in Python as a Scrapy project with a custom spider. The spider implements a `parse_page` callback, invoked each time the spider visits a webpage. This callback splits the text on the webpage into chunks, generates vector embeddings for each chunk, and upserts those vectors into your Upstash Vector database. Each vector stored in the database includes the original text and the source URL as metadata.
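To make this concrete, here is a minimal sketch of what such a callback could look like. This is not the project's actual implementation; the helper wiring and names are assumptions based on the configuration shown later in this README:

```python
# Hypothetical sketch of a parse_page-style callback (illustrative only)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from upstash_vector import Index

index = Index.from_env()  # reads UPSTASH_VECTOR_REST_URL / _TOKEN
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

def parse_page(response):
    # Collect the visible text on the page and split it into chunks
    text = " ".join(response.css("body *::text").getall())
    chunks = splitter.split_text(text)

    # Embed every chunk and upsert it, keeping the original text and
    # source URL as metadata so answers can point back to their source
    vectors = embeddings.embed_documents(chunks)
    index.upsert(
        vectors=[
            (f"{response.url}#{i}", vector, {"text": chunk, "url": response.url})
            for i, (chunk, vector) in enumerate(zip(chunks, vectors))
        ]
    )
```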
To run the crawler, follow these steps:
> [!TIP]
> If you have Docker installed, you can skip the "Configure Environment Variables" and "Install Required Python Libraries" sections. Instead, you can simply update the environment variables in `docker-compose.yml` and run `docker-compose up`. This will create a container running our crawler. Don't forget to configure the crawler as explained in the following sections!
If you don't already have an Upstash Vector database, create one here and set 1536 as the vector dimension. We use 1536 because that is the dimension of the vectors produced by the embedding model we will use (`text-embedding-ada-002`).
The following environment variables should be set:

```bash
# Upstash Vector credentials retrieved here: https://console.upstash.com/vector
UPSTASH_VECTOR_REST_URL=****
UPSTASH_VECTOR_REST_TOKEN=****

# OpenAI key retrieved here: https://platform.openai.com/api-keys
OPENAI_API_KEY=****
```
To install the libraries, we suggest setting up a virtual Python environment. Before starting the installation, navigate to the `degreegurucrawler` directory.
To set up a virtual environment, first install the `virtualenv` package:

```bash
pip install virtualenv
```
Then, create a new virtual environment and activate it:

```bash
# create environment
python3 -m venv venv

# activate environment
source venv/bin/activate
```
Finally, use `requirements.txt` to install the required libraries:

```bash
pip install -r requirements.txt
```
After setting these environment variables, we are almost ready to run the crawler. The next step is configuring the crawler itself, primarily done through the `crawler.yaml` file located in the `degreegurucrawler/utils` directory. There is also one important setting in the `settings.py` file, which we'll cover below.
The `crawler.yaml` file has two main sections: `crawler` and `index`:

```yaml
crawler:
  start_urls:
    - https://www.some.domain.com
  link_extractor:
    allow: '.*some\.domain.*'
    deny:
      - "#"
      - '\?'
      - about
index:
  openAI_embedding_model: text-embedding-ada-002
  text_splitter:
    chunk_size: 1000
    chunk_overlap: 100
```
In the `crawler` section, there are two subsections:

- `start_urls`: the entry points our crawler will start searching from
- `link_extractor`: a dictionary passed as arguments to `scrapy.linkextractors.LinkExtractor`. Some important parameters are:
  - `allow`: only extract links matching the given regex(es)
  - `allow_domains`: only extract links matching the given domain(s)
  - `deny`: deny links matching the given regex(es)

In the `index` section, there are two subsections:

- `openAI_embedding_model`: the embedding model to use
- `text_splitter`: a dictionary passed as arguments to `langchain.text_splitter.RecursiveCharacterTextSplitter` (see the sketch after this list)
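For illustration, here is roughly how these two sections could be consumed in code. The file path matches this README; the variable names are assumptions, not the project's actual code:

```python
# Sketch: turning the crawler.yaml sections into the objects described above
import yaml
from scrapy.linkextractors import LinkExtractor
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("degreegurucrawler/utils/crawler.yaml") as f:
    config = yaml.safe_load(f)

# link_extractor entries map directly onto LinkExtractor keyword arguments
link_extractor = LinkExtractor(**config["crawler"]["link_extractor"])

# text_splitter entries map onto RecursiveCharacterTextSplitter arguments
splitter = RecursiveCharacterTextSplitter(**config["index"]["text_splitter"])
```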
The `settings.py` file has an important setting called `DEPTH_LIMIT`, which determines how many consecutive links our spider may follow from a start URL. A high value lets the crawler visit the deepest corners of a website, but it takes longer to finish, possibly with diminishing returns. A low value could end the crawl before extracting relevant information.
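For reference, `DEPTH_LIMIT` is a standard Scrapy setting and looks like this in `settings.py` (the exact value used in the project may differ):

```python
# degreegurucrawler/settings.py (illustrative value)
DEPTH_LIMIT = 5  # maximum number of consecutive links to follow from a start URL
```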
If pages are skipped due to the `DEPTH_LIMIT`, Scrapy logs those skipped URLs for us. Because this usually produces a lot of log output, we've disabled this logger in our project. If you'd like to keep it enabled, remove `"scrapy.spidermiddlewares.depth"` from the `disable_loggers` list in the `degreegurucrawler/spider/configurable.py` file.
That's it! 🎉 We've configured our crawler and are ready to run it using the following command:

```bash
scrapy crawl configurable --logfile degreegurucrawl.log
```
Note that running this might take time. You can monitor the progress by looking at the log file `degreegurucrawl.log` or the metrics of your Upstash Vector database dashboard as shown below.
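If you prefer checking progress from a script rather than the dashboard, the Upstash Vector Python SDK exposes an `info()` call. A small sketch, assuming the same environment variables as the crawler:

```python
# Check how many vectors have been upserted so far
from upstash_vector import Index

index = Index.from_env()  # reads UPSTASH_VECTOR_REST_URL / _TOKEN
print(index.info())  # includes the current vector count
```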
> [!TIP]
> If you want to do a dry run (without creating embeddings or a vector database), simply comment out the line where we pass the `callback` parameter to the `Rule` object in `ConfigurableSpider`.
In this section, we'll explore how to chat with the data we've just crawled and stored in our vector database. Here's an overview of what this will look like architecturally:
Before we can run the chatbot locally, we need to set the environment variables shown in the `.env.local.example` file. Rename this file by removing the `.example` ending, leaving us with `.env.local`. Your `.env.local` file should look like this:
```bash
# Redis tokens retrieved here: https://console.upstash.com/
UPSTASH_REDIS_REST_URL=
UPSTASH_REDIS_REST_TOKEN=

# Vector database tokens retrieved here: https://console.upstash.com/vector
UPSTASH_VECTOR_REST_URL=
UPSTASH_VECTOR_REST_TOKEN=

# OpenAI key retrieved here: https://platform.openai.com/api-keys
OPENAI_API_KEY=
```
The first four variables are provided by Upstash; you can visit the commented links to retrieve these tokens. You can find the vector database tokens here:
The `UPSTASH_REDIS_REST_URL` and `UPSTASH_REDIS_REST_TOKEN` are needed for rate limiting based on IP address. To get these secrets, go to the Upstash dashboard and create a Redis database.
Finally, set the `OPENAI_API_KEY` environment variable, which you can get here. This key allows us to vectorize user queries and generate responses.
That's the setup done! 🎉 We've configured our crawler and set up all necessary environment variables. After running `npm install` to install the packages needed to run the app, we can start our chatbot using the command:

```bash
npm run dev
```
Visit `http://localhost:3000` to see your chatbot live in action!
You can use this chatbot in two different modes: with streaming enabled or disabled. To toggle streaming, navigate to `src/app/route/guru` and open the `route.tsx` file. Setting `returnIntermediateSteps` to `true` disables streaming; setting it to `false` enables streaming.
To customize the chatbot further, you can update the `AGENT_SYSTEM_TEMPLATE` in your `route.tsx` file to better match your specific use case.
Congratulations on setting up your own AI chatbot! We hope you learned a lot by following along and seeing how the different parts of this app, namely the crawler, vector database, and LLM, play together. A major focus in developing this project was user-friendly design and adaptable settings, so you can make it fit your own use case.
The above implementation works great for a variety of use cases. Still, there are a few limitations we'd like to mention:
- The `UpstashVectorStore` used with LangChain currently only implements the `similaritySearchVectorWithScore` method needed for our agent. Once we're done developing our native LangChain integration, we'll update this project accordingly.