RAG with ChromaDB

This page shows an example of how the Imagine SDK can be used with LangChain for retrieval-augmented generation (RAG). RAG is an information-retrieval technique that modifies interactions with a large language model so that it answers queries with reference to a specified set of documents, preferring them over information drawn from its own vast, static training data. This lets LLMs use domain-specific and/or up-to-date information.

Initial setup

Before running any code listed below, you need to install the following dependencies:

pip install chromadb langchain_community

Let’s also assume we have a directory called books containing some documents in plain-text format:

> ls books
declaration_of_independence_of_the_united_states.txt  state_of_the_union.txt ...

You can download this free-to-use text file about the State of the Union to use in this example.
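If you prefer to fetch the file from a script, here is a minimal sketch; the URL is a placeholder, since this page does not specify where the file is hosted:

import os
import urllib.request

# Placeholder URL -- substitute the actual location of the text file
url = "https://example.com/state_of_the_union.txt"

# Download the file into the books directory, creating it if necessary
os.makedirs("books", exist_ok=True)
urllib.request.urlretrieve(url, os.path.join("books", "state_of_the_union.txt"))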

Imports and configuration

Let’s start with all the imports we will need and some configuration parameters:

import os

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_core.messages import HumanMessage

from imagine.langchain import ImagineChat, ImagineEmbeddings

# Full path to our books directory
books_dir = "/path/to/my/books"

# Name of the vector store and the directory under which it will be created
store_name = "my_vector_db"
db_dir = "/path/to/db"

RAG core functions

These are the core functions to perform RAG:

  • create_documents: Creates the documents from the directory containing the text files.

  • create_vector_store: Creates the vector store from the documents.

  • query_vector_store: Queries the vector store and retrieves the relevant documents.

Let’s define all three:

# Create documents from all the files in the directory
def create_documents(books_dir):
    if not os.path.exists(books_dir):
        raise FileNotFoundError(f"The directory {books_dir} does not exist. Please check the path.")

    book_files = [f for f in os.listdir(books_dir) if f.endswith(".txt")]
    
    documents = []
    for book_file in book_files:
        file_path = os.path.join(books_dir, book_file)
        loader = TextLoader(file_path)
        book_docs = loader.load()
        for doc in book_docs:
            doc.metadata = {"source": book_file}
            documents.append(doc)

    text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0, separator='\n')
    docs = text_splitter.split_documents(documents)
    return docs

# Create the vector store from the documents and persist it to disk
def create_vector_store(docs, embeddings, store_name):
    persistent_directory = os.path.join(db_dir, store_name)
    print(f"Persistent directory: {persistent_directory}")
    if not os.path.exists(persistent_directory):
        print(f"\n--- Creating vector store {store_name} ---")
        Chroma.from_documents(docs, embeddings, persist_directory=persistent_directory)
        print(f"--- Finished creating vector store {store_name} ---")
    else:
        print(f"Vector store {store_name} already exists. No need to initialize.")
# Query the vector store, given the store name, query, and embedding function
def query_vector_store(store_name, query, embedding_function, k=2, threshold=0.1):
    persistent_directory = os.path.join(db_dir, store_name)
    if os.path.exists(persistent_directory):
        db = Chroma(persist_directory=persistent_directory, embedding_function=embedding_function)
        
        retriever = db.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": k, "score_threshold": threshold},
        )
        
        relevant_docs = retriever.invoke(query)
        return relevant_docs
    else:
        print(f"Vector store {store_name} does not exist.")
        return []

Ingestion phase

Next, we are going to process each document, break it down into chunks, embed each chunk, and store it in the vector store.

# Parse documents and generate chunks
docs = create_documents(books_dir)
print(f"Number of document chunks: {len(docs)}")
Number of document chunks: 121
# Generate embeddings and persist in the vector store
embeddings_fn = ImagineEmbeddings()
create_vector_store(docs, embeddings_fn, store_name)
Persistent directory: /path/to/db/my_vector_db

--- Creating vector store my_vector_db ---
--- Finished creating vector store my_vector_db ---
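As a quick sanity check, we can reopen the persisted store and count the stored chunks; the count should match the number of document chunks reported above. This step is optional and simply reuses the db_dir, store_name, and embeddings_fn values defined earlier:

import os

from langchain_community.vectorstores import Chroma

# Reopen the persisted store with the same embedding function used to create it
db = Chroma(
    persist_directory=os.path.join(db_dir, store_name),
    embedding_function=embeddings_fn,
)

# Chroma.get() returns the stored records; its "ids" list has one entry per chunk
print(f"Chunks in store: {len(db.get()['ids'])}")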

Query phase

In this phase, given a query, we retrieve relevant documents for that query from the vector store and then send them to the LLM for summarization.

query = "How can I learn more about LangChain?"
relevant_docs = query_vector_store(store_name, query, embeddings_fn, k=3)
print("\n--- Relevant Documents ---")
for i, doc in enumerate(relevant_docs, 1):
    print(f"Document {i}:\n{doc.page_content}\n")

combined_input = (
    "Here are some documents that might help answer the question: "
    + query
    + "\n\nRelevant Documents:\n"
    + "\n\n".join([doc.page_content for doc in relevant_docs])
    + "\n\nPlease provide an answer based only on the provided documents. If the answer is not found in the documents, respond with 'I'm not sure'."
)
--- Relevant Documents ---
Document 1:
LangChain: A Framework for LLM-Powered Applications
LangChain is a powerful and flexible framework designed to simplify the development of applications that harness the capabilities of large language models (LLMs). It provides a wide range of tools, abstractions, and integrations that help developers build, customize, and optimize applications that leverage LLMs for tasks like text generation, question answering, summarization, chatbots, and more.
Key Features and Benefits
Modular Components: LangChain offers a variety of modular components (chains, agents, tools, prompts, memory, etc.) that can be easily combined and customized to build complex LLM-powered workflows.
Data Integration: It seamlessly integrates with various data sources, enabling applications to access and process external information, enhancing the context and relevance of LLM responses.

Document 2:
Agent Frameworks: LangChain provides agent frameworks that allow LLMs to interact with their environment, make decisions, and take actions based on user input or specific goals.
Memory Management: It includes memory components that enable applications to maintain context and track conversations, leading to more coherent and personalized interactions.
Prompt Engineering: LangChain facilitates prompt engineering, the process of crafting effective prompts to elicit desired responses from LLMs, by offering templates and tools for experimentation.
Chain Optimization: It provides mechanisms to evaluate and optimize chain performance, ensuring that applications deliver the best possible results.
Use Cases
LangChain empowers developers to create a wide array of applications, including:
Chatbots and Conversational Agents: Build intelligent chatbots capable of understanding natural language and providing informative responses.

Document 3:
Don't Forget to Like and Subscribe!
If you're looking for in-depth tutorials and insights into LangChain, CrewAI, and other AI technologies, be sure to check out the fantastic YouTube channel by Brandon Hancock:
YouTube Channel: https://www.youtube.com/@bhancock_ai
Don't forget to like and subscribe to his channel!!
# Send to LLM for summarizing
model = ImagineChat(model="Llama-3-8B")
messages = [HumanMessage(content=combined_input)]
result = model.invoke(messages, max_tokens=2048, repetition_penalty=1.1, temperature=0.1, top_k=50, top_p=0.95)
print("\n--- Generated Response ---")
print(result.content)
--- Generated Response ---
Based on the provided documents, here's what I've learned about LangChain:

* LangChain is a framework designed to simplify the development of applications that use large language models (LLMs).
* It provides a range of tools, abstractions, and integrations to help developers build, customize, and optimize LLM-powered applications.
* Key features include modular components, data integration, agent frameworks, memory management, prompt engineering, and chain optimization.
* Use cases for LangChain include building chatbots and conversational agents, as well as other applications such as text generation, question answering, and summarization.

As for learning more about LangChain, the document mentions a YouTube channel by Brandon Hancock (@bhancock_ai) that provides in-depth tutorials and insights into LangChain and other AI technologies.
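As a final note, the retrieval and generation steps above can also be composed into a single reusable pipeline using LangChain's runnable (LCEL) syntax. The following is a minimal sketch, not part of the Imagine SDK itself, that reuses the embeddings_fn and model objects created earlier; the prompt wording is illustrative:

import os

from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Reopen the persisted vector store and expose it as a retriever
db = Chroma(
    persist_directory=os.path.join(db_dir, store_name),
    embedding_function=embeddings_fn,
)
retriever = db.as_retriever(search_kwargs={"k": 3})

# The same instructions as the manually built prompt above, as a template
prompt = ChatPromptTemplate.from_template(
    "Here are some documents that might help answer the question: {question}\n\n"
    "Relevant Documents:\n{context}\n\n"
    "Please provide an answer based only on the provided documents. "
    "If the answer is not found in the documents, respond with 'I'm not sure'."
)

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieval -> prompt -> LLM -> plain-text output
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(chain.invoke("How can I learn more about LangChain?"))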