Welcome to our blog series on Retrieval Augmented Generation (RAG). This series is a practical guide to building a production-ready RAG system, focusing on how to select the right tools and how we built our own production-ready RAG application.
RAG is one of the most popular architectures for reducing factual hallucinations in Large Language Models (LLMs) and for improving generation in domain-specific applications. The main components of a RAG system are the LLM, the embeddings model, and the vector store.
Most of the attention usually goes to the choice of LLM: should you use a pre-trained model as-is, fine-tune it, or train one from scratch? In practice, the LLM mostly summarizes the results the vector store retrieves for a given query. If retrieval is poor, the LLM's response won't be accurate no matter how good the model is. It is therefore important to focus on the retrieval side first and experiment with the LLM later.
This brings us to the vector store and the embeddings. The choice of vector store comes down to performance and customization, but if the embeddings are not generated well, the choice of vector store won't matter. For most production systems, general-purpose sentence embeddings are not enough: business use cases are highly domain-specific and require a detailed understanding of the context. That is why embeddings are the first component to focus on. The more domain-specific the embeddings, the better the text representation and the better the retrieval results.
To create domain-specific embeddings, you need both domain-related data and unrelated data: the embeddings model must learn which text is context-specific and which isn't. During fine-tuning, positive samples are pulled closer together in the embedding space and negative samples are pushed further apart, which is how the domain context gets encoded. A minimal sketch of this kind of contrastive fine-tuning is shown below.
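As an illustrative sketch only: the blog does not specify the base model, the loss function, or the training data, so the model name, the choice of MultipleNegativesRankingLoss, and the example pairs below are assumptions. With the sentence-transformers library, contrastive fine-tuning could look roughly like this:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from a general-purpose sentence embeddings model (assumed choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical (anchor, positive) pairs drawn from domain documents.
train_examples = [
    InputExample(texts=["What is the claim filing deadline?",
                        "Claims must be filed within 30 days of the incident."]),
    InputExample(texts=["How do I reset my policy password?",
                        "Use the self-service portal to reset your policy password."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss pulls each pair together and treats the other
# in-batch examples as negatives, pushing unrelated text further apart.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100)
```

In practice the training pairs would be mined from the domain corpus itself, but the mechanism is the same: related text is rewarded for landing close together, unrelated text for landing far apart.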
Once the data is prepared, it is ingested into a Delta table using Auto Loader, a Databricks feature that uses Structured Streaming to load files incrementally. A Hugging Face sentence embeddings model is then fine-tuned on this data to produce the domain-specific embeddings. MLflow is used to capture the training logs and to register the model, which simplifies model management and deployment: once registered, MLflow models can be deployed directly as APIs on the Databricks Model Serving platform.
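As a rough sketch of these two steps (the paths, table names, file format, and registry name below are hypothetical, and the `mlflow.sentence_transformers` flavor assumes a recent MLflow version), incremental ingestion with Auto Loader could look like:

```python
# Incrementally ingest new raw files into a Delta table with Auto Loader.
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")                        # assumed source format
      .option("cloudFiles.schemaLocation", "/mnt/schema/domain_docs")
      .load("/mnt/raw/domain_docs")                               # hypothetical landing path
      .writeStream
      .option("checkpointLocation", "/mnt/checkpoints/domain_docs")
      .trigger(availableNow=True)
      .toTable("bronze.domain_docs"))                             # hypothetical Delta table
```

And once the model is fine-tuned, logging and registering it with MLflow might look like:

```python
import mlflow

# Log the fine-tuned model and register it, so it can be deployed as an API
# on the Databricks Model Serving platform.
with mlflow.start_run():
    mlflow.sentence_transformers.log_model(
        model,                                      # the fine-tuned SentenceTransformer
        artifact_path="embeddings_model",
        registered_model_name="domain_embeddings",  # hypothetical registry name
    )
```

After registration, the model can be attached to a Model Serving endpoint through the Databricks UI or API.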
LangChain is a popular library for building LLM apps. To use LangChain's retrieval components, the embeddings must be a LangChain Embeddings object. Because the custom embeddings model is exposed as a deployed API, it cannot be used for retrieval directly. Instead, the base Embeddings class must be subclassed and its methods overridden to call the API. The code below shows how to do this; once the object is created, it can be passed directly as the embedding function for the vector store's retriever.
```python
from typing import List

from langchain.embeddings.base import Embeddings


class MyEmbeddings(Embeddings):
    """Wraps the deployed embeddings endpoint as a LangChain Embeddings object."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Embed each document by calling the deployed model serving endpoint.
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # custom_embeddings_api calls the deployed embeddings API and
        # returns the embedding vector for the given text.
        return custom_embeddings_api(text)
```
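For example, wiring the custom embeddings into a retriever might look like the following. The blog defers the vector store decision to the next part, so FAISS and the sample documents here are purely illustrative assumptions:

```python
from langchain.vectorstores import FAISS

# Hypothetical domain text chunks to index.
documents = [
    "Claims must be filed within 30 days of the incident.",
    "Use the self-service portal to reset your policy password.",
]

embeddings = MyEmbeddings()
vector_store = FAISS.from_texts(documents, embedding=embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

relevant_docs = retriever.get_relevant_documents("What is the claim filing deadline?")
```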
The quality of a RAG system depends heavily on the embeddings, because they transform raw text into vectors; if the vector representation is poor, the overall performance of RAG won't be great. That is why it is crucial to build custom embeddings that capture the domain context. The Databricks MLflow platform provides a clean way of deploying most machine learning model flavors, which makes it quick to build production-ready model APIs. Once a model API is available, it can be consumed by any service; in this example, it was consumed by the LangChain embeddings class and used for retrieval. With the embeddings ready, the decision about the vector store can be made. The next part of this blog will cover the full implementation of the RAG architecture.
Text embedding models | 🦜️🔗 Langchain
What is Auto Loader? - Azure Databricks | Microsoft Learn
Model serving with Azure Databricks - Azure Databricks | Microsoft Learn
sentence-transformers (Sentence Transformers) (huggingface.co)