How to Build and Deploy Private Custom Chatbots
This article focuses on the private deployment of open-source LLMs. If you are okay with using OpenAI models, check out this article instead.
With AI on the rise, businesses everywhere are integrating it into their pipelines. LLMs like GPT-4, LLaMA, and Mistral have proven highly effective at cutting down redundant work and the time it takes to finish simple tasks. Simply using ChatGPT can meaningfully boost employee productivity across a wide range of tasks.
However, a major problem companies have with ChatGPT is privacy and the handling of sensitive data. Even though OpenAI released ChatGPT Enterprise, many industries cannot share their data with any third party. The military and healthcare, for example, handle a great deal of sensitive and often classified information, and that is why they need to deploy GPT-4-class LLMs privately.
In this article, you will learn how to build and deploy private, secure LLMs in production environments. We will cover open-source and private vector databases, private LLM deployment, and how you should design your API schema. We have years of experience deploying LLMs and other models to cloud servers, and everything we have learned is collected in this article.
Why build and deploy a private chatbot LLM?
Private deployment of an LLM can be messy and difficult to maintain, so there must be good reasons to undertake this project. Here are some good reasons to deploy an LLM in a private cloud instead of going with the OpenAI APIs:
Data Security
This is perhaps the most common reason people go with private deployments. As mentioned earlier, there are industries that cannot afford to share their data at all. For example, the military cannot use ChatGPT for work purposes because its information is highly sensitive and often classified. The same goes for healthcare: even though OpenAI claims to be compliant, most medical professionals cannot afford to share such sensitive information with another company.
This is where local and private deployments come in. Deploying an LLM privately lets you automate work like summarization and other routine tasks without ever sharing the data with another company.
Compliance
Many companies simply are not allowed to share their data, usually for compliance reasons. Industries like financial services and law enforcement cannot hand documents to third parties. Yet the finance industry, for example, could benefit enormously from these AI applications, such as extracting insights from complicated SEC filings.
A self-hosted LLM removes these issues entirely: no data is shared with any third party, and you can track exactly how your data is used and how the LLM responds to it.
Finetuning
Another big reason, beyond data security and privacy, is more fine-grained control of the model itself. Even though models like GPT-4 and GPT-3.5 are highly capable, they cost a lot and can perform poorly on narrow, domain-specific tasks. This is where you would want to finetune a model. OpenAI now allows you to finetune GPT models, but the costs can get very high, very fast.
But if you finetune a LLaMA model, you decide entirely how the model works and behaves. Finetuning an open-source LLM can easily boost performance without any increase in cost, since you are still hosting the same model; only the weights change, not the size. If anything, finetuning lets you use an even smaller model, because you are training it for one narrow task instead of many.
Costs
Lastly, and perhaps most importantly, cost. If you are going to use an LLM for your business, starting with third-party APIs makes sense because it is easy to develop and test. But as you start gaining users the bill grows quickly, and because every chat turn re-sends the growing conversation history, costs rise faster than the number of messages.
One can finetune a LLaMA model and deploy it in a server or serverless setting to save a great deal on deployment costs. Anyscale did a comprehensive study of LLaMA vs GPT-4 costs and found that LLaMA can be roughly 30x cheaper than GPT-4. You can read more here. Here’s a detailed statistic from the post:
How to deploy a chatbot in private infrastructure?
Now that we know some good reasons to deploy chatbots and LLMs on private cloud servers, let’s talk about how exactly you can do that. We will discuss how to deploy LLMs and pair them with open-source embeddings so they can work with your own data.
Model
The model is the most important part of the stack, and the right choice depends largely on your needs and the tasks you need it to perform. At the time of writing, Mistral-7B is the strongest model of its size that we know of: it has 7 billion parameters yet outperforms most of the 13-billion-parameter models out there. Many models like OpenHermes2 are built on top of it, finetuned further on even more tasks to make it better still. You can go to our comparison of LLMs blog to learn more about these models and decide which one you want to use.
Our general rule of thumb is to use smaller models like Mistral if you only need to prompt the model for a single one-off output, such as generating a summary or translating text, because no conversation history needs to be provided to the LLM. When dealing with much longer contexts, we tend to use bigger models like LLaMA-2-13B, since larger models handle long contexts and multi-turn conversations much better than smaller ones.
Embeddings
Embeddings are vector representations of text. These representations are learned by models during training, or a model can be trained specifically to produce them. Embedding vectors have the useful property that texts with similar context or meaning end up much closer to each other in the vector space than texts with different meanings.
This means that if one text talks about earthquakes and another is about underwater vibrations, the two will sit very close to each other in the embedding space, whereas a text about YouTube videos will be far away from both.
These embeddings will help us find relevant documents to answer user queries.
SBERT Open-Source Embeddings
We use SBERT models to generate embedding vectors when working in a private infrastructure. Using OpenAI embeddings can be helpful in a development environment, but for private and secure settings we use SBERT embeddings.
Some of these models are as small as 90MB. This makes them very good for deployment and fast inference in production environments.
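As a minimal sketch of what this looks like in practice (assuming the sentence-transformers package and the small all-MiniLM-L6-v2 checkpoint; swap in whichever SBERT model fits your domain):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Load a small SBERT model locally; nothing is sent to a third party.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "An earthquake shook the coastal region this morning.",
    "Seismographs detected strong underwater vibrations.",
    "How to grow a YouTube channel from zero subscribers.",
]

# Encode the documents and a query into dense vectors.
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode("recent seismic activity", normalize_embeddings=True)

# Cosine similarity: related texts score much higher than unrelated ones.
print(util.cos_sim(query_embedding, doc_embeddings))
```

The earthquake and underwater-vibration documents will score well above the YouTube one, which is exactly the behavior we rely on for retrieval.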
How to embed Images?
To process and store images as vector embeddings, we can embed the image’s description text. If that is missing, the next best method is to use the text surrounding the image as its description, since an image is usually tightly connected to the text just before and after it.
If you have a ton of images, CLIP models can also be a viable option. We wrote a detailed blog on how to search images using embeddings.
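For illustration, here is a rough sketch of embedding an image with a CLIP checkpoint through sentence-transformers (the clip-ViT-B-32 model name is one common choice, and the file path is a placeholder):

```python
# pip install sentence-transformers pillow
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into the same embedding space.
clip = SentenceTransformer("clip-ViT-B-32")

image_embedding = clip.encode(Image.open("factory_floor.png"))   # placeholder path
caption_embedding = clip.encode("a photo of an assembly line")

# Higher cosine similarity means the caption matches the image better.
print(util.cos_sim(image_embedding, caption_embedding))
```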
Private LLM hosting using vLLM
Now that embeddings are taken care of, we can move on to the trickiest part of the process: deploying the LLM. This gets messy quickly, simply because there are so many settings and small details to control. There are many options out there for optimizing the deployment of larger models, from Microsoft to Nvidia, and almost all of the major vendors provide deployment tooling.
vLLM is an open-source project that implements efficient serving for many models and gives you full control over deployment. We use vLLM because it is completely open source and offers the best performance we have seen: a massive boost over plain HuggingFace implementations and more than a 2x boost over the Text Generation Inference project.
Here is a performance comparison for both LLaMA-13B and 7B models:
You can read more about vLLM and how they implement fast inference here.
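As a rough sketch of what serving a model with vLLM looks like (assuming the vllm package, a CUDA GPU, and the Mistral-7B-Instruct checkpoint from Hugging Face; use whichever model you settled on above):

```python
# pip install vllm  (requires a CUDA-capable GPU)
from vllm import LLM, SamplingParams

# Load the model once; vLLM handles batching and KV-cache paging internally.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

sampling = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the key risks in the attached quarterly report:",
    "Translate 'data stays on our servers' into French:",
]

# generate() batches the prompts together for better GPU utilization.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```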
Private Vector Databases
Along with private LLM hosting, we also need a vector database, and it should be hosted privately too, since that is where most of the core data lives. We have tried multiple databases in our stack, including Chroma, Weaviate, and Milvus, and in our experience Milvus is the best choice.
Milvus is easy to deploy with its Docker containers and easy to maintain. Its microservice architecture means it integrates with your stack without issues, and it offers straightforward vertical and horizontal scaling. It is specifically optimized for large-scale applications and stays fast even under heavy load.
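As a minimal sketch with pymilvus (the collection name, field names, and the 384-dimension assumption for MiniLM-style embeddings are ours, and a Milvus instance is assumed to be running on localhost), indexing and searching looks roughly like this:

```python
# pip install pymilvus  (assumes a Milvus instance on localhost:19530)
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

connections.connect(host="localhost", port="19530")

# Schema: an auto-generated id, the raw text chunk, and its embedding vector.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
]
collection = Collection("private_docs", CollectionSchema(fields))

# Index for fast approximate search; IP works well with normalized embeddings.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}},
)

# chunk_texts / chunk_embeddings come from the SBERT step above (384-dim vectors).
collection.insert([chunk_texts, chunk_embeddings])
collection.load()

# Retrieve the 3 most similar chunks for a query embedding.
hits = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=3,
    output_fields=["text"],
)
```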
Milvus uses a highly scalable architecture underneath that allows very fast queries even across trillions of embedding vectors. Here’s a diagram of Milvus’s internal architecture:
Deployment options
Now that we have covered all the components, let’s discuss the deployment options. There are many, such as AWS, Modal, and Runpod. Whether you go serverless or dedicated depends entirely on how much you are willing to spend and how many users you will have. If a large number of users will be hitting the LLM in real time, serverless makes the most sense, since you will want auto-scaling. If you run inputs in batches instead, both serverless and a dedicated GPU server make sense.
A dedicated GPU usually makes sense if you know roughly how many requests you will serve per hour or minute and can provision for that. The trade-off is that when requests exceed expectations, you will have to either throttle users or batch their requests together.
API Design for LLM Deployment
One more thing to consider is the API design itself. It seems easy to handle and build, but in large projects these decisions matter a lot, and even more so when you are integrating the API with other products. Here are some important factors we have learned largely through experimentation:
Streaming with WebSockets
One very important feature, which you now see pretty much everywhere, is streaming responses: sending tokens to the frontend as soon as the LLM starts generating them, instead of waiting for the full response.
Streaming paired with a WebSocket connection can massively improve the responsiveness of your application. A plain REST API works, but the connection overhead between requests is inefficient and can noticeably slow the system down. Streaming is supported by vLLM out of the box, so this is not something you have to implement yourself.
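Here is a rough sketch of relaying vLLM’s streamed tokens over a WebSocket with FastAPI (the /chat endpoint path, the model name, and the assumption that vLLM’s OpenAI-compatible server is running on localhost:8000 are ours):

```python
# pip install fastapi uvicorn openai
from fastapi import FastAPI, WebSocket
from openai import OpenAI

app = FastAPI()

# Points at the local vLLM OpenAI-compatible server, not at OpenAI.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

@app.websocket("/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    user_message = await websocket.receive_text()

    # stream=True yields tokens as soon as the model produces them.
    stream = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.1",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            await websocket.send_text(delta)  # forward each token to the frontend
    await websocket.close()
```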
Session/State Management
State management is also a big issue. Our go-to decision is to store as much as possible on the backend rather than on the client, simply because doing so gives us far more control. Both the text files used as context and the messages are always stored on the backend side of the application.
One big reason to do this is to limit a malicious user’s control over the application. If messages are stored on the client, a user can edit the conversation context and the prompts fed into the model, which opens the door to prompt injection attacks. If the messages are stored on the backend instead, the attack surface shrinks to the current user message being fed to the model.
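A simple sketch of keeping conversation state on the backend (the in-memory dict and session-id scheme are placeholders; a real deployment would persist this in a database keyed per user):

```python
# Conversation history lives on the server, keyed by session id.
from collections import defaultdict

sessions: dict[str, list[dict]] = defaultdict(list)

def handle_user_message(session_id: str, user_message: str) -> list[dict]:
    """Append the new message and build the full prompt server-side.

    The client only ever sends its latest message, so it cannot rewrite
    earlier turns or the system prompt (limiting the prompt-injection surface).
    """
    history = sessions[session_id]
    if not history:
        history.append({"role": "system", "content": "You are a helpful internal assistant."})
    history.append({"role": "user", "content": user_message})
    return history  # pass this list to the LLM as the `messages` payload
```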
Schema
The schema is also an important component of the whole stack. OpenAI’s Chat Completions schema is arguably the best one out there; you can check it out here, and it is the schema we suggest everyone use. vLLM supports this API schema out of the box, so there is no need to implement it yourself either.
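For reference, a request in this schema against a locally running vLLM server looks roughly like the following (the localhost endpoint and model name reflect how we typically launch vLLM’s OpenAI-compatible server, so treat them as assumptions):

```python
# Launch the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
# pip install requests
import requests

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [
        {"role": "system", "content": "You are a helpful internal assistant."},
        {"role": "user", "content": "Summarize our data-retention policy in two sentences."},
    ],
    "temperature": 0.2,
    "max_tokens": 200,
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```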
Encryption
Encryption is the most important part of the whole stack when building data-sensitive or private applications. We always store all sensitive data on the backend and encrypt it properly, and all messages between the server and the users are fully encrypted in transit as well. This adds some complexity to the stack, but it is necessary for building and deploying a privately hosted LLM.
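As one illustrative option for encrypting stored messages at rest (a sketch using the cryptography package’s Fernet; transport encryption is handled separately by TLS, and key management depends on your infrastructure):

```python
# pip install cryptography
from cryptography.fernet import Fernet

# In production, load the key from a secrets manager; never hard-code it.
key = Fernet.generate_key()
fernet = Fernet(key)

message = "Patient follow-up scheduled for next Tuesday."
encrypted = fernet.encrypt(message.encode())    # store this in the database
decrypted = fernet.decrypt(encrypted).decode()  # decrypt only when building prompts

assert decrypted == message
```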
Challenges
Even with all this, deploying your own LLM without much experience can be very complicated. There are multiple moving parts, and almost all of them matter. Let’s discuss the main challenges in more detail.
Costs
Costs are a big issue in any kind of deployment, and especially for compute-intensive LLM deployments. Our suggestion is to go with serverless platforms like Modal or Runpod. They can provide big savings at the cost of added complexity: serverless is harder to manage, but it is usually worth it when implemented properly.
GPU Server Setup
If you are using a GPU server, the project can become complicated very quickly. GPUs are tricky to handle, and when paired with large models and the need for fast deployments they can become a real bottleneck. There are many options to consider and settings to control, from GPU compute capability to VRAM, and all of them matter. We suggest using a prebuilt library like vLLM to reduce GPU-related issues.
Usage Estimations
Usage estimation depends completely on your application and users, but with LLM-based applications the calculations can be genuinely complicated, because usage depends heavily on prompts and prompt lengths. The place most people make mistakes is estimating prompt length and the number of messages in a chat.
A 50-token message is not a 50-token input to your LLM; it is 50 tokens plus all previous messages, so prompt size and compute cost grow faster than linearly as the conversation gets longer.
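A quick back-of-the-envelope sketch makes the growth obvious (the 50-token message size and 20-turn conversation are just illustrative numbers):

```python
tokens_per_message = 50
turns = 20

total_input_tokens = 0
for turn in range(1, turns + 1):
    # Each turn re-sends the entire history plus the new message.
    prompt_tokens = turn * tokens_per_message
    total_input_tokens += prompt_tokens

print(prompt_tokens)       # 1000 tokens in the final prompt alone
print(total_input_tokens)  # 10500 input tokens processed across the conversation
```

So even though each individual message is only 50 tokens, the final prompt is 20x that size, and the total compute over the conversation is more than 10x what a naive per-message estimate would suggest.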
Want to Build and Deploy a Private GPT-4-like LLM?
Many businesses are deploying LLMs privately within their own ecosystems to keep their data in-house. If you are looking for similar data-secure solutions, reach out to us! We have years of experience developing AI applications and have deployed multiple LLMs in different environments. Just reach out and we will be happy to help!