When it comes to LLM-powered chatbots and information retrieval, many demos and consumer-facing applications showcase the capabilities of Retrieval Augmented Generation (RAG) using a single document or a handful of documents. With a simple drag-and-drop, you can upload a couple of documents and start asking questions about them.
Conceptually, this is what RAG is all about: provide the LLM with some additional information related to the specific query you’re pursuing, allowing it to base its answer on the facts you’ve provided and rely less on the likely incomplete or outdated “world knowledge” from its training data.
While these “Chat with your documents” capabilities are impressive — and very helpful when doing research and writing articles — they often don’t tackle the challenges faced in enterprise use cases. When dealing with the large enterprise repositories of contracts, agreements, policies and documentation, manual upload of individual documents is not sufficient.
It’s tempting to think that we’ll soon be able to just point our favorite consumer chatbot to our entire Dropbox, OneDrive or Google Drive file hierarchy, and the problem will be solved. But as you’ll see below, this is unlikely to work reliably for enterprise data.
This article will highlight a few intricacies of implementing RAG in enterprise scenarios and explore some solutions, including how data management techniques combined with modern vector databases can help overcome these challenges. These are just examples, but will hopefully illustrate why RAG-based solutions most often require custom approaches for data management and AI.
In enterprise environments, documents such as contracts, agreements, and policies often exist in multiple versions due to ongoing revisions, updates, and negotiations. Organizations dealing with higher volumes of such documents might have a proper Content Management System (CMS) with versioning in place. In other places, you might find these documents stored in regular file folders, with simple versioning in the file names indicating the history of revisions. Even more challenging, we sometimes find these documents spread out and duplicated over multiple directories, in potentially different versions, names and formats.
This creates a challenge for RAG systems, which need to identify and retrieve the most relevant or up-to-date version for a given query. Without proper version control or metadata, the system might inadvertently select or mix outdated document versions, most likely leading to inaccuracies or outright errors.
These issues are not due to “hallucinations” we can blame on the LLM; they are issues we caused by feeding the LLM with the wrong information.
In some cases, the solution might be as simple as selecting only the latest “final” version of each document, which is then fed to and indexed by the RAG system. However, managing multiple versions inside the RAG system is sometimes necessary if we want to be able to ask our AI to compare changes to documents over time.
Solution: First, in the majority of cases custom data management is required in order to identify the “latest version” of a policy or the contract in effect at a certain date. Whether documents are stored in file folders, CMS systems or other enterprise applications, the business logic to extract the right documents will need to be defined based on the use cases at hand.
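As a minimal sketch of such business logic, consider documents versioned only through their file names. The filenames and the `_v<N>` naming convention below are hypothetical; a real repository would need logic tailored to its own conventions (and to documents with no version marker at all):

```python
import re
from pathlib import Path

# Hypothetical filenames following a "name_v<N>" convention.
files = [
    "supplier_agreement_v1.docx",
    "supplier_agreement_v2.docx",
    "supplier_agreement_v3_final.docx",
    "nda_acme_v1.docx",
    "nda_acme_v2.docx",
]

def latest_versions(filenames):
    """Keep only the highest version number per base document name."""
    latest = {}
    for name in filenames:
        match = re.match(r"(?P<base>.+?)_v(?P<version>\d+)", Path(name).stem)
        if not match:
            continue  # no version marker: would need separate handling
        base, version = match["base"], int(match["version"])
        if base not in latest or version > latest[base][0]:
            latest[base] = (version, name)
    return [name for _, name in latest.values()]

print(sorted(latest_versions(files)))
# ['nda_acme_v2.docx', 'supplier_agreement_v3_final.docx']
```

Only the surviving files would then be fed to the RAG indexing pipeline.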
Second, inside the RAG application, modern vector databases offer functionalities that can significantly assist in managing chunks of multiple document versions. A good vector database can store metadata alongside vector embeddings, allowing for efficient version control of chunks with version numbers and timestamps.
Effective use of metadata in vector databases enables the application to identify and retrieve only the chunks of a specific version of a document.
For example, the Pinecone vector database offers ID prefixes to support versioning and more, whereas Weaviate similarly supports metadata properties such as IDs and timestamps. This metadata can then be used in filters (alongside vector similarity) to retrieve the relevant chunks to feed the LLM.
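The pattern is the same regardless of vendor: restrict the candidate set with a metadata filter, then rank by similarity. The toy in-memory index below illustrates this with hand-made 3-dimensional vectors; a real system would store model-generated embeddings in Pinecone, Weaviate or similar, and use their native filter syntax instead:

```python
import math

# Toy index: each entry holds an embedding plus version metadata.
index = [
    {"id": "c1", "vec": [0.9, 0.1, 0.0], "doc_id": "policy-7", "version": 2},
    {"id": "c2", "vec": [0.8, 0.2, 0.1], "doc_id": "policy-7", "version": 3},
    {"id": "c3", "vec": [0.1, 0.9, 0.3], "doc_id": "policy-9", "version": 3},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def query(vec, metadata_filter, top_k=2):
    """Apply the metadata filter first, then rank survivors by similarity."""
    candidates = [e for e in index
                  if all(e.get(k) == v for k, v in metadata_filter.items())]
    return sorted(candidates, key=lambda e: cosine(vec, e["vec"]),
                  reverse=True)[:top_k]

# Only version-3 chunks are eligible, however similar older chunks may be.
hits = query([1.0, 0.0, 0.0], {"version": 3})
print([h["id"] for h in hits])  # ['c2', 'c3']
```

Note that the outdated chunk `c1` can never be returned, even though it is the closest match by raw similarity.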
Contracts and agreements frequently use standardized language with generic terms like “Buyer” and “Seller.” This standardization can lead to challenges in distinguishing between different document chunks within a RAG system. Similar wording across multiple documents can result in indistinguishable chunks (even though they originate from completely different contracts), complicating the retrieval of the correct context for the LLM.
In the example below, we have three purchasing contracts, for three different clients, but the wording is identical except for the amounts and duration of this specific liability clause. If only a basic character-based chunking with overlap is done, the resulting chunks could become as shown in orange and green. Without any indication of which Buyer and which Seller each chunk is discussing, it’s virtually impossible for the LLM to generate an accurate answer if fed the 6 chunks shown (which might all match a question about “limited liability terms for partner ABC”).
Solution: Central to any solution is to ensure each chunk is unique and references the actual relevant entities (Buyer and Seller, in the case above). This can be achieved in a variety of ways; here are a few approaches:
- The title or name of the (contract) document can be infused as a pre-amble in each chunk (assuming that the document name or title accurately identifies the Buyer and Seller).
- The clause itself can be modified by replacing generic references like “Buyer” and “Seller” with the actual names of the companies. This might be very effective, but it’s not always ideal to generate answers based on modified source documents.
- Definitions such as “Buyer = Company ABC, Seller = Company XYZ” can be added as a prefix to the clause or stored as metadata associated with the chunks, similar to how versioning is handled.
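Combining the first and third approaches above, a hypothetical `build_contextual_chunks` helper could prefix each chunk with the document title and party definitions before embedding and indexing (names and clause text below are invented for illustration):

```python
def build_contextual_chunks(title, parties, chunks):
    """Prefix each chunk with a preamble identifying the document and parties."""
    preamble = f"Document: {title}. " + " ".join(
        f"{role} = {name}." for role, name in parties.items()
    )
    return [f"{preamble}\n{chunk}" for chunk in chunks]

chunks = build_contextual_chunks(
    title="Purchase Agreement ABC-XYZ",
    parties={"Buyer": "Company ABC", "Seller": "Company XYZ"},
    chunks=["The Buyer's liability shall be limited to $50,000..."],
)
print(chunks[0].splitlines()[0])
# Document: Purchase Agreement ABC-XYZ. Buyer = Company ABC. Seller = Company XYZ.
```

With this preamble embedded alongside the clause text, otherwise identical liability clauses from different contracts produce distinct vectors and remain distinguishable at retrieval time.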
These approaches fall roughly in a category of methods called “contextual retrieval” in which context is added to chunks to make them easier to match up against a specific user query. For a more general overview of this, see this recent article on contextual retrieval from Anthropic.
All these solutions have pros and cons, and the best one will likely vary case by case. The key lesson here is that while vector databases are great at capturing semantic relationships and similarities between words and phrases, the identical wordings often found in contracts and policies can cause problems.
Ensuring that the vector database reflects the latest version of documents is crucial in a RAG system. Inconsistent data can lead to discrepancies in information retrieval, affecting the reliability of the RAG system.
It is often necessary to continuously update the database as new documents are added, existing documents are modified, or documents are deleted. Maintaining consistent data can be challenging, particularly with frequent updates of large amounts of information, where batch processing is insufficient.
Solution: Vector databases generally support solving these challenges by offering synchronization mechanisms based on metadata tagging. Each document chunk stored in the vector database is tagged with metadata that links it to its parent document, which in turn can be tagged with a reference to the external document.
By leveraging these capabilities, it’s possible to trigger updates upon modification to external document repositories, identifying the document (chunks) in the vector database that corresponds to the document that was modified. A typical resolution involves deleting a set of chunks, and adding fresh chunks from the updated document. This approach maintains data integrity and enables enterprises to effectively manage dynamic environments with frequent document changes.
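The delete-and-reinsert resolution can be sketched as follows, assuming each stored chunk carries a hypothetical `parent_doc_id` field in its metadata (the dict below stands in for a real vector database):

```python
# Toy chunk store keyed by chunk ID; values are chunk metadata.
index = {
    "chunk-1": {"parent_doc_id": "contract-42", "text": "old clause A..."},
    "chunk-2": {"parent_doc_id": "contract-42", "text": "old clause B..."},
    "chunk-3": {"parent_doc_id": "contract-99", "text": "unrelated clause..."},
}

def refresh_document(index, doc_id, new_chunks):
    """Replace every chunk belonging to doc_id with freshly split chunks."""
    stale = [cid for cid, meta in index.items()
             if meta["parent_doc_id"] == doc_id]
    for cid in stale:
        del index[cid]
    for i, text in enumerate(new_chunks):
        index[f"{doc_id}-chunk-{i}"] = {"parent_doc_id": doc_id, "text": text}

# Triggered when the external repository reports contract-42 was modified.
refresh_document(index, "contract-42", ["revised clause A", "revised clause B"])
print(sorted(index))
# ['chunk-3', 'contract-42-chunk-0', 'contract-42-chunk-1']
```

Chunks from unrelated documents are untouched, so the refresh stays proportional to the size of the modified document rather than the whole corpus.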
In summary, while LLMs and Retrieval-Augmented Generation (RAG) offer impressive capabilities as demonstrated through consumer-facing applications like ChatGPT, Claude and Gemini, leveraging RAG in enterprise environments presents unique challenges.
These challenges include managing multiple document versions, handling similar wording across documents, and ensuring information is up-to-date through data synchronization. Enterprises need to adopt suitable data management strategies and leverage modern vector databases to address these complexities effectively.
Key solutions involve custom data management to identify the most relevant document versions, and using the metadata capabilities of vector databases to tag document chunks for efficient retrieval and dynamic refresh.