RAG Architecture Pattern Explained

By David Pitt

The AI revolution is well underway, and the potential is unavoidably obvious when you use a chat app like ChatGPT by OpenAI. While machine learning has been around for years, the convergence of Moore’s Law and the invention of the GPT algorithm served as a pivotal catalyst, sparking global fascination. ChatGPT allowed everyone to see the power of AI. It’s a game-changing app that is redefining the future of automation.

Other killer, game-changing applications of the past include the spreadsheet and word-processing apps of the '80s and '90s. Some may add relational SQL databases to the list, though they are more of a technology than an app. The next truly game-changing app was the web browser in the late '90s, and it's still prevalent today.

When these apps came out, people flocked to take advantage. In the present day, we see the same thing happening with OpenAI’s ChatGPT or, more specifically, the Generative Pre-Trained Transformer algorithm driving AI chatbots and agents. It’s a powerful tool, so businesses are eager to harness its power.

The Challenges of Implementing Custom AI Models

Unfortunately, there's a big hurdle that stands in the way of businesses wanting to use AI themselves: the computing resources required to train and produce a Large Language Model (LLM).

Let me break it down. A trained GPT-3 model has 175 billion parameters. Those parameters are the learned weights of its attention and feed-forward layers, effectively a set of giant matrices holding 175 billion values in total. Clearly, a data structure this large cannot be completely memory resident on a single machine, so partitioning and sharding strategies are used to make it available. The training data is essentially all the content available on the public web, so you can see that training is computationally and energy-intensive.

A commercially available LLM like GPT-3 is trained on a supercomputer and can take weeks or months to train from scratch. This means the data in a trained model is not real-time. To demonstrate, I asked ChatGPT the question below. Check out the response.

[Image: AI prompt and response without RAG]

Additionally, commercial LLMs do not have access to enterprise or private data, so this knowledge will not get trained into the model. As an example, here’s what ChatGPT answers when prompted for Keyhole Software’s vacation policy.

[Image: AI and LLM response without RAG]

Since Keyhole’s data (i.e. employee handbook) is not available publicly, it can’t be ingested by the GPT model training process, which explains the generic response from ChatGPT.

How, then, can an organization utilize this game-changing AI technology? One option is to implement its own GPT-based LLM and include its private data sources in the training process. However, as you might assume from the discussion above, a custom LLM is a very expensive and time-consuming endeavor.

This is where the Retrieval Augmented Generation (RAG) architectural pattern comes into play.

What is RAG Architecture?

Retrieval-Augmented Generation (or RAG for short) is an AI architecture pattern that combines a retrieval module or source with a pre-trained generative AI model. The retrieval module searches for and retrieves relevant information from a knowledge base, database, or external documents. Then, once the retrieval module provides the context, a generative AI model (like OpenAI’s GPT or similar large language models) uses that context to generate a natural language response or complete a task.


RAG makes it easier for businesses to implement AI by reducing the need for extensive training or custom models. Instead of building and training a model from scratch, businesses can use their existing knowledge bases—like databases, FAQs, or internal documents—to provide accurate, real-time information through a pre-trained AI model. This approach is cost-effective, fast to deploy, and adaptable. It’s an ideal solution for companies that need domain-specific expertise without the complexity and expense of traditional AI development.

Businesses in industries like healthcare, finance, customer service, and tech can especially benefit from RAG. It’s perfect for those with frequently changing information or large repositories of data, as updates can be made to the knowledge base without retraining the AI. Whether it’s building a chatbot for customer support, streamlining internal processes, or creating an intelligent search tool, RAG enables companies to deploy tailored AI solutions quickly and efficiently, helping them stay competitive and meet the growing demand for intelligent automation.

RAG Design

Assuming an organization has API access to a GPT-based LLM like ChatGPT, a custom chatbot UI can be created and augmented with organizational data elements, such as documents, images, wikis, emails, and other structured or unstructured data sets. However, to make this work effectively, the organization's data must be stored or made accessible in a searchable format. This ensures that relevant information can be retrieved and used to provide context-specific responses.

The retrieved context-based data is included in a prompt sent to the GPT-based LLM, making it part of the chat’s context. The LLM then uses this context to generate a more accurate and relevant response.
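To make this flow concrete, here is a minimal sketch in Python. It assumes the OpenAI Python client for the LLM call; the `search_documents()` helper, the model name, and the prompt wording are all hypothetical placeholders for whatever retrieval mechanism and model an organization actually uses.

```python
# Minimal RAG flow sketch (illustrative only).
# Assumes the OpenAI Python client (pip install openai) and an OPENAI_API_KEY
# in the environment. search_documents() is a hypothetical stand-in for
# whatever text/semantic/vector search the organization has in place.
from openai import OpenAI

client = OpenAI()

def search_documents(query: str) -> list[str]:
    """Hypothetical retrieval step: return the most relevant text snippets
    from the organization's knowledge base for this query."""
    raise NotImplementedError("Plug in your own search mechanism here")

def answer_with_rag(question: str) -> str:
    # 1. Retrieve context relevant to the user's question.
    snippets = search_documents(question)
    context = "\n\n".join(snippets)

    # 2. Augment the prompt with the retrieved context.
    messages = [
        {"role": "system",
         "content": "Answer using only the provided company context. "
                    "If the context does not contain the answer, say so."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

    # 3. Let the pre-trained LLM generate the response.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=messages,
    )
    return response.choices[0].message.content
```

The important part is step 2: whatever the search mechanism returns is simply pasted into the prompt, so the model can ground its answer in that context rather than in its stale training data.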

Here’s a simple conceptual diagram of this architecture.

[Image: RAG architecture diagram]

This diagram is oversimplified, but it should give you a rough idea of the concept. The true complexity of RAG is in the “Search Data” requirement. We’ll dive deeper into Search Data later on, but first, let’s take a look at an example.

RAG Example – You Can Try This Out!

The example below demonstrates the utility RAG can provide with nothing more than access to a trained LLM, in this case, ChatGPT. The following prompt asks about Keyhole Software's vacation policy but augments the question with information about that policy.

A quick disclaimer: this is not Keyhole's actual vacation policy. In a true RAG environment, the vacation policy would come from a data search of the organization's human resources documents, handbook, portal, etc. I made up this policy for the sake of brevity in our example.
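For illustration, an augmented prompt along these lines does the trick (the wording and policy details below are invented for this article, not a transcript of the screenshot):

```text
Use the policy text below to answer the question that follows.

Policy (example only, not Keyhole's real policy): Employees accrue 15 days
of paid vacation per year, available after 90 days of employment. Requests
are submitted through the employee portal at least two weeks in advance.

Question: What is Keyhole Software's vacation policy?
```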

Submitting the augmented prompt to ChatGPT produces this.

With this context established, further questions can be asked.

[Image: RAG-enhanced AI prompt and response]

You can see how augmenting ChatGPT with enterprise-specific data opens the door to making it more relevant and useful for an organization. Engineering firms could augment with specifications, pricing, and other data; healthcare companies could introduce anonymized patient information; and the list goes on.

The Importance of Data Search

Now, let’s talk about Data Search. Data Search is the process of retrieving relevant information from a structured or unstructured knowledge base to enhance the output of a generative AI model. It acts as the “retrieval” component in the RAG pipeline, ensuring that the generative AI has access to accurate and contextually appropriate data when generating responses.

Data Search is a critical component that bridges the gap between static, pre-trained AI models and the dynamic, real-world data businesses rely on.

Of course, an internal-data-only LLM could be trained to fill this requirement, but most organizations don't have enough data to make that feasible. In any case, an internal-data-only LLM would probably be overkill and expensive to stand up and maintain. So, here are some alternative options.


Text Search

Text Search is the process of finding and retrieving relevant text from a knowledge base to provide context. It works by analyzing a user’s query and locating the most pertinent text using techniques like keyword matching. This retrieved text is then incorporated into the AI’s prompt, ensuring that responses are accurate, grounded in factual information, and relevant to the query.

Many solutions exist for this; Apache Lucene is a good example. A Lucene search engine indexes data and provides a robust way to query and search for relevant data elements. Text Search works well for more structured data found in data repositories.
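Lucene itself is a Java library with far more sophistication, but the keyword-matching idea at the heart of Text Search can be sketched in a few lines of Python (a toy relevance scorer, purely for illustration):

```python
# Toy keyword search: score each document by how many query terms it contains.
# This only illustrates the keyword-matching idea behind Text Search; a real
# deployment would use Lucene, Elasticsearch, OpenSearch, or similar.
import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    query_terms = tokenize(query)
    scored = [
        (len(query_terms & tokenize(doc)), doc)  # term overlap = relevance score
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

docs = [
    "Vacation requests are submitted through the HR portal.",
    "The office closes early the day before major holidays.",
    "Expense reports are due by the 5th of each month.",
]
print(keyword_search("how do I request vacation?", docs))
```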

Semantic Search

Semantic Search takes Text Search a step further. It attempts to provide more accurate relevant data based upon “semantic context,” rather than relying on matching keywords and phrases found in the data. It uses machine learning techniques, such as embedding models, to represent both the query and the text in a shared vector space, enabling the system to identify content that is contextually similar, even if the wording differs.

Semantic Search is harder to set up and support, but it handles natural-language queries far more gracefully. It works well for unstructured data repositories.
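As a rough sketch, embedding-based semantic search might look like this, assuming the sentence-transformers library; the model name and documents are illustrative:

```python
# Semantic search sketch using an embedding model.
# Assumes `pip install sentence-transformers`; the model name is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Employees accrue paid time off each pay period.",
    "The deployment pipeline runs integration tests before release.",
    "Health insurance enrollment opens every November.",
]
# Encode documents and query into the same vector space.
doc_vectors = model.encode(documents, normalize_embeddings=True)
query_vector = model.encode(["How much vacation do I earn?"],
                            normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals cosine similarity.
scores = doc_vectors @ query_vector
best = int(np.argmax(scores))
print(documents[best])  # matches on meaning, not on shared keywords
```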

Vector Search

Vector Search is a method of information retrieval that uses mathematical representations (vectors) to capture the meaning and context of data, such as text, images, or other content. Each piece of data is converted into a vector using embedding models, creating a position in a multi-dimensional space. Queries are also converted into vectors, and the system finds the most relevant matches by measuring their proximity (like cosine similarity) to the query vector. This allows for more nuanced and context-aware retrieval compared to traditional keyword-based search.

Vector Search is best used when the goal is to find results based on meaning rather than exact wording. For example, it excels in retrieving contextually similar documents, answering open-ended questions, or surfacing related content in multimedia datasets.
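As one possible sketch, here is a tiny nearest-neighbor lookup using FAISS; in practice the vectors would come from an embedding model, and the index might live in a managed vector database such as those offered by the major cloud platforms:

```python
# Vector Search sketch: nearest-neighbor retrieval over embedding vectors.
# Assumes `pip install faiss-cpu numpy`. The toy 4-dimensional vectors stand in
# for real embeddings, which typically have hundreds or thousands of dimensions.
import numpy as np
import faiss

dim = 4
doc_vectors = np.array([
    [0.1, 0.9, 0.0, 0.2],
    [0.8, 0.1, 0.3, 0.0],
    [0.0, 0.2, 0.9, 0.4],
], dtype="float32")
faiss.normalize_L2(doc_vectors)      # normalize so inner product = cosine similarity

index = faiss.IndexFlatIP(dim)       # exact inner-product (cosine) index
index.add(doc_vectors)

query = np.array([[0.1, 0.8, 0.1, 0.1]], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)  # top-2 closest document vectors
print(ids[0], scores[0])
```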

The Bottom Line

All of these solutions require effort to establish, care for, and feed. However, they typically demand less time and fewer resources compared to implementing a full-scale LLM. Additionally, businesses can start small by creating a custom mechanism to index relevant enterprise data, which can then be programmatically retrieved and used to enhance AI chatbot prompts. This approach allows for gradual adoption while still leveraging AI effectively.

RAG Is Taking Hold

It does not take much imagination to see how RAG can make LLMs more precise and efficient to apply to countless domains. Major computing platforms such as Azure and AWS supply cloud-based search utilities that support the RAG practice. ThoughtWorks is another company worth mentioning; check out their reasoning on why RAG should be high on your adoption list.

In the software development space, RAG solutions are being used to enhance productivity and streamline processes. By indexing and searching the entire codebase of a project, these solutions provide relevant, context-aware results from an LLM. Instead of simply suggesting code snippets, RAG-based tools can analyze and generate entire use case implementations across multiple programming languages. The result is a significant boost in productivity, enabling software teams to work faster and more efficiently.

At Keyhole Software, we’ve been experimenting with RAG to understand its full potential. If you’re interested in exploring how RAG can drive efficiency, automation, and cost savings for your organization, give us a call—we’re here to help you implement it.
