Arash Heidarian

Optimizing LLMOps: Crafting the Right Architecture and RAG Strategy

Updated: Oct 15


Introduction

New "Ops" are emerging like mushrooms these days! And that's not a bad thing—it’s a clear sign that more processes are moving toward operationalization and automation phases. I’ve previously published a few posts about what MLOps is. Now, with the increasing demand and attention on Large Language Models (LLMs), a new space in the world of Ops has opened up, called LLMOps. But what exactly is LLMOps? Is it just a fancy name for MLOps that deals with operationalizing LLMs? Well, the answer is, “it depends.” LLMOps can be designed in various ways, depending on your business objectives.

We’ll explore this further in future blog posts. For now, let’s take a general look at what LLMOps is! In short, it’s about personalizing and then productionizing an LLM so that it can retrieve the most relevant information from your specific domain or knowledge base. In this blog, I’ll explore what LLMOps is and how to design an optimal architecture that aligns seamlessly with your business needs.


LLM Hallucination

LLMs like ChatGPT, LLaMA, and BERT have been trained on vast amounts of data and are excellent tools for tasks such as summarization, interpretation, and extracting the information you need. However, things get tricky when the model lacks knowledge about the specific topic you're asking about. In these cases, it may start to "hallucinate" and generate random information that sounds quite convincing! What’s even more concerning is when LLMs fail to acknowledge their knowledge gaps and continue producing irrelevant or inaccurate responses.

For example, in Figure 1, I’ve included a screenshot of a conversation I had with ChatGPT, where I asked about my document similarity measure, TS-SS. Despite it being cited about 60 times in academic papers on Google Scholar and discussed in various blog posts, ChatGPT still couldn’t provide the correct information.



Figure 1. Hallucinated ChatGPT

LLMs can hallucinate so severely that at times, it feels like they’ve been teleported to an entirely imaginary world. Naturally, if you don’t provide your own data to an LLM, it won’t have the correct information—and it may even mislead you. Let’s look at a simple use case to illustrate this.

Imagine a health insurance company where employees need to use an LLM to assess customer claims.

Obviously, you need to provide the company's policy documents to the LLM so it can respond with relevant information. The problem is that the model can still draw on information that is not linked to the provided documents, simply because it could not find the most relevant passage in them. This can also happen because of a poorly phrased query, whether it comes from customer service, the person who filled in the claim form, or the employee assessing the claim. Let's explore how this concern can be addressed so we can avoid hallucinated judgments by the LLM.


Simple Fine-tuning

We obviously need to provide the health insurance policy documents to the LLM we are using so that the model can read through them and extract the relevant information, so you decide to submit all of the documents. You may also need to provide some lengthy appendix documents (or books), which are mainly scientific and medical texts. It sounds like a perfect solution, doesn't it? It is, as long as you don't mind the cost! LLM providers charge you per token for both input and output, plus additional fees for API calls. The cost will skyrocket when you provide 100 gigabytes of documents, not to mention the expense of every prompt and API call!


Figure 2. Providing all documents to an LLM can be pricey, and ingesting the data and processing queries may take longer than expected.


Retrieval-Augmented Generation (RAG)

RAG!!! Here we go! Another funky name! Many of us have probably been using it for a while but have never assigned it a name. It is simply a natural language processing (NLP) technique that combines two key components: document retrieval and generation. This technique is designed to improve the accuracy and relevance of responses generated by LLMs while reducing costs by minimizing the amount of data passed to them. But how does it work? Let’s explore the two key components:

  1. Retrieval:

    This component involves extracting relevant information from a large external corpus or database. When a query or prompt is provided, the retrieval system searches the corpus to find the most pertinent documents or text passages. The retrieval can be based on various techniques, such as TF-IDF or dense embeddings produced by models like Word2Vec, BERT, or similar algorithms.

  2. Generation:

    After retrieving the relevant documents, the generation component—usually a large language model like GPT-3, GPT-4, or similar—uses this information to generate a response. The retrieved documents provide context and factual grounding, allowing the generation model to produce more accurate and contextually relevant text.

The retrieval process can actually be accomplished using free, off-the-shelf Python packages. Traditional NLP algorithms and techniques for information retrieval play a crucial role at this stage. They extract the most relevant documents from vast archives, which are then passed to the LLM for summarization. As a result, instead of submitting all documents to the LLM, which can be costly, the system selectively chooses and transmits a limited number of the most relevant documents along with the prompt.
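As a rough illustration, here is a minimal retrieval sketch using scikit-learn's TF-IDF vectorizer and cosine similarity. The documents and query are made-up placeholders, and a real system would add preprocessing, chunking, and a proper document store.

```python
# Minimal TF-IDF retrieval sketch (hypothetical documents and query).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Policy A: coverage for cancer treatments and chemotherapy.",
    "Policy B: dental and optical coverage under the Saver Package.",
    "Policy C: pre-existing conditions and exclusions for oncology.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)         # one TF-IDF vector per document

query = "Which policies cover blood cancer?"
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors)[0]  # similarity of the query to each document
top_n = scores.argsort()[::-1][:2]                        # indices of the two most relevant documents

for i in top_n:
    print(f"score={scores[i]:.3f}  {documents[i]}")
```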


Figure 3. Retrieval Augmented Generation (RAG)

Here’s a step-by-step explanation of how a RAG (Retrieval-Augmented Generation) system works, as illustrated in Figure 3:

  1. Input Query: The user provides a query or prompt.

  2. Document Retrieval: The system retrieves a set of relevant documents or text snippets from an external knowledge base or corpus. This retrieval can be performed using various methods, such as TF-IDF, BM25, or dense vector-based techniques like those produced by BERT or other embedding models.

  3. Document Scoring and Selection: The retrieved documents are ranked based on their relevance to the query. A subset of the highest-scoring documents is selected to provide context for the generation step.

  4. Response Generation: The selected documents are passed to a language model, which generates a response. The language model uses the context provided by the retrieved documents to create a more informed and accurate answer.

RAG not only helps save money but also improves accuracy. LLMs generate more appropriate and precise results when the provided domain knowledge is limited to relevant documents.
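To make the flow above concrete, here is a minimal sketch of the generation side: the retrieved chunks are folded into a grounded prompt before being sent to the LLM. The chunk texts are made up, and the final API call is left as a commented placeholder because it depends on whichever vendor SDK you use.

```python
# Sketch of the generation step: build a grounded prompt from the retrieved chunks.
def build_rag_prompt(query: str, context_chunks: list[str]) -> str:
    """Combine the top-ranked chunks and the user query into a single prompt."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

retrieved = [
    "Policy A covers chemotherapy and radiotherapy for blood cancers.",
    "Policy C excludes pre-existing oncology conditions not declared at sign-up.",
]
prompt = build_rag_prompt("Which policies cover blood cancer?", retrieved)
print(prompt)
# response = llm_client.generate(prompt)  # hypothetical call to your LLM provider's SDK
```

Instructing the model to answer only from the supplied context is what keeps the generation step grounded and reduces the chance of hallucination.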


Information Retrieval (IR) & Document Selection

Let's dive deeper into the information retrieval (IR) process of RAG. It is the most important and most time-consuming part of RAG.

Figure 4. Information retrieval and document selection are the heart of the RAG process

Contrary to common belief, there is no one-size-fits-all solution for designing an effective RAG system, as the information retrieval (IR) process must always be tailored to the specific business and its objectives. I will cover different strategies needed for various business and scenario contexts. Before that, let’s take a brief overview of the overall process and see how it evolves.


Figure 5. Information retrieval and document selection process

Figure 5 outlines the typical stages involved in transforming documents into a format suitable for search and retrieval, which is a core component of RAG. If you have experience in NLP, you may choose to skip the descriptions below. However, if this is a new area for you, here is a brief introduction to what each of the blocks in Figure 5 represents. I plan to dedicate separate blog posts to each block, as each one is a comprehensive area within NLP in its own right. Understanding these components in detail is crucial, because designing a RAG system is not a one-size-fits-all process; it needs to be tailored to each specific business context.


Doc Preprocessing

This is the initial phase of working with any textual data. The purpose is to extract raw text from documents and prepare it for further processing by cleaning and standardizing it through the steps below (a minimal sketch follows the list). Note that not every step must be applied to every type of NLP work; sometimes it is necessary to skip some of them:

  1. Language Detection: Identify the language of the document to apply language-specific preprocessing.

  2. Tokenization/Segmentation: Break down the text into smaller units, such as words, sentences, paragraphs, or even chapters.

  3. Lowercasing: Convert all text to lowercase to maintain uniformity.

  4. Remove Punctuation: Strip out punctuation marks, which often do not add meaning in many NLP tasks. However, it may be necessary to retain punctuation for further chunking or to capture the semantic meaning of sentences. Some libraries actually require them!

  5. Removing Stopwords: Eliminate common words (like "and," "the") that are not useful in understanding the core content.

  6. Stemming/Lemmatization: Reduce words to their base or root form.

  7. Removing Extra Whitespaces: Clean up any unnecessary spaces, tabs, or line breaks.

  8. Remove Tags and Hrefs/Links: Remove HTML tags and hyperlinks if the text was sourced from web pages.
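Below is the minimal preprocessing sketch promised above, using only the Python standard library. The stopword list is a tiny illustrative sample (a real pipeline would use a full list from NLTK or spaCy), and stemming/lemmatization is omitted for brevity.

```python
# Library-free preprocessing sketch: tag/link removal, lowercasing, crude
# tokenization, and stopword removal. Stemming/lemmatization is omitted here.
import re

SAMPLE_STOPWORDS = {"the", "and", "a", "an", "of", "for", "to", "is", "are"}  # illustrative only

def preprocess(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)    # remove links
    text = text.lower()                          # lowercase
    tokens = re.findall(r"[a-z]+", text)         # crude tokenization; drops punctuation and numbers
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]

print(preprocess("The <b>Saver Package</b> covers post-surgery medications: https://example.com"))
# ['saver', 'package', 'covers', 'post', 'surgery', 'medications']
```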


Doc Chunking

Break down the documents into smaller, more manageable pieces or "chunks" to optimize for processing and retrieval. This step can be tricky and requires careful consideration. Later, I will demonstrate how to use a guide map I have designed to find the right strategy for chunking. Many argue that the accuracy of RAG can be quite volatile. The reason for this is that any RAG strategy found on the internet may not be suitable for your specific business needs. Chunking should be tailored to the business context. Typical approaches include:

  • Keep Original (No Slicing): Maintain the entire document as one chunk if slicing is unnecessary.

  • Big Chunks: Divide the document into large sections, which might represent paragraphs or distinct sections.

  • Small-Medium Chunks: Further divide the document into smaller parts, such as individual sentences or groups of sentences.

I will explore the chunking strategy in more detail later.


Doc to Vector

Convert text chunks into numerical vectors that can be utilized in machine learning models, particularly for similarity searches. The most well-known and typical methods include the following (a brief sketch follows the list):


  • Statistical Methods: Techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) that represent text based on the frequency of words.

  • Neural Network-based Embedding: Advanced techniques like Word2Vec, GloVe, or BERT that convert words or phrases into dense vector representations, capturing their semantic meaning.
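As a quick sketch of the two families above: TF-IDF from scikit-learn for the statistical route, and a sentence-transformers model for dense embeddings. The model name shown is just one common choice; substitute whatever embedding model your stack provides.

```python
# Two ways to turn chunks into vectors: statistical (TF-IDF) and neural (dense embeddings).
chunks = [
    "Chemotherapy is covered under the Premium Package.",
    "Dental check-ups are covered twice a year.",
]

# Statistical: sparse, frequency-based vectors
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectors = TfidfVectorizer().fit_transform(chunks)   # shape: (n_chunks, vocabulary_size)

# Neural: dense vectors that capture semantic similarity
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")           # one common choice; any embedding model works
dense_vectors = model.encode(chunks)                      # shape: (n_chunks, embedding_dim)
```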


Search Process

The aim here is to find the most relevant and similar set of information to the query or prompt entered by the end user. Remember that the query or prompt can vary from a typical short sentence to a document (PDF or Word) consisting of multiple pages. The type of query or prompt is also a key factor in determining how the chunking (mentioned in the previous step) should be designed (I will explore this in more detail later).

Once the documents and the query/prompt are converted to vectors, similarity measures are employed to find the documents (represented as vectors) most similar to the query (also represented as a vector). The most popular techniques include the following (a brief sketch follows the list):

  • Similarity Measure: Calculate how similar the query vector is to the document vectors (e.g., cosine similarity).

  • Search Technique: The method used to search the vector space, such as nearest neighbor search, which retrieves the most similar documents.
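The sketch below shows the brute-force version of this search with toy vectors: score every chunk against the query with cosine similarity and keep the top n. At larger scale, an approximate nearest-neighbor index (e.g. FAISS or Annoy) typically replaces the exhaustive scan.

```python
# Brute-force similarity search over toy vectors. In practice, doc_vectors and
# query_vector come from the same vectorizer or embedding model as the chunks.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

doc_vectors = np.array([[0.1, 0.8, 0.0],
                        [0.7, 0.1, 0.2],
                        [0.2, 0.9, 0.1]])   # one row per chunk (toy values)
query_vector = np.array([[0.1, 0.9, 0.0]])  # toy query vector

scores = cosine_similarity(query_vector, doc_vectors)[0]
top_n = np.argsort(scores)[::-1][:2]        # indices of the two nearest chunks
print(top_n, scores[top_n])
```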


Document Chunking

Let's dive deeper into document chunking. Since this post focuses on choosing the right strategy for building a RAG system, we emphasize this step: it is the key step that determines how the entire framework will be built. It is crucial and requires careful consideration, as it directly influences the effectiveness of the RAG process. We need to break documents down into smaller, more manageable pieces, or "chunks", to optimize processing and retrieval. The accuracy of a RAG model can be highly sensitive to the chunking strategy employed, making this a pivotal aspect of the system's design.

Figure 6. Step 2, doc chunking, is an essential part of designing the IR process for a good RAG system.

It's important to understand that there is no one-size-fits-all approach to chunking; the strategy must be tailored to the specific needs and context of your business. Factors such as the nature of the content, the types of queries users will make, and the goals of the system all play a role in determining the most appropriate chunking method.

In Figure 7, I introduce a guide map I’ve designed to help you navigate the complexities of chunking and select the optimal strategy for your particular use case. This guide map will outline key considerations and provide a step-by-step approach to ensure that your chunking method aligns with your business objectives and the unique characteristics of your data.

Remember, the success of your RAG implementation hinges on getting this step right. A poorly designed chunking strategy can lead to inaccurate or irrelevant results, while a well-considered approach can significantly enhance the precision and reliability of your system.


Figure 7. Chunking strategy map

Use-Case:

Let's say you run a health insurance company and the aim is to use AI to process insurance claims faster. Employees will use the system to process claims and understand applications more quickly.


Query type

When we talk about a query or prompt, we are referring to what the user enters to ask the system to retrieve specific information.

If the employees who use the AI system limit their questions or queries to a few sentences, such as, “Show me all the policies related to blood cancer,” this is considered a short query. But what about a long query?

Now, let’s imagine that an employee wants to upload a complete application form and other related documents, then ask the system to find the most relevant policies for that application. In this case, we are dealing with long queries! The input can consist of several paragraphs or multiple pages.


Expected Output

Output is obviously what we expect to get from the IR system we design. Remember, the output at this stage is not exactly what we get from RAG as a whole. It is what we expect the IR step to produce, which will eventually be used as the prompt (context) for the LLM. So when we talk about expected output here, we mean: what do we expect to be passed to the LLM?

Long & strict output: This means we expect our IR system to find the most relevant (top n) documents in response to the query. Returning to our health insurance example, let’s say the employer just wants to find top n documents which cover most of policies related to cancer.


Long & comprehensive output: This is particularly useful when we want to find the most relevant documents or sub-documents spread across multiple sources.

For instance, consider an employee in our health insurance company example who is looking for policies related to blood cancer, types of treatments, pre-existing conditions, and any other policies that can be indirectly linked to cancer. Another example of this type of output in our scenario would be a situation where an employee is seeking all policies dedicated to a particular topic. For instance, a possible query could be: “What policies cover heart-related diseases?” Heart disease, as a common concern, can be associated with many topics, including but not limited to heart surgery, medications, care, and various treatments.

This output encompasses all top relevant (top n) sections, which may consist of multiple paragraphs or pages. A section can be defined as a paragraph, chapter, or sub-chapter (separated semantically or by the structure of the document). The policy on how to define sections heavily depends on the business content, which will be discussed in greater depth when we examine chunking strategies.


Short/medium and elite output: This is the most concise yet comprehensive type of output. It covers all the relevant (top n) sentences or paragraphs across the entire dataset (documents) by selecting the top relevant sentences, paragraphs, or segments from the dataset. Going back to our health insurance example, let’s say one of the employees is looking for a fairly generic topic, such as: “Give me the list of post-surgery medications covered under the Saver Package.” Health insurance packages include detailed policies for different types of surgeries and medications, and sometimes these policies are scattered across multiple sections. By choosing this type of output, the IR process provides a brief list of surgeries and relevant medications to the LLM.


Chunking Approach

By now, we should have clarified the expected types of input and output for the system we aim to build. By identifying these input and output types, we can determine the most appropriate chunking strategy for our use case. Now, let’s take a look at chunking strategies.

Original: This is the most straightforward and self-explanatory approach. In this case, we don’t need to apply any chunking; we’ll leave the documents as they are. Documents may vary from a few lines to hundreds of pages. Chunking or slicing is applied for the reasons discussed earlier regarding input and output types.

Big Chunks: Some may find the term "big" vague and unclear, especially when it comes to splitting textual data. When we refer to "big chunks," we aim to split each document into several sections. Each chunk can be defined as a paragraph, chapter, or sub-chapter (separated semantically or by the structure of the document). Consequently, each chunk may vary from a few paragraphs to multiple pages. Data scientists and AI architects should collaborate with experts who understand the nature of the documents so they can determine the criteria for building chunks in a way that ensures each chunk encompasses enough information in the context of the business goal. Each chunk should be comprehensive enough to sufficiently cover a particular topic of interest while remaining independent.

Small-Medium Chunks: This category encompasses small chunks (as small as one or a few sentences) to medium chunks (e.g., a paragraph). The splitters used for this approach are usually simple, such as end-of-line characters, dots, or white spaces. As explained in the earlier section on input and output, this is the most effective approach for selectively retrieving the most relevant information for a short query.
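As a rough sketch of the three strategies, the snippet below keeps the document whole, splits it into big chunks on blank lines, and splits it into small-medium chunks of a few sentences each. The splitting rules are deliberate simplifications; real documents usually need structure-aware or semantically informed splitting agreed with domain experts.

```python
# Sketch of the three chunking strategies with deliberately simple split rules.
import re

def chunk_original(doc: str) -> list[str]:
    return [doc]  # keep the whole document as a single chunk

def chunk_big(doc: str) -> list[str]:
    # Big chunks: split on blank lines as a stand-in for sections or chapters.
    return [block.strip() for block in doc.split("\n\n") if block.strip()]

def chunk_small_medium(doc: str, sentences_per_chunk: int = 3) -> list[str]:
    # Small-medium chunks: naive sentence split, then group a few sentences per chunk.
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

doc = ("Chemotherapy is covered. Radiotherapy is covered. Dental care is excluded.\n\n"
       "Pre-existing conditions must be declared at sign-up.")
print(len(chunk_original(doc)), len(chunk_big(doc)), len(chunk_small_medium(doc, 2)))  # 1 2 2
```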


Human judgment and effort

Clearly, none of these strategies are perfect, especially if the business needs to implement multiple approaches. Remember, these are merely strategies for designing a Retrieval-Augmented Generation (RAG) system, aimed at optimizing information retrieval and providing adequate details to LLMs. This will help us reduce costs and minimize the chance of hallucinations in LLM models. Ultimately, humans should—or, more accurately, must—be involved in decision-making.

Governments and legal authorities are increasingly imposing restrictions and legal frameworks around decision-making involving AI systems, primarily requiring human involvement. If you’re interested in these regulations, I recommend reading my post about the EU AI Act.

To this end, we should understand what level of involvement and effort is required from users to make final decisions. Figure 8 below provides some insight into the expected level of human involvement based on our input, output, and chunking strategies.

Figure 8. Required human judgment on the output.

The longer the output, the more human judgment is required. It takes more effort for the user to read lengthy outputs and arrive at a final decision. In contrast, short answers are like smoothies—abstract, easy to digest, and quick to consume. Consequently, it takes less time and effort to scrutinize and interpret a short output.


However, short answers may lack some background information, potentially leading to rushed decisions. Thus, they are more suitable for simple topics or for users who already possess sufficient experience and knowledge of the subject. For example, if you ask ChatGPT for a neural network code, it may only take a few seconds to copy and paste the code and even execute it, which saves a lot of time for data scientists. However, the user should have a deep understanding of how neural networks work, including hyperparameters, nodes, and hidden layers, so they can adjust the model to yield the desired results.


Longer answers are more appropriate when decision-making requires reviewing many dependent and interconnected topics. Ultimately, LLM models can assist the user with final suggestions, but human judgment and interpretation remain essential to ensure that the LLM's recommendations align with human reasoning.


Conclusion

In this exploration of Retrieval-Augmented Generation (RAG) systems, we’ve highlighted the essential elements involved in designing effective information retrieval (IR) systems that enhance the capabilities of large language models (LLMs).

Key takeaways include the importance of understanding the types of input and output expected from the system. Short queries may yield concise answers, while long queries necessitate comprehensive outputs that require greater human judgment. The chunking strategies—original, big chunks, and small-medium chunks—must be tailored to the specific needs of the business and the nature of the content.

Moreover, human involvement remains critical in the decision-making process. As AI technologies become more prevalent, integrating human oversight will ensure that outputs align with human reasoning and mitigate the risk of hasty conclusions.

By prioritizing these principles, organizations can effectively harness the power of RAG systems, improving both the accuracy and efficiency of their AI-driven solutions. As we move forward, the collaboration between AI technologies and human expertise will be vital for navigating the complexities of information retrieval in the digital age.
